3 unusual cases involving the Linux networking subsystem

This article presents three short stories from our own practice; they happened at different times and in different projects. What unites them is that they all involve the Linux networking subsystem (Reverse Path Filter, TIME_WAIT, multicast) and illustrate how deeply you often have to dig into an incident you are seeing for the first time in order to solve it, and, of course, how much joy the resulting solution can bring.

Story one: Reverse Path Filter

A client with a large corporate network decided to pass part of its Internet traffic through a single corporate firewall located behind the central office router. Using iproute2, Internet-bound traffic was directed to the central office, where several routing tables were already configured. We added one more routing table, configured routes in it that redirect traffic to the firewall, enabled forwarding of traffic from the other branches and... the traffic did not flow.


Diagram of traffic flow through Netfilter tables and chains

We started to figure out why the configured routing did not work. Traffic was visible on the router's incoming tunnel interface:
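
For example (tap0 is the tunnel interface that appears later in this story; the exact capture filter is illustrative):

    # watch packets arriving on the incoming tunnel interface
    tcpdump -ni tap0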

However, there were no packets on the outgoing interface. It was clear that they were being filtered on the router, yet no explicit packet-dropping rules were configured in iptables. So, following the path the traffic takes, we started adding rules that drop our packets and checking the counters after each one:
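
The idea looks roughly like this (the interface-only match is deliberately broad; in practice the rule should match just the test traffic):

    # temporarily drop the traffic in a given table/chain and watch the rule counter;
    # repeat chain by chain along the packet path (nat PREROUTING, mangle PREROUTING, mangle FORWARD, ...)
    iptables -t mangle -I PREROUTING 1 -i tap0 -j DROP
    iptables -t mangle -nvL PREROUTING | head -3
    # remove the test rule afterwards
    iptables -t mangle -D PREROUTING 1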

We checked nat PREROUTING and then mangle PREROUTING. In mangle FORWARD the counter did not grow, which meant the packets were being lost at the routing stage. After checking the routes and rules once again, we started studying what exactly happens at that stage.

In the Linux kernel, the Reverse Path Filtering option (rp_filter) is enabled by default for every interface. When you use complex, asymmetric routing and a reply packet returns to the source over a different route than the one the request packet arrived on, Linux filters such traffic out. To solve this, Reverse Path Filtering has to be disabled for all network devices involved in such routing. Below is a simple and quick way to do it for all of your network devices:
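
One common way to do it (rp_filter accepts 0 = off, 1 = strict, 2 = loose):

    # turn Reverse Path Filtering off for every interface, including the "all" and "default" entries
    for f in /proc/sys/net/ipv4/conf/*/rp_filter; do
        echo 0 > "$f"
    done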

Coming back to the case: we solved the problem by disabling the Reverse Path Filter for the tap0 interface, and we now consider it good practice on routers to disable rp_filter for every device involved in asymmetric routing.

Story two: TIME_WAIT

An unusual problem came up in a high-load web project we maintain: between 1 and 3 percent of users could not access the site. While investigating, we found that the inaccessibility did not correlate with the load on any system resource (disk, memory, network, etc.) and did not depend on the user's location or ISP. The only thing all affected users had in common was that they accessed the Internet through NAT.

The TIME_WAIT state in the TCP protocol allows the system to make sure that data transfer on a given TCP connection has really stopped and that no data has been lost. But the number of sockets that can be open simultaneously is finite, which means it is a resource, and part of it is spent on the TIME_WAIT state, in which no client is actually being served.


How a TCP connection is closed

The answer, as expected, was found in the kernel documentation. The natural desire of a highload system administrator is to reduce this "idle" resource consumption. A quick web search turns up plenty of advice urging you to enable the Linux kernel options tcp_tw_reuse and tcp_tw_recycle. But with tcp_tw_recycle things are not as simple as they might seem.

Let's take a closer look at these parameters:

The tcp_tw_reuse parameter is worth enabling in the fight for the resources eaten by TIME_WAIT. A TCP connection is identified by the tuple IP1_Port1_IP2_Port2. When a socket enters the TIME_WAIT state and tcp_tw_reuse is disabled, a new outgoing connection is established with a new local IP1_Port1 pair; the old values can only be reused once the TCP connection reaches the CLOSED state. If your server creates many outgoing connections, set tcp_tw_reuse = 1 and your system will be able to reuse TIME_WAIT ports when it runs out of free ones. To enable it, add the following to /etc/sysctl.conf:
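
    # allow reusing sockets in TIME_WAIT for new outgoing connections
    net.ipv4.tcp_tw_reuse = 1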

And run the command:
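
    # re-read /etc/sysctl.conf and apply the settings
    sysctl -p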

The tcp_tw_recycle parameter is meant to shorten the time a socket spends in the TIME_WAIT state. By default this time equals 2*MSL (Maximum Segment Lifetime), and RFC 793 recommends setting the MSL to 2 minutes. Enabling tcp_tw_recycle tells the Linux kernel not to use a constant as the MSL but to derive it from the characteristics of your particular network. As a rule (unless you are on dial-up), enabling tcp_tw_recycle noticeably shortens the time a connection spends in TIME_WAIT. But there is a pitfall: once a connection has entered the TIME_WAIT state, with tcp_tw_recycle enabled your network stack will drop packets coming from the IP address of the other party to that connection. This can cause a range of availability problems for clients working from behind NAT, which is exactly what we ran into in the case described above. The problem is extremely hard to diagnose and has no simple reproduction procedure, so we recommend being very careful with tcp_tw_recycle. If you still decide to enable it, add one line to /etc/sysctl.conf (and don't forget to run sysctl -p):
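
    # shortens TIME_WAIT aggressively; see the warning above about clients behind NAT (removed in Linux 4.12)
    net.ipv4.tcp_tw_recycle = 1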

Story three: OSPF and multicast traffic

The corporate network we maintained was built on tinc VPN, with adjoining spokes of IPsec and OpenVPN connections. To route this entire L3 address space we used OSPF. On one of the nodes where a large number of channels were aggregated, we noticed that a small fraction of the networks, despite a correct OSPF configuration, periodically disappeared from the routing table on that node.


Simplified layout of the VPN network used in this project

First of all, we checked connectivity to the routers of the problem networks. It was stable:
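
A check along these lines (172.24.0.1 is the problem-side router mentioned below):

    ping -c 4 172.24.0.1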

After diagnosing OSPF we were even more surprised. On the node where the problem was observed, the routers of the problem networks were missing from the neighbor list. On the other side, the problem router was present in the neighbor list:
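
Assuming a Quagga or FRR based OSPF daemon (the sketch below is illustrative, not the exact command we ran), the neighbor list can be checked like this:

    # show current OSPF neighbors
    vtysh -c 'show ip ospf neighbor'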

The next step was to rule out problems with the delivery of OSPF hello packets from 172.24.0.1. Hellos from it were arriving, but replies were not going out:
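
OSPF packets are IP protocol 89 and hellos are sent to the multicast group 224.0.0.5, so they can be watched with something like this (the interface name is illustrative):

    # capture OSPF packets on the tunnel interface
    tcpdump -ni tap0 ip proto 89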

No restrictions were configured in iptables, and we established that the packet is dropped after it has already passed through all the Netfilter tables. Digging into the documentation again, we came across the igmp_max_memberships kernel parameter, which limits the number of multicast groups a single socket may join. By default it is 20. We raised it to 42, for the sake of a "round" number, and OSPF went back to normal:
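
For example:

    # raise the per-socket limit on multicast group memberships
    sysctl -w net.ipv4.igmp_max_memberships=42
    # and persist it
    echo 'net.ipv4.igmp_max_memberships = 42' >> /etc/sysctl.conf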

Conclusion

However complex a problem may be, it can always be solved, and quite often by studying the documentation. I will be glad to see descriptions of your own experience of hunting down complex and unusual problems in the comments.

Source

Coping with the TCP TIME-WAIT state on busy Linux servers

Vincent Bernat February 24, 2014

Do not enable net.ipv4.tcp_tw_recycle; it doesn't even exist anymore since Linux 4.12.

The Linux kernel documentation is not very helpful about what net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse do. This lack of documentation opens the path to numerous tuning guides advising to set both these settings to 1 to reduce the number of entries in the TIME-WAIT state. However, as stated by the tcp(7) manual page, the net.ipv4.tcp_tw_recycle option is quite problematic for public-facing servers as it won’t handle connections from two different computers behind the same NAT device, which is a problem hard to detect and waiting to bite you:

Enable fast recycling of TIME-WAIT sockets. Enabling this option is not recommended since this causes problems when working with NAT (Network Address Translation).

I will provide here a more detailed explanation on how to properly handle the TIME-WAIT state. Also, keep in mind we are looking at the TCP stack of Linux. This is completely unrelated to Netfilter connection tracking which may be tweaked in other ways. 1

About the TIME-WAIT state

Let’s rewind a bit and have a close look at this TIME-WAIT state. What is it? See the TCP state diagram below: 2

Only the end closing the connection first will reach the TIME-WAIT state. The other end will follow a path which usually permits it to quickly get rid of the connection.

You can have a look at the current state of connections with ss -tan:
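
For a quick per-state summary, something along these lines also works:

    # count sockets in each TCP state
    ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c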

Purpose

There are two purposes for the TIME-WAIT state:

  • The most known one is to prevent delayed segments from one connection being accepted by a later connection relying on the same quadruplet (source address, source port, destination address, destination port). The sequence number also needs to be in a certain range to be accepted. This narrows a bit the problem but it still exists, especially on fast connections with large receive windows. RFC 1337 explains in details what happens when the TIME-WAIT state is deficient. 3 Here is an example of what could be avoided if the TIME-WAIT state wasn’t shortened:

  • The other purpose is to ensure the remote end has closed the connection. When the last ACK is lost, the remote end stays in the LAST-ACK state. 4 Without the TIME-WAIT state, a connection could be reopened while the remote end still thinks the previous connection is valid. When it receives a SYN segment (and the sequence number matches), it will answer with a RST as it is not expecting such a segment. The new connection will be aborted with an error:

RFC 793 requires the TIME-WAIT state to last twice the MSL. On Linux, this duration is not tunable: it is defined in include/net/tcp.h as one minute (TCP_TIMEWAIT_LEN). There have been propositions to turn this into a tunable value but it has been refused on the ground the TIME-WAIT state is a good thing.

Problems

Now, let’s see why this state can be annoying on a server handling a lot of connections. There are three aspects of the problem:

  • the slot taken in the connection table preventing new connections of the same kind,
  • the memory occupied by the socket structure in the kernel, and
  • the additional CPU usage.

The result of ss -tan state time-wait | wc -l is not a problem per se!

Connection table slot

A connection in the TIME-WAIT state is kept for one minute in the connection table. This means another connection with the same quadruplet (source address, source port, destination address, destination port) cannot exist.

For a web server, the destination address and the destination port are likely to be constant. If your web server is behind an L7 load-balancer, the source address will also be constant. On Linux, the client port is by default allocated in a port range of about 30,000 ports (this can be changed by tuning net.ipv4.ip_local_port_range ). This means that only 30,000 connections can be established between the web server and the load-balancer every minute, so about 500 connections per second.

If the TIME-WAIT sockets are on the client side, such a situation is easy to detect. The call to connect() will return EADDRNOTAVAIL and the application will log some error message about that. On the server side, this is more complex as there is no log and no counter to rely on. In doubt, you should just try to come up with something sensible to list the number of used quadruplets:
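
One possible way to do that, assuming the service listens on port 80, is to count the distinct local/peer address and port pairs reported by ss:

    # count distinct (local address:port, peer address:port) pairs involving local port 80
    ss -tan 'sport = :80' | awk 'NR>1 {print $4, $5}' | sort -u | wc -l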

The solution is more quadruplets. 5 This can be done in several ways (in the order of difficulty to setup):

  • use more client ports by setting net.ipv4.ip_local_port_range to a wider range (see the sketch after this list),
  • use more server ports by asking the web server to listen to several additional ports (81, 82, 83, …),
  • use more client IP by configuring additional IP on the load balancer and use them in a round-robin fashion, 6
  • use more server IP by configuring additional IP on the web server. 7
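
As a minimal sketch of the first option (the exact bounds are a matter of local policy), the ephemeral port range can be widened like this:

    # widen the range of local (client) ports available for outgoing connections
    sysctl -w net.ipv4.ip_local_port_range="1025 65535"
    # persist across reboots
    echo 'net.ipv4.ip_local_port_range = 1025 65535' >> /etc/sysctl.conf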

A last solution is to tweak net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle . Don’t do that yet, we will cover these settings later.

Memory

With many connections to handle, leaving a socket open for one additional minute may cost your server some memory. For example, if you want to handle about 10,000 new connections per second, you will have about 600,000 sockets in the TIME-WAIT state. How much memory does it represent? Not that much!

First, from the application point of view, a TIME-WAIT socket does not consume any memory: the socket has been closed. In the kernel, a TIME-WAIT socket is present in three structures (for three different purposes):

A hash table of connections, named the “TCP established hash table” (despite containing connections in other states) is used to locate an existing connection, for example when receiving a new segment.

Each bucket of this hash table contains both a list of connections in the TIME-WAIT state and a list of regular active connections. The size of the hash table depends on the system memory and is printed at boot:
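
The exact wording varies between kernel versions, but the message can usually be found with:

    # the sizes of the connection and bind hash tables are logged at boot
    dmesg | grep -E 'TCP (established|bind) hash table'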

It is possible to override it by specifying the number of entries on the kernel command line with the thash_entries parameter.

Each element of the list of connections in the TIME-WAIT state is a struct tcp_timewait_sock , while the type for other states is struct tcp_sock : 8

A set of lists of connections, called the “death row”, is used to expire the connections in the TIME-WAIT state. They are ordered by how much time left before expiration.

It uses the same memory space as for the entries in the hash table of connections. This is the struct hlist_node tw_death_node member of struct inet_timewait_sock . 9

A hash table of bound ports, holding the locally bound ports and the associated parameters, is used to determine if it is safe to listen to a given port or to find a free port in the case of dynamic bind. The size of this hash table is the same as the size of the hash table of connections:

Each element is a struct inet_bind_socket . There is one element for each locally bound port. A TIME-WAIT connection to a web server is locally bound to the port 80 and shares the same entry as its sibling TIME-WAIT connections. On the other hand, a connection to a remote service is locally bound to some random port and does not share its entry.

We are only concerned by the space occupied by struct tcp_timewait_sock and struct inet_bind_socket . There is one struct tcp_timewait_sock for each connection in the TIME-WAIT state, inbound or outbound. There is one dedicated struct inet_bind_socket for each outbound connection and none for an inbound connection.

A struct tcp_timewait_sock is only 168 bytes while a struct inet_bind_socket is 48 bytes:

If you have about 40,000 inbound connections in the TIME-WAIT state, it should eat less than 10 MiB of memory. If you have about 40,000 outbound connections in the TIME-WAIT state, you need to account for 2.5 MiB of additional memory. Let’s check that by looking at the output of slabtop . Here is the result on a server with about 50,000 connections in the TIME-WAIT state, 45,000 of which are outbound connections:
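
The relevant slab caches can be inspected with something like this (cache names may differ slightly between kernels):

    # tw_sock_TCP holds TIME-WAIT sockets; tcp_bind_bucket holds locally bound ports
    slabtop -o | grep -E 'tw_sock_TCP|tcp_bind_bucket'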

There is nothing to change here: the memory used by TIME-WAIT connections is really small. If your server needs to handle thousands of new connections per second, you need far more memory to be able to efficiently push data to clients. The overhead of TIME-WAIT connections is negligible.

On the CPU side, searching for a free local port can be a bit expensive. The work is done by the inet_csk_get_port() function which uses a lock and iterates on locally bound ports until a free port is found. A large number of entries in this hash table is usually not a problem if you have a lot of outbound connections in the TIME-WAIT state (like ephemeral connections to a memcached server): the connections usually share the same profile, so the function will quickly find a free port as it iterates on them sequentially.

Other solutions

If you still think you have a problem with TIME-WAIT connections after reading the previous section, there are three additional ways to address it:

  • disable socket lingering,
  • net.ipv4.tcp_tw_reuse , and
  • net.ipv4.tcp_tw_recycle .

Socket lingering

When close() is called, any remaining data in the kernel buffers will be sent in the background and the socket will eventually transition to the TIME-WAIT state. The application can continue to work immediately and assume that all data will eventually be safely delivered.

However, an application can choose to disable this behavior, known as socket lingering. There are two flavors:

In the first one, any remaining data will be discarded and instead of closing the connection with the normal four-packet connection termination sequence, the connection will be closed with a RST (and therefore, the peer will detect an error) and will be immediately destroyed. No TIME-WAIT state in this case.

With the second flavor, if there is any data still remaining in the socket send buffer, the process will sleep when calling close() until either all the data is sent and acknowledged by the peer or the configured linger timer expires. A process can avoid sleeping by setting the socket as non-blocking; in this case, the same process happens in the background. It permits the remaining data to be sent during a configured timeout: if the data is successfully sent, the normal close sequence is run and you get a TIME-WAIT state; otherwise, the connection is closed with a RST and the remaining data is discarded.

In both cases, disabling socket lingering is not a one-size-fits-all solution. It may be used by some applications like HAProxy or Nginx when it is safe to use from the upper protocol point of view. There are good reasons to not disable it unconditionally.

net.ipv4.tcp_tw_reuse

The TIME-WAIT state prevents delayed segments from being accepted in an unrelated connection. However, under certain conditions, it is possible to assume that a new connection's segment cannot be misinterpreted as an old connection's segment.

RFC 1323 presents a set of TCP extensions to improve performance over high-bandwidth paths. Among other things, it defines a new TCP option carrying two four-byte timestamp fields. The first one is the current value of the timestamp clock of the TCP sending the option while the second one is the most recent timestamp received from the remote host.

By enabling net.ipv4.tcp_tw_reuse , Linux will reuse an existing connection in the TIME-WAIT state for a new outgoing connection if the new timestamp is strictly bigger than the most recent timestamp recorded for the previous connection: an outgoing connection in the TIME-WAIT state can be reused after just one second.

How is it safe? The first purpose of the TIME-WAIT state was to avoid duplicate segments being accepted in an unrelated connection. Thanks to the use of timestamps, such duplicate segments will come with an outdated timestamp and therefore be discarded.

The second purpose was to ensure the remote end is not in the LAST-ACK state because of the loss of the last ACK. The remote end will retransmit the FIN segment until:

  1. it gives up (and tears down the connection), or
  2. it receives the ACK it is waiting for (and tears down the connection), or
  3. it receives a RST (and tears down the connection).

If the FIN segment is received in a timely manner, the local end socket will still be in the TIME-WAIT state and the expected ACK segment will be sent.

Once a new connection replaces the TIME-WAIT entry, the SYN segment of the new connection is ignored (thanks to the timestamps) and won’t be answered by a RST but only by a retransmission of the FIN segment. The FIN segment will then be answered with a RST (because the local connection is in the SYN-SENT state) which will allow the transition out of the LAST-ACK state. The initial SYN segment will eventually be resent (after one second) because there was no answer and the connection will be established without apparent error, except a slight delay:

It should be noted that when a connection is reused, the TWRecycled counter is increased (despite its name).

net.ipv4.tcp_tw_recycle

This mechanism also relies on the timestamp option but affects both incoming and outgoing connections which is handy when the server usually closes the connection first. 10

To keep the same guarantees the TIME-WAIT state was providing, while reducing the expiration timer, when a connection enters the TIME-WAIT state, the latest timestamp is remembered in a dedicated structure containing various metrics for previous known destinations. Then, Linux will drop any segment from the remote host whose timestamp is not strictly bigger than the latest recorded timestamp, unless the TIME-WAIT state would have expired:

Starting from Linux 4.10 (commit 95a22caee396), Linux will randomize timestamp offsets for each connection, making this option completely broken, with or without NAT. It has been completely removed in Linux 4.12.

The LAST-ACK state is handled in the exact same way as for net.ipv4.tcp_tw_reuse.

Summary

The universal solution is to increase the number of possible quadruplets by using, for example, more server ports. This will allow you to not exhaust the possible connections with TIME-WAIT entries.

On the server side, do not enable net.ipv4.tcp_tw_recycle unless you are pretty sure you will never have NAT devices in the mix. Enabling net.ipv4.tcp_tw_reuse is useless for incoming connections.

On the client side, enabling net.ipv4.tcp_tw_reuse is another almost-safe solution. Enabling net.ipv4.tcp_tw_recycle in addition to net.ipv4.tcp_tw_reuse is mostly useless.

Moreover, when designing protocols, don’t let clients close first. Clients won’t have to deal with the TIME-WAIT state pushing the responsibility to servers which are better suited to handle this.

The TIME_WAIT state is our friend and is there to help us (i.e., to let old duplicate segments expire in the network). Instead of trying to avoid the state, we should understand it.

Notably, fiddling with net.netfilter.nf_conntrack_tcp_timeout_time_wait won’t change anything on how the TCP stack will handle the TIME-WAIT state. ↩︎

This diagram is licensed under the LaTeX Project Public License 1.3. The original file is available on this page. ↩︎

The first workaround proposed in RFC 1337 is to ignore RST segments in the TIME-WAIT state. This behavior is controlled by net.ipv4.rfc1337 which is not enabled by default on Linux because this is not a complete solution to the problem described in the RFC. ↩︎

While in the LAST-ACK state, a connection will retransmit the last FIN segment until it gets the expected ACK segment. Therefore, it is unlikely we stay long in this state. ↩︎

On the client side, older kernels also have to find a free local tuple (source address and source port) for each outgoing connection. Increasing the number of server ports or IP won’t help in this case. Linux 3.2 is recent enough to be able to share the same local tuple for different destinations. Thanks to Willy Tarreau for his insight on this aspect. ↩︎

To avoid EADDRINUSE errors, the load balancer needs to use the SO_REUSEADDR option before calling bind() , then connect() . ↩︎

This last solution may seem a bit dumb since you could just use more ports but some servers are not able to be configured this way. The before last solution can also be quite cumbersome to setup, depending on the load-balancing software, but uses less IP than the last solution. ↩︎

The use of a dedicated memory structure for sockets in the TIME-WAIT is here since Linux 2.6.14. The struct sock_common structure is a bit more verbose and I won’t copy it here. ↩︎

Since Linux 4.1, the way TIME-WAIT sockets are tracked has been modified to increase performance and parallelism. The death row is now just a hash table. ↩︎

When the server closes the connection first, it gets the TIME-WAIT state while the client will consider the corresponding quadruplet free and hence may reuse it for a new connection. ↩︎

Source

