Problems caused by enabling tcp_tw_recycle (Part 2)

Reposted from: http://zhangle.is-a-geek.org/2010/11/tcp_tw_recycle%E5%92%8Cnat/

RFC 1323, co-authored by Van Jacobson, contains the following passage:

An additional mechanism could be added to the TCP, a per-host cache of the last timestamp received from any connection. This value could then be used in the PAWS mechanism to reject old duplicate segments from earlier incarnations of the connection, if the timestamp clock can be guaranteed to have ticked at least once since the old connection was open. This would require that the TIME-WAIT delay plus the RTT together must be at least one tick of the sender's timestamp clock. Such an extension is not part of the proposal of this RFC.

Linux implements this mechanism, but it only takes effect when both TCP timestamps and tcp_tw_recycle are enabled. The implementation lives in the tcp_v4_conn_request function in net/ipv4/tcp_ipv4.c:

    if (tmp_opt.saw_tstamp &&
        tcp_death_row.sysctl_tw_recycle &&
        (dst = inet_csk_route_req(sk, req)) != NULL &&
        (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
        peer->v4daddr == saddr) {
            if (xtime.tv_sec < peer->tcp_ts_stamp + TCP_PAWS_MSL &&
                (s32)(peer->tcp_ts - req->ts_recent) >
                                            TCP_PAWS_WINDOW) {
                    NET_INC_STATS_BH(LINUX_MIB_PAWSPASSIVEREJECTED);
                    dst_release(dst);
                    goto drop_and_free;
            }
    }
    /* Kill the following clause, if you dislike this way. */
    else if (!sysctl_tcp_syncookies &&
             (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
              (sysctl_max_syn_backlog >> 2)) &&
             (!peer || !peer->tcp_ts_stamp) &&
             (!dst || !dst_metric(dst, RTAX_RTT))) {
            /* Without syncookies last quarter of
             * backlog is filled with destinations,
             * proven to be alive.
             * It means that we continue to communicate
             * to destinations, already remembered
             * to the moment of synflood.
             */
            LIMIT_NETDEBUG(KERN_DEBUG "TCP: drop open "
                           "request from %u.%u.%u.%u/%u\n",
                           NIPQUAD(saddr),
                           ntohs(skb->h.th->source));
            dst_release(dst);
            goto drop_and_free;
    }

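This mechanism relies on the timestamps seen from a given client address increasing monotonically. If the server sits behind a load balancer that performs NAT without rewriting the packets' timestamps, SYN packets from some clients may be dropped and their connection requests will time out. The reason is that the timestamp value is derived from the source machine's jiffies, and different machines almost never boot at exactly the same moment. From the server's point of view all of those clients share one source address, so the per-host cache ends up comparing timestamps that came from different machines. Besides the client-side timeouts, you can also watch the counter on the "passive connections rejected by timestamp" line of netstat -s grow on the server.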

So machines behind NAT should not enable tcp_tw_recycle.

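There was one more twist along the way. Look at the expression below: both operands are unsigned 32-bit integers, so the subtraction can underflow. When the first operand is smaller than the second by more than 2^31, the result comes out positive. That is exactly the case I hit during my analysis, and I very nearly could not make my explanation add up, which was rather embarrassing...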

(s32)(peer->tcp_ts - req->ts_recent)

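I wrote a patch to eliminate this case, but with my approach wrap-around is no longer handled correctly, and the code was written that way in the first place precisely so that wrap-around would be handled correctly. So apart from adding a warning, there is probably not much else that can be done.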

http://lkml.org/lkml/2010/11/14/24