
TCP Receive Window Adjustment Algorithm (Part 1)

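The TCP header contains a 16-bit receive window field, which tells the peer how much data this end can currently accept. TCP flow control works mainly by adjusting the size of this receive window.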

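This article analyzes the TCP receive window adjustment algorithm, covering some background knowledge and how the initial receive window is chosen.

Kernel version: 3.2.12

Author: zhangskd @ csdn blog

Data structures

The data structures involved are listed below.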

struct tcp_sock {
    ...
    /* rcv_nxt at the time the last window update was sent, i.e. the left edge of the currently advertised receive window */
    u32 rcv_wup; /* rcv_nxt on last window update sent */
    u16 advmss; /* Advertised MSS: the largest MSS this end can receive, announced to the peer at connection setup */
    u32 rcv_ssthresh; /* Current window clamp: threshold for the current receive window size */
    u32 rcv_wnd; /* Current receiver window */
    u32 window_clamp; /* Maximum receive window; this value is also adjusted dynamically */
    ...
};
struct tcp_options_received {
    ...
        snd_wscale : 4, /* Window scaling received from sender: the peer's window scale factor */
        rcv_wscale : 4; /* Window scaling to send to receiver: the local window scale factor */
    u16 user_mss; /* mss requested by user in ioctl */
    u16 mss_clamp; /* Maximal mss, negotiated at connection setup: the peer's maximum MSS */
};
/**
 * struct sock - network layer representation of sockets
 * @sk_rcvbuf: size of receive buffer in bytes
 * @sk_receive_queue: incoming packets
 * @sk_write_queue: packet sending queue
 * @sk_sndbuf: size of send buffer in bytes
 */
struct sock {
    ...
    struct sk_buff_head sk_receive_queue;
    /* Total memory charged for the segments queued on sk_receive_queue */
#define sk_rmem_alloc sk_backlog.rmem_alloc

    int sk_rcvbuf; /* Upper limit on the receive buffer size */
    int sk_sndbuf; /* Upper limit on the send buffer size */

    struct sk_buff_head sk_write_queue;
    ...
};

struct sk_buff_head {
    /* These two members must be first. */
    struct sk_buff *next;
    struct sk_buff *prev;
    __u32 qlen;
    spinlock_t lock;
};
/**
 * inet_connection_sock - INET connection oriented sock
 * @icsk_ack: Delayed ACK control data
 */
struct inet_connection_sock {
    ...
    struct {
        ...
        /* Number of ACKs that may be sent immediately while in quick-ACK mode */
        __u8 quick; /* Scheduled number of quick acks */
        /* Peer's sending MSS, estimated from recently received segments */
        __u16 rcv_mss; /* MSS used for delayed ACK decisions */
    } icsk_ack;
    ...
};
struct tcphdr {
    __be16 source;
    __be16 dest;
    __be32 seq;
    __be32 ack_seq;

#if defined (__LITTLE_ENDIAN_BITFIELD)
    __u16 res1 : 4,
          doff : 4,
          fin : 1,
          syn : 1,
          rst : 1,
          psh : 1,
          ack : 1,
          urg : 1,
          ece : 1,
          cwr : 1;

#elif defined (__BIG_ENDIAN_BITFIELD)
    __u16 doff : 4,
          res1 : 4,
          cwr : 1,
          ece : 1,
          urg : 1,
          ack : 1,
          psh : 1,
          rst : 1,
          syn : 1,
          fin : 1;
#else
#error "Adjust your <asm/byteorder.h> defines"
#endif
    __be16 window; /* the advertised receive window lives here */
    __sum16 check;
    __be16 urg_ptr;
};
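To make the role of the window field concrete, here is a minimal user-space sketch. It assumes a raw TCP header held in a byte buffer; the helper names tcp_hdr_window and tcp_scaled_window are hypothetical and are not kernel functions. The field sits at byte offset 14 of the header in network byte order, and on an established connection the receiver of the segment shifts it left by the scale factor negotiated during the handshake.

```c
#include <stdint.h>

/* Hypothetical helper: read the 16-bit window field, which occupies
 * bytes 14-15 of the TCP header and is stored big-endian. */
static uint16_t tcp_hdr_window(const uint8_t *hdr)
{
    return (uint16_t)((hdr[14] << 8) | hdr[15]);
}

/* Hypothetical helper: window in bytes, i.e. the raw field shifted left
 * by the window scale factor (0..14). */
static uint32_t tcp_scaled_window(uint16_t raw, uint8_t wscale)
{
    return (uint32_t)raw << wscale;
}
```

With a scale factor of 7, a raw field value of 500 stands for 64000 bytes. The window carried in a SYN segment itself is never scaled, so a scale factor of 0 applies there.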

How the send window and the receive window are updated:

MSS

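Let's look at the MSS first, since it plays an important role in adjusting the receive window.

Using the MSS (Maximum Segment Size), TCP splits data into blocks that it considers suitable for transmission; these blocks are called segments.

Note: a segment in this sense contains only data and excludes the protocol headers!

The parameter most closely related to the MSS is the MTU (Maximum Transmission Unit) of the network interface.

The path MTU between two hosts is not necessarily constant; it depends on the route chosen at the time. Routing is also not necessarily symmetric (the route from A to B may differ from the route from B to A), so the path MTU may differ between the two directions.

Consequently, the effective MSS from A to B and the effective MSS from B to A change dynamically and may differ from each other.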
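The link between MTU and MSS can be illustrated with a small sketch. It assumes IPv4 and TCP headers without options (20 bytes each); the function name mss_from_mtu is hypothetical.

```c
#include <stdint.h>

#define IPV4_HDR_MIN 20u /* IPv4 header without options */
#define TCP_HDR_MIN  20u /* TCP header without options */

/* The MSS counts payload only, so both minimal headers are subtracted
 * from the MTU. Returns 0 if the MTU cannot even hold the headers. */
static uint32_t mss_from_mtu(uint32_t mtu)
{
    const uint32_t overhead = IPV4_HDR_MIN + TCP_HDR_MIN;

    return mtu > overhead ? mtu - overhead : 0;
}
```

For an Ethernet MTU of 1500 this gives 1460. If the two directions of a connection cross links with different MTUs, the same calculation yields a different MSS for each direction.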

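Each end holds several different MSS values at the same time:

(1) tp->advmss

The MSS this end advertises when establishing the connection; it is the upper limit on the MSS this end can receive.

It is obtained from the route cache (dst->metrics[RTAX_ADVMSS - 1]) and is typically 1460.

(2) tp->rx_opt.mss_clamp

The upper limit on the MSS the peer can receive: min(tp->rx_opt.user_mss, MSS advertised by the peer at connection setup), where user_mss is taken into account only when it is nonzero.

(3) tp->mss_cache

The current effective sending MSS of this end. It obviously cannot exceed what the peer can receive: tp->mss_cache <= tp->rx_opt.mss_clamp.

(4) tp->rx_opt.user_mss

The MSS limit set by the user through the TCP_MAXSEG option; it influences the receive MSS limits of both this end and the peer.

(5) icsk->icsk_ack.rcv_mss

An estimate of the peer's effective sending MSS. It obviously cannot exceed what this end can receive: icsk->icsk_ack.rcv_mss <= tp->advmss.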
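The relationships among these values can be sketched in user-space C as follows. This is only an illustration of the inequalities listed above, not the kernel's code; the function names are hypothetical, and a user_mss of 0 is assumed to mean that TCP_MAXSEG was not set.

```c
#include <stdint.h>

/* Sketch of mss_clamp: the peer's advertised MSS, lowered by user_mss
 * when the user has set one (0 is assumed to mean "not set"). */
static uint16_t mss_clamp_sketch(uint16_t user_mss, uint16_t peer_mss)
{
    if (user_mss && user_mss < peer_mss)
        return user_mss;
    return peer_mss;
}

/* Sketch of the effective sending MSS: whatever the path currently
 * allows, capped by the clamp, so that mss_cache <= mss_clamp holds. */
static uint16_t send_mss_sketch(uint16_t path_mss, uint16_t mss_clamp)
{
    return path_mss < mss_clamp ? path_mss : mss_clamp;
}
```

For example, with a peer that advertised 1460 and a user limit of 1200, the clamp is 1200, and the sending MSS can never rise above it even if the path would allow 1460.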

Receive buffer

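The receive buffer sk->sk_rcvbuf is split into two parts:

(1) the network buffer, usually 3/4 of the total; this is the part the protocol can use.

(2) the application buffer, usually 1/4.

When computing the receive buffer space available to a connection, the whole sk_rcvbuf is not used. This guards against the receive buffer being exhausted when the application reads data more slowly than packets arrive from the network.

A more detailed explanation: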

The idea is not to use a complete receive buffer space to calculate the receive buffer. We reserve some space as an application buffer, and the rest is used to queue incoming data segments. An application buffer corresponds to the space that should compensate for the delay in time it takes for an application to read from the socket buffer.

If the application is reading more slowly than the rate at which data are arriving, data will be queued in the receive buffer. In order to avoid queue getting full, we advertise less receive window so that the sender can slow down the rate of data transmission and by that time the application gets a chance to read data from the receiver buffer.
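The split can be sketched as below. The function is modelled on the kernel's tcp_win_from_space(), but it is a user-space approximation: the name win_from_space_sketch is hypothetical, and the scale is passed as a parameter standing in for the tcp_adv_win_scale sysctl. A scale of 2 reproduces the 3/4 network share described above.

```c
/* Sketch: how much of a buffer of `space` bytes may be offered as window.
 * adv_win_scale stands in for the tcp_adv_win_scale sysctl:
 *   scale > 0  -> reserve space / 2^scale for the application
 *   scale <= 0 -> offer only space / 2^(-scale) as window */
static int win_from_space_sketch(int space, int adv_win_scale)
{
    if (adv_win_scale <= 0)
        return space >> (-adv_win_scale);
    return space - (space >> adv_win_scale);
}
```

With a receive buffer of 87380 bytes and a scale of 2, the application share is 21845 bytes and the window space is 65535 bytes.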

The minimum real memory consumption (truesize) of an skb carrying X bytes of data:

/* return minimum truesize of one skb containing X bytes of data; X here includes the protocol headers */
#define SKB_TRUESIZE(X) ((X) +  \
                    SKB_DATA_ALIGN(sizeof(struct sk_buff)) + \
                    SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
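To get a feel for the overhead this macro adds, here is a user-space sketch. The real structure sizes depend on the kernel configuration and architecture, so they are passed in as parameters; the 64-byte cache line and the sizes used in the example afterwards are assumptions chosen for illustration only.

```c
#include <stddef.h>

#define CACHE_LINE 64u /* assumed value of SMP_CACHE_BYTES */

/* Round x up to a multiple of the assumed cache line size, in the
 * manner of SKB_DATA_ALIGN. */
#define ALIGN_UP(x) (((x) + (CACHE_LINE - 1)) & ~(size_t)(CACHE_LINE - 1))

/* Sketch of SKB_TRUESIZE: payload plus the two aligned metadata
 * structures. skb_size and shinfo_size are hypothetical stand-ins for
 * sizeof(struct sk_buff) and sizeof(struct skb_shared_info). */
static size_t skb_truesize_sketch(size_t x, size_t skb_size, size_t shinfo_size)
{
    return x + ALIGN_UP(skb_size) + ALIGN_UP(shinfo_size);
}
```

With hypothetical sizes of 240 and 320 bytes, a 1500-byte packet is charged at least 2076 bytes, which is one reason the usable window is smaller than sk_rcvbuf.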

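Initializing the receive window

Starting with the simplest case, let's see how the initial receive window and the receive window scale factor are chosen.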

/* Determine a window scaling and initial window to offer.
 * Based on the assumption that the given amount of space will be offered.
 * Store the results in the tp structure.
 * NOTE: for smooth operation initial space offering should be a multiple of mss
 * if possible. We assume here that mss >= 1. This MUST be enforced by all callers.
 */

void tcp_select_initial_window (int __space, __u32 mss, __u32 *rcv_wnd, __u32 *window_clamp,
                                int wscale_ok, __u8 *rcv_wscale, __u32 init_rcv_wnd)
{
    unsigned int space = (__space < 0 ? 0 : __space); /* the receive buffer space cannot be negative */

    /* If no clamp set the clamp to the max possible scaled window.
     * If the initial upper limit of the receive window is 0, set it to the maximum.
     */
    if (*window_clamp == 0)
        (*window_clamp) = (65535 << 14); /* the largest possible receive window */
 
    /* The receive window cannot exceed its upper limit */
    space = min(*window_clamp, space);

    /* Quantize space offering to a multiple of mss if possible.
     * The receive window size should preferably be a multiple of mss.
     */
    if (space > mss)
        space = (space / mss) * mss; /* round space down to a multiple of mss */
 
    /* NOTE: offering an initial window larger than 32767 will break some
     * buggy TCP stacks. If the admin tells us it is likely we could be speaking
     * with such a buggy stack we will truncate our initial window offering to
     * 32K - 1 unless the remote has sent us a window scaling option, which
     * we interpret as a sign the remote TCP is not misinterpreting the window
     * field as a signed quantity.
     */
    /* If the peer's stack may treat the window as a signed quantity, the receive window must not exceed 32767 */
    if (sysctl_tcp_workaround_signed_windows)
        (*rcv_wnd) = min(space, MAX_TCP_WINDOW);
    else
        (*rcv_wnd) = space;
 
    (*rcv_wscale) = 0;
    /* Compute the receive window scale factor rcv_wscale: how large must it
     * be to represent the largest receive window of this connection?
     */
    if (wscale_ok) {
        /* Set window scaling on max possible window
         * See RFC1323 for an explanation of the limit to 14
         * tcp_rmem[2] is the maximum receive buffer limit, used when tuning sk_rcvbuf.
         * rmem_max is the system-wide maximum receive buffer size.
         */
        space = max_t(u32, sysctl_tcp_rmem[2], sysctl_rmem_max);
        space = min_t(u32, space, *window_clamp); /* limited by this connection */

        while (space > 65535 && (*rcv_wscale) < 14) {
            space >>= 1;
            (*rcv_wscale)++;
        }
    }
 
    /* Set initial window to a value enough for senders starting with initial
     * congestion window of TCP_DEFAULT_INIT_RCVWND. Place a limit on the
     * initial window when mss is larger than 1460.
     *
     * The initial receive window is determined here, typically about 10 segments.
     */
    if (mss > (1 << *rcv_wscale)) {
        int init_cwnd = TCP_DEFAULT_INIT_RCVWND; /* 10 */
        if (mss > 1460)
            init_cwnd = max_t(u32, (1460 * TCP_DEFAULT_INIT_RCVWND) / mss, 2);

        /* when initializing use the value from init_rcv_wnd rather than the
         * default from above.
         * The value from the route cache takes precedence; if there is none,
         * the system default is used.
         */
        if (init_rcv_wnd) /* the route cache holds a nonzero initial receive window */
            *rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
        else
            *rcv_wnd = min(*rcv_wnd, init_cwnd * mss);
    }
 
    /* Set the clamp no higher than max representable value */
    (*window_clamp) = min(65535 << (*rcv_wscale), *window_clamp);
}

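How the initial receive window is chosen (a multiple of mss):

(1) First consider RTAX_INITRWND from the route cache.

(2) Then consider the system default TCP_DEFAULT_INIT_RCVWND (10).

(3) Finally, use min(3/4 * sk_rcvbuf, window_clamp) if that value is smaller.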
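The three steps above can be condensed into a simplified user-space sketch. It deliberately leaves out the signed-window workaround, the adjustment for mss > 1460 and the check against the window scale factor, so it is not a full port of tcp_select_initial_window. The function name is hypothetical, and mss is assumed to be at least 1.

```c
#include <stdint.h>

#define DEFAULT_INIT_RCVWND 10u /* mirrors TCP_DEFAULT_INIT_RCVWND */

/* Simplified sketch of the initial receive window choice.
 * space:          window space derived from the receive buffer
 * window_clamp:   per-connection upper limit, 0 meaning "no limit set"
 * route_initrwnd: RTAX_INITRWND from the route, 0 meaning "not set" */
static uint32_t initial_rcv_wnd_sketch(uint32_t space, uint32_t window_clamp,
                                       uint32_t mss, uint32_t route_initrwnd)
{
    uint32_t segs = route_initrwnd ? route_initrwnd : DEFAULT_INIT_RCVWND;
    uint32_t wnd;

    if (window_clamp && space > window_clamp)
        space = window_clamp;
    if (space > mss)
        space = (space / mss) * mss; /* round down to a multiple of mss */

    wnd = segs * mss;
    return space < wnd ? space : wnd;
}
```

With 65535 bytes of window space, an mss of 1460 and no route metric, the result is 10 * 1460 = 14600 bytes. With only 8192 bytes of space the buffer becomes the limit and the result is 7300 bytes (5 segments).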

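How the window scale factor is chosen:

The largest receive window is taken to be max(tcp_rmem[2], rmem_max), so the largest receive window of this connection is min(max(tcp_rmem[2], rmem_max), window_clamp).

How large must the window scale factor be for the 16-bit field to represent this largest receive window?

If the largest receive window is limited by tcp_rmem[2] = 4194304, then rcv_wscale = 7 and the window is scaled by a factor of 128: 4194304 >> 6 = 65536 still exceeds 65535, while 4194304 >> 7 = 32768 fits.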
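The scale factor calculation is easy to reproduce outside the kernel. The sketch below repeats the shift loop from tcp_select_initial_window; the function name rcv_wscale_sketch is hypothetical.

```c
#include <stdint.h>

/* Smallest shift (capped at 14) that makes max_window fit into the
 * 16-bit window field. */
static uint8_t rcv_wscale_sketch(uint32_t max_window)
{
    uint8_t wscale = 0;

    while (max_window > 65535 && wscale < 14) {
        max_window >>= 1;
        wscale++;
    }
    return wscale;
}
```

A maximum window of 4194304 bytes yields 7, anything up to 65535 bytes yields 0, and very large values are capped at 14.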

Call path when sending a SYN/ACK: tcp_v4_send_synack -> tcp_make_synack -> tcp_select_initial_window.

/* Prepare a SYN-ACK. */
struct sk_buff *tcp_make_synack (struct sock *sk, struct dst_entry *dst, 
                                 struct request_sock *req, struct request_values *rvp)
{
    struct inet_request_sock *ireq = inet_rsk(req);
    struct tcp_sock *tp = tcp_sk(sk);
    struct tcphdr *th;
    struct sk_buff *skb;
    ...
    mss = dst_metric_advmss(dst); /* the mss from the route cache */
    /* if the user has set a specific value, take the smaller of the two */
    if (tp->rx_opt.user_mss && tp->rx_opt.user_mss < mss)
        mss = tp->rx_opt.user_mss;
 
    if (req->rcv_wnd == 0) { /* ignored for retransmitted syns */
        __u8 rcv_wscale;

        /* Set this up on the first call only */
        req->window_clamp = tp->window_clamp ? : dst_metric(dst, RTAX_WINDOW);

        /* limit the window selection if the user enforce a smaller rx buffer */
        if (sk->sk_userlocks & SOCK_RCVBUF_LOCK && 
            (req->window_clamp > tcp_full_space(sk) || req->window_clamp == 0))
            req->window_clamp = tcp_full_space(sk);
 
        /* tcp_full_space because it is guaranteed to be the first packet */
        tcp_select_initial_window(tcp_full_space(sk), 
                            mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
                            &req->rcv_wnd,
                            &req->window_clamp,
                            ireq->wscale_ok,
                            &rcv_wscale,
                            dst_metric(dst, RTAX_INITRWND));

        ireq->rcv_wscale = rcv_wscale;
    }
    ...
}
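The mss argument that tcp_make_synack passes down can be sketched as follows. This is a user-space illustration with a hypothetical function name; it assumes the resulting mss is larger than the 12 bytes that an aligned timestamp option occupies.

```c
#include <stdint.h>

#define TSTAMP_OPT_ALIGNED 12u /* mirrors TCPOLEN_TSTAMP_ALIGNED */

/* Sketch: the mss used to size the initial window in a SYN-ACK.
 * route_advmss: advertised MSS from the route
 * user_mss:     TCP_MAXSEG value, 0 meaning "not set"
 * tstamp_ok:    nonzero if timestamps were negotiated
 * Assumes the chosen mss exceeds TSTAMP_OPT_ALIGNED. */
static uint32_t synack_window_mss_sketch(uint32_t route_advmss,
                                         uint32_t user_mss, int tstamp_ok)
{
    uint32_t mss = route_advmss;

    if (user_mss && user_mss < mss)
        mss = user_mss;
    return mss - (tstamp_ok ? TSTAMP_OPT_ALIGNED : 0u);
}
```

With a route MSS of 1460 and timestamps enabled, each segment carries at most 1448 bytes of data, so the initial window is quantized in units of 1448 rather than 1460.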