1. 程式人生 > >Linux核心--網路棧實現分析(二)--資料包的傳遞過程(上)

Linux核心--網路棧實現分析(二)--資料包的傳遞過程(上)

本文分析基於Linux Kernel 1.2.13

作者:閆明

注:標題中的”(上)“,”(下)“表示分析過程基於資料包的傳遞方向:”(上)“表示分析是從底層向上分析、”(下)“表示分析是從上向下分析。

上一篇博文中我們從巨集觀上分析了Linux核心中網路棧的初始化過程,這裡我們再從巨集觀上分析一下一個資料包在各網路層的傳遞的過程。

我們知道網路的OSI模型和TCP/IP模型層次結構如下:


上文中我們看到了網路棧的層次結構:


我們就從最底層開始追溯一個數據包的傳遞流程。

1、網路介面層

* 硬體監聽物理介質,進行資料的接收,當接收的資料填滿了緩衝區,硬體就會產生中斷,中斷產生後,系統會轉向中斷服務子程式。

* 在中斷服務子程式中,資料會從硬體的緩衝區複製到核心的空間緩衝區,幷包裝成一個數據結構(sk_buff),然後呼叫對驅動層的介面函式netif_rx()將資料包傳送給鏈路層。該函式的實現在net/inet/dev.c中,(在整個網路棧實現中dev.c檔案的作用重大,它銜接了其下的驅動層和其上的網路層,可以稱它為鏈路層模組的實現)

該函式的實現如下:

/*
 *	Receive a packet from a device driver and queue it for the upper
 *	(protocol) levels.  It always succeeds. This is the recommended 
 *	interface to use.
 *    從裝置驅動層接受到的資料傳送到協議的
 *    上層,該函式實際是一個介面。
 */

void netif_rx(struct sk_buff *skb)
{
	static int dropping = 0;

	/*
	 *	Any received buffers are un-owned and should be discarded
	 *	when freed. These will be updated later as the frames get
	 *	owners.
	 */
	skb->sk = NULL;
	skb->free = 1;
	if(skb->stamp.tv_sec==0)
		skb->stamp = xtime;

	/*
	 *	Check that we aren't overdoing things.
	 */

	if (!backlog_size)
  		dropping = 0;
	else if (backlog_size > 300)
		dropping = 1;

	if (dropping) 
	{
		kfree_skb(skb, FREE_READ);
		return;
	}

	/*
	 *	Add it to the "backlog" queue. 
	 */
#ifdef CONFIG_SKB_CHECK
	IS_SKB(skb);
#endif	
	skb_queue_tail(&backlog,skb);//加入佇列backlog
	backlog_size++;
  
	/*
	 *	If any packet arrived, mark it for processing after the
	 *	hardware interrupt returns.
	 */

	mark_bh(NET_BH);//下半部分bottom half技術可以減少中斷處理程式的執行時間
	return;
}

該函式中用到了bootom half技術,該技術的原理是將中斷處理程式人為的分為兩部分,上半部分是實時性要求較高的任務,後半部分可以稍後完成,這樣就可以節省中斷程式的處理時間。可整體的提高系統的效能。該技術將會在後續的博文中詳細分析。

我們從上一篇分析中知道,在網路棧初始化的時候,已經將NET的下半部分執行函式定義成了net_bh(在socket.c檔案中1375行左右)

bh_base[NET_BH].routine= net_bh;//設定NET 下半部分的處理函式為net_bh

* 函式net_bh的實現在net/inet/dev.c中
/*
 *	When we are called the queue is ready to grab, the interrupts are
 *	on and hardware can interrupt and queue to the receive queue a we
 *	run with no problems.
 *	This is run as a bottom half after an interrupt handler that does
 *	mark_bh(NET_BH);
 */
 
void net_bh(void *tmp)
{
	struct sk_buff *skb;
	struct packet_type *ptype;
	struct packet_type *pt_prev;
	unsigned short type;

	/*
	 *	Atomically check and mark our BUSY state. 
	 */

	if (set_bit(1, (void*)&in_bh))//標記BUSY狀態
		return;

	/*
	 *	Can we send anything now? We want to clear the
	 *	decks for any more sends that get done as we
	 *	process the input.
	 */

	dev_transmit();//呼叫dev_tinit()函式傳送資料
  
	/*
	 *	Any data left to process. This may occur because a
	 *	mark_bh() is done after we empty the queue including
	 *	that from the device which does a mark_bh() just after
	 */

	cli();//防止佇列操作錯誤,需要關中斷和開中斷
	
	/*
	 *	While the queue is not empty
	 */
	 
	while((skb=skb_dequeue(&backlog))!=NULL)//出隊直到佇列為空
	{
		/*
		 *	We have a packet. Therefore the queue has shrunk
		 */
  		backlog_size--;//佇列元素個數減一

		sti();
		
	       /*
		*	Bump the pointer to the next structure.
		*	This assumes that the basic 'skb' pointer points to
		*	the MAC header, if any (as indicated by its "length"
		*	field).  Take care now!
		*/

		skb->h.raw = skb->data + skb->dev->hard_header_len;
		skb->len -= skb->dev->hard_header_len;

	       /*
		* 	Fetch the packet protocol ID.  This is also quite ugly, as
		* 	it depends on the protocol driver (the interface itself) to
		* 	know what the type is, or where to get it from.  The Ethernet
		* 	interfaces fetch the ID from the two bytes in the Ethernet MAC
		*	header (the h_proto field in struct ethhdr), but other drivers
		*	may either use the ethernet ID's or extra ones that do not
		*	clash (eg ETH_P_AX25). We could set this before we queue the
		*	frame. In fact I may change this when I have time.
		*/
		
		type = skb->dev->type_trans(skb, skb->dev);//取出該資料包所屬的協議型別

		/*
		 *	We got a packet ID.  Now loop over the "known protocols"
		 *	table (which is actually a linked list, but this will
		 *	change soon if I get my way- FvK), and forward the packet
		 *	to anyone who wants it.
		 *
		 *	[FvK didn't get his way but he is right this ought to be
		 *	hashed so we typically get a single hit. The speed cost
		 *	here is minimal but no doubt adds up at the 4,000+ pkts/second
		 *	rate we can hit flat out]
		 */
		pt_prev = NULL;
		for (ptype = ptype_base; ptype != NULL; ptype = ptype->next) //遍歷ptype_base所指向的網路協議佇列
		{
		    //判斷協議號是否匹配
			if ((ptype->type == type || ptype->type == htons(ETH_P_ALL)) && (!ptype->dev || ptype->dev==skb->dev))
			{
				/*
				 *	We already have a match queued. Deliver
				 *	to it and then remember the new match
				 */
				if(pt_prev)
				{
					struct sk_buff *skb2;

					skb2=skb_clone(skb, GFP_ATOMIC);//複製資料包結構

					/*
					 *	Kick the protocol handler. This should be fast
					 *	and efficient code.
					 */

					if(skb2)
						pt_prev->func(skb2, skb->dev, pt_prev);//呼叫相應協議的處理函式,
											//這裡和網路協議的種類有關係
											//如IP 協議的處理函式就是ip_rcv
				}
				/* Remember the current last to do */
				pt_prev=ptype;
			}
		} /* End of protocol list loop */
		
		/*
		 *	Is there a last item to send to ?
		 */

		if(pt_prev)
			pt_prev->func(skb, skb->dev, pt_prev);
		/*
		 * 	Has an unknown packet has been received ?
		 */
	 
		else
			kfree_skb(skb, FREE_WRITE);

		/*
		 *	Again, see if we can transmit anything now. 
		 *	[Ought to take this out judging by tests it slows
		 *	 us down not speeds us up]
		 */

		dev_transmit();
		cli();
  	}	/* End of queue loop */
  	
  	/*
  	 *	We have emptied the queue
  	 */
  	 
  	in_bh = 0;//BUSY狀態還原
	sti();
	
	/*
	 *	One last output flush.
	 */
	 
	dev_transmit();//清空緩衝區
}
2、網路層
* 就以IP資料包為例來說明,那麼從鏈路層向網路層傳遞時將呼叫ip_rcv函式。該函式完成本層的處理後會根據IP首部中使用的傳輸層協議來呼叫相應協議的處理函式。

UDP對應udp_rcv、TCP對應tcp_rcv、ICMP對應icmp_rcv、IGMP對應igmp_rcv(雖然這裡的ICMP,IGMP一般成為網路層協議,但是實際上他們都封裝在IP協議裡面,作為傳輸層對待)

這個函式比較複雜,後續會詳細分析。這裡貼上一下,讓我們對整體瞭解更清楚

/*
 *	This function receives all incoming IP datagrams.
 */

int ip_rcv(struct sk_buff *skb, struct device *dev, struct packet_type *pt)
{
	struct iphdr *iph = skb->h.iph;
	struct sock *raw_sk=NULL;
	unsigned char hash;
	unsigned char flag = 0;
	unsigned char opts_p = 0;	/* Set iff the packet has options. */
	struct inet_protocol *ipprot;
	static struct options opt; /* since we don't use these yet, and they
				take up stack space. */
	int brd=IS_MYADDR;
	int is_frag=0;
#ifdef CONFIG_IP_FIREWALL
	int err;
#endif	

	ip_statistics.IpInReceives++;

	/*
	 *	Tag the ip header of this packet so we can find it
	 */

	skb->ip_hdr = iph;

	/*
	 *	Is the datagram acceptable?
	 *
	 *	1.	Length at least the size of an ip header
	 *	2.	Version of 4
	 *	3.	Checksums correctly. [Speed optimisation for later, skip loopback checksums]
	 *	(4.	We ought to check for IP multicast addresses and undefined types.. does this matter ?)
	 */

	if (skb->len<sizeof(struct iphdr) || iph->ihl<5 || iph->version != 4 ||
		skb->len<ntohs(iph->tot_len) || ip_fast_csum((unsigned char *)iph, iph->ihl) !=0)
	{
		ip_statistics.IpInHdrErrors++;
		kfree_skb(skb, FREE_WRITE);
		return(0);
	}
	
	/*
	 *	See if the firewall wants to dispose of the packet. 
	 */

#ifdef	CONFIG_IP_FIREWALL
	
	if ((err=ip_fw_chk(iph,dev,ip_fw_blk_chain,ip_fw_blk_policy, 0))!=1)
	{
		if(err==-1)
			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0, dev);
		kfree_skb(skb, FREE_WRITE);
		return 0;	
	}

#endif
	
	/*
	 *	Our transport medium may have padded the buffer out. Now we know it
	 *	is IP we can trim to the true length of the frame.
	 */

	skb->len=ntohs(iph->tot_len);

	/*
	 *	Next analyse the packet for options. Studies show under one packet in
	 *	a thousand have options....
	 */

	if (iph->ihl != 5)
	{  	/* Fast path for the typical optionless IP packet. */
		memset((char *) &opt, 0, sizeof(opt));
		if (do_options(iph, &opt) != 0)
			return 0;
		opts_p = 1;
	}

	/*
	 *	Remember if the frame is fragmented.
	 */
	 
	if(iph->frag_off)
	{
		if (iph->frag_off & 0x0020)
			is_frag|=1;
		/*
		 *	Last fragment ?
		 */
	
		if (ntohs(iph->frag_off) & 0x1fff)
			is_frag|=2;
	}
	
	/*
	 *	Do any IP forwarding required.  chk_addr() is expensive -- avoid it someday.
	 *
	 *	This is inefficient. While finding out if it is for us we could also compute
	 *	the routing table entry. This is where the great unified cache theory comes
	 *	in as and when someone implements it
	 *
	 *	For most hosts over 99% of packets match the first conditional
	 *	and don't go via ip_chk_addr. Note: brd is set to IS_MYADDR at
	 *	function entry.
	 */

	if ( iph->daddr != skb->dev->pa_addr && (brd = ip_chk_addr(iph->daddr)) == 0)
	{
		/*
		 *	Don't forward multicast or broadcast frames.
		 */

		if(skb->pkt_type!=PACKET_HOST || brd==IS_BROADCAST)
		{
			kfree_skb(skb,FREE_WRITE);
			return 0;
		}

		/*
		 *	The packet is for another target. Forward the frame
		 */

#ifdef CONFIG_IP_FORWARD
		ip_forward(skb, dev, is_frag);
#else
/*		printk("Machine %lx tried to use us as a forwarder to %lx but we have forwarding disabled!\n",
			iph->saddr,iph->daddr);*/
		ip_statistics.IpInAddrErrors++;
#endif
		/*
		 *	The forwarder is inefficient and copies the packet. We
		 *	free the original now.
		 */

		kfree_skb(skb, FREE_WRITE);
		return(0);
	}
	
#ifdef CONFIG_IP_MULTICAST	

	if(brd==IS_MULTICAST && iph->daddr!=IGMP_ALL_HOSTS && !(dev->flags&IFF_LOOPBACK))
	{
		/*
		 *	Check it is for one of our groups
		 */
		struct ip_mc_list *ip_mc=dev->ip_mc_list;
		do
		{
			if(ip_mc==NULL)
			{	
				kfree_skb(skb, FREE_WRITE);
				return 0;
			}
			if(ip_mc->multiaddr==iph->daddr)
				break;
			ip_mc=ip_mc->next;
		}
		while(1);
	}
#endif
	/*
	 *	Account for the packet
	 */
	 
#ifdef CONFIG_IP_ACCT
	ip_acct_cnt(iph,dev, ip_acct_chain);
#endif	

	/*
	 * Reassemble IP fragments.
	 */

	if(is_frag)
	{
		/* Defragment. Obtain the complete packet if there is one */
		skb=ip_defrag(iph,skb,dev);
		if(skb==NULL)
			return 0;
		skb->dev = dev;
		iph=skb->h.iph;
	}
	
		 

	/*
	 *	Point into the IP datagram, just past the header.
	 */

	skb->ip_hdr = iph;
	skb->h.raw += iph->ihl*4;
	
	/*
	 *	Deliver to raw sockets. This is fun as to avoid copies we want to make no surplus copies.
	 */
	 
	hash = iph->protocol & (SOCK_ARRAY_SIZE-1);
	
	/* If there maybe a raw socket we must check - if not we don't care less */
	if((raw_sk=raw_prot.sock_array[hash])!=NULL)
	{
		struct sock *sknext=NULL;
		struct sk_buff *skb1;
		raw_sk=get_sock_raw(raw_sk, hash,  iph->saddr, iph->daddr);
		if(raw_sk)	/* Any raw sockets */
		{
			do
			{
				/* Find the next */
				sknext=get_sock_raw(raw_sk->next, hash, iph->saddr, iph->daddr);
				if(sknext)
					skb1=skb_clone(skb, GFP_ATOMIC);
				else
					break;	/* One pending raw socket left */
				if(skb1)
					raw_rcv(raw_sk, skb1, dev, iph->saddr,iph->daddr);
				raw_sk=sknext;
			}
			while(raw_sk!=NULL);
			/* Here either raw_sk is the last raw socket, or NULL if none */
			/* We deliver to the last raw socket AFTER the protocol checks as it avoids a surplus copy */
		}
	}
	
	/*
	 *	skb->h.raw now points at the protocol beyond the IP header.
	 */

	hash = iph->protocol & (MAX_INET_PROTOS -1);
	for (ipprot = (struct inet_protocol *)inet_protos[hash];ipprot != NULL;ipprot=(struct inet_protocol *)ipprot->next)
	{
		struct sk_buff *skb2;

		if (ipprot->protocol != iph->protocol)
			continue;
       /*
	* 	See if we need to make a copy of it.  This will
	* 	only be set if more than one protocol wants it.
	* 	and then not for the last one. If there is a pending
	*	raw delivery wait for that
	*/
		if (ipprot->copy || raw_sk)
		{
			skb2 = skb_clone(skb, GFP_ATOMIC);
			if(skb2==NULL)
				continue;
		}
		else
		{
			skb2 = skb;
		}
		flag = 1;

	       /*
		* Pass on the datagram to each protocol that wants it,
		* based on the datagram protocol.  We should really
		* check the protocol handler's return values here...
		*/
		ipprot->handler(skb2, dev, opts_p ? &opt : 0, iph->daddr,
				(ntohs(iph->tot_len) - (iph->ihl * 4)),
				iph->saddr, 0, ipprot);

	}

	/*
	 * All protocols checked.
	 * If this packet was a broadcast, we may *not* reply to it, since that
	 * causes (proven, grin) ARP storms and a leakage of memory (i.e. all
	 * ICMP reply messages get queued up for transmission...)
	 */

	if(raw_sk!=NULL)	/* Shift to last raw user */
		raw_rcv(raw_sk, skb, dev, iph->saddr, iph->daddr);
	else if (!flag)		/* Free and report errors */
	{
		if (brd != IS_BROADCAST && brd!=IS_MULTICAST)
			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PROT_UNREACH, 0, dev);
		kfree_skb(skb, FREE_WRITE);
	}

	return(0);
}

3、傳輸層

如果在IP資料報的首部標明的是使用TCP傳輸資料,則在上述函式中會呼叫tcp_rcv函式。該函式的大體處理流程為:

“所有使用TCP 協議的套接字對應sock 結構都被掛入tcp_prot 全域性變量表示的proto 結構之sock_array 陣列中,採用以本地埠號為索引的插入方式,所以當tcp_rcv 函式接收到一個數據包,在完成必要的檢查和處理後,其將以TCP 協議首部中目的埠號(對於一個接收的資料包而言,其目的埠號就是本地所使用的埠號)為索引,在tcp_prot 對應sock 結構之sock_array 陣列中得到正確的sock 結構佇列,在輔之以其他條件遍歷該佇列進行對應sock 結構的查詢,在得到匹配的sock 結構後,將資料包掛入該sock 結構中的快取佇列中(由sock 結構中receive_queue 欄位指向),從而完成資料包的最終接收。”

該函式的實現也會比較複雜,這是由TCP協議的複雜功能決定的。附程式碼如下:

/*
 *	A TCP packet has arrived.
 */
 
int tcp_rcv(struct sk_buff *skb, struct device *dev, struct options *opt,
	unsigned long daddr, unsigned short len,
	unsigned long saddr, int redo, struct inet_protocol * protocol)
{
	struct tcphdr *th;
	struct sock *sk;
	int syn_ok=0;
	
	if (!skb) 
	{
		printk("IMPOSSIBLE 1\n");
		return(0);
	}

	if (!dev) 
	{
		printk("IMPOSSIBLE 2\n");
		return(0);
	}
  
	tcp_statistics.TcpInSegs++;
  
	if(skb->pkt_type!=PACKET_HOST)
	{
	  	kfree_skb(skb,FREE_READ);
	  	return(0);
	}
  
	th = skb->h.th;

	/*
	 *	Find the socket.
	 */

	sk = get_sock(&tcp_prot, th->dest, saddr, th->source, daddr);

	/*
	 *	If this socket has got a reset it's to all intents and purposes 
  	 *	really dead. Count closed sockets as dead.
  	 *
  	 *	Note: BSD appears to have a bug here. A 'closed' TCP in BSD
  	 *	simply drops data. This seems incorrect as a 'closed' TCP doesn't
  	 *	exist so should cause resets as if the port was unreachable.
  	 */
  	 
	if (sk!=NULL && (sk->zapped || sk->state==TCP_CLOSE))
		sk=NULL;

	if (!redo) 
	{
		if (tcp_check(th, len, saddr, daddr )) 
		{
			skb->sk = NULL;
			kfree_skb(skb,FREE_READ);
			/*
			 *	We don't release the socket because it was
			 *	never marked in use.
			 */
			return(0);
		}
		th->seq = ntohl(th->seq);

		/* See if we know about the socket. */
		if (sk == NULL) 
		{
			/*
			 *	No such TCB. If th->rst is 0 send a reset (checked in tcp_reset)
			 */
			tcp_reset(daddr, saddr, th, &tcp_prot, opt,dev,skb->ip_hdr->tos,255);
			skb->sk = NULL;
			/*
			 *	Discard frame
			 */
			kfree_skb(skb, FREE_READ);
			return(0);
		}

		skb->len = len;
		skb->acked = 0;
		skb->used = 0;
		skb->free = 0;
		skb->saddr = daddr;
		skb->daddr = saddr;
	
		/* We may need to add it to the backlog here. */
		cli();
		if (sk->inuse) 
		{
			skb_queue_tail(&sk->back_log, skb);
			sti();
			return(0);
		}
		sk->inuse = 1;
		sti();
	}
	else
	{
		if (sk==NULL) 
		{
			tcp_reset(daddr, saddr, th, &tcp_prot, opt,dev,skb->ip_hdr->tos,255);
			skb->sk = NULL;
			kfree_skb(skb, FREE_READ);
			return(0);
		}
	}


	if (!sk->prot) 
	{
		printk("IMPOSSIBLE 3\n");
		return(0);
	}


	/*
	 *	Charge the memory to the socket. 
	 */
	 
	if (sk->rmem_alloc + skb->mem_len >= sk->rcvbuf) 
	{
		kfree_skb(skb, FREE_READ);
		release_sock(sk);
		return(0);
	}

	skb->sk=sk;
	sk->rmem_alloc += skb->mem_len;

	/*
	 *	This basically follows the flow suggested by RFC793, with the corrections in RFC1122. We
	 *	don't implement precedence and we process URG incorrectly (deliberately so) for BSD bug
	 *	compatibility. We also set up variables more thoroughly [Karn notes in the
	 *	KA9Q code the RFC793 incoming segment rules don't initialise the variables for all paths].
	 */

	if(sk->state!=TCP_ESTABLISHED)		/* Skip this lot for normal flow */
	{
	
		/*
		 *	Now deal with unusual cases.
		 */
	 
		if(sk->state==TCP_LISTEN)
		{
			if(th->ack)	/* These use the socket TOS.. might want to be the received TOS */
				tcp_reset(daddr,saddr,th,sk->prot,opt,dev,sk->ip_tos, sk->ip_ttl);

			/*
			 *	We don't care for RST, and non SYN are absorbed (old segments)
			 *	Broadcast/multicast SYN isn't allowed. Note - bug if you change the
			 *	netmask on a running connection it can go broadcast. Even Sun's have
			 *	this problem so I'm ignoring it 
			 */
			   
			if(th->rst || !th->syn || th->ack || ip_chk_addr(daddr)!=IS_MYADDR)
			{
				kfree_skb(skb, FREE_READ);
				release_sock(sk);
				return 0;
			}
		
			/*	
			 *	Guess we need to make a new socket up 
			 */
		
			tcp_conn_request(sk, skb, daddr, saddr, opt, dev, tcp_init_seq());
		
			/*
			 *	Now we have several options: In theory there is nothing else
			 *	in the frame. KA9Q has an option to send data with the syn,
			 *	BSD accepts data with the syn up to the [to be] advertised window
			 *	and Solaris 2.1 gives you a protocol error. For now we just ignore
			 *	it, that fits the spec precisely and avoids incompatibilities. It
			 *	would be nice in future to drop through and process the data.
			 */
			 
			release_sock(sk);
			return 0;
		}
	
		/* retransmitted SYN? */
		if (sk->state == TCP_SYN_RECV && th->syn && th->seq+1 == sk->acked_seq)
		{
			kfree_skb(skb, FREE_READ);
			release_sock(sk);
			return 0;
		}
		
		/*
		 *	SYN sent means we have to look for a suitable ack and either reset
		 *	for bad matches or go to connected 
		 */
	   
		if(sk->state==TCP_SYN_SENT)
		{
			/* Crossed SYN or previous junk segment */
			if(th->ack)
			{
				/* We got an ack, but it's not a good ack */
				if(!tcp_ack(sk,th,saddr,len))
				{
					/* Reset the ack - its an ack from a 
					   different connection  [ th->rst is checked in tcp_reset()] */
					tcp_statistics.TcpAttemptFails++;
					tcp_reset(daddr, saddr, th,
						sk->prot, opt,dev,sk->ip_tos,sk->ip_ttl);
					kfree_skb(skb, FREE_READ);
					release_sock(sk);
					return(0);
				}
				if(th->rst)
					return tcp_std_reset(sk,skb);
				if(!th->syn)
				{
					/* A valid ack from a different connection
					   start. Shouldn't happen but cover it */
					kfree_skb(skb, FREE_READ);
					release_sock(sk);
					return 0;
				}
				/*
				 *	Ok.. it's good. Set up sequence numbers and
				 *	move to established.
				 */
				syn_ok=1;	/* Don't reset this connection for the syn */
				sk->acked_seq=th->seq+1;
				sk->fin_seq=th->seq;
				tcp_send_ack(sk->sent_seq,sk->acked_seq,sk,th,sk->daddr);
				tcp_set_state(sk, TCP_ESTABLISHED);
				tcp_options(sk,th);
				sk->dummy_th.dest=th->source;
				sk->copied_seq = sk->acked_seq;
				if(!sk->dead)
				{
					sk->state_change(sk);
					sock_wake_async(sk->socket, 0);
				}
				if(sk->max_window==0)
				{
					sk->max_window = 32;
					sk->mss = min(sk->max_window, sk->mtu);
				}
			}
			else
			{
				/* See if SYN's cross. Drop if boring */
				if(th->syn && !th->rst)
				{
					/* Crossed SYN's are fine - but talking to
					   yourself is right out... */
					if(sk->saddr==saddr && sk->daddr==daddr &&
						sk->dummy_th.source==th->source &&
						sk->dummy_th.dest==th->dest)
					{
						tcp_statistics.TcpAttemptFails++;
						return tcp_std_reset(sk,skb);
					}
					tcp_set_state(sk,TCP_SYN_RECV);
					
					/*
					 *	FIXME:
					 *	Must send SYN|ACK here
					 */
				}		
				/* Discard junk segment */
				kfree_skb(skb, FREE_READ);
				release_sock(sk);
				return 0;
			}
			/*
			 *	SYN_RECV with data maybe.. drop through
			 */
			goto rfc_step6;
		}

	/*
	 *	BSD has a funny hack with TIME_WAIT and fast reuse of a port. There is
	 *	a more complex suggestion for fixing these reuse issues in RFC1644
	 *	but not yet ready for general use. Also see RFC1379.
	 */
	
#define BSD_TIME_WAIT
#ifdef BSD_TIME_WAIT
		if (sk->state == TCP_TIME_WAIT && th->syn && sk->dead && 
			after(th->seq, sk->acked_seq) && !th->rst)
		{
			long seq=sk->write_seq;
			if(sk->debug)
				printk("Doing a BSD time wait\n");
			tcp_statistics.TcpEstabResets++;	   
			sk->rmem_alloc -= skb->mem_len;
			skb->sk = NULL;
			sk->err=ECONNRESET;
			tcp_set_state(sk, TCP_CLOSE);
			sk->shutdown = SHUTDOWN_MASK;
			release_sock(sk);
			sk=get_sock(&tcp_prot, th->dest, saddr, th->source, daddr);
			if (sk && sk->state==TCP_LISTEN)
			{
				sk->inuse=1;
				skb->sk = sk;
				sk->rmem_alloc += skb->mem_len;
				tcp_conn_request(sk, skb, daddr, saddr,opt, dev,seq+128000);
				release_sock(sk);
				return 0;
			}
			kfree_skb(skb, FREE_READ);
			return 0;
		}
#endif	
	}

	/*
	 *	We are now in normal data flow (see the step list in the RFC)
	 *	Note most of these are inline now. I'll inline the lot when
	 *	I have time to test it hard and look at what gcc outputs 
	 */
	
	if(!tcp_sequence(sk,th,len,opt,saddr,dev))
	{
		kfree_skb(skb, FREE_READ);
		release_sock(sk);
		return 0;
	}

	if(th->rst)
		return tcp_std_reset(sk,skb);
	
	/*
	 *	!syn_ok is effectively the state test in RFC793.
	 */
	 
	if(th->syn && !syn_ok)
	{
		tcp_reset(daddr,saddr,th, &tcp_prot, opt, dev, skb->ip_hdr->tos, 255);
		return tcp_std_reset(sk,skb);	
	}

	/*
	 *	Process the ACK
	 */
	 

	if(th->ack && !tcp_ack(sk,th,saddr,len))
	{
		/*
		 *	Our three way handshake failed.
		 */
		 
		if(sk->state==TCP_SYN_RECV)
		{
			tcp_reset(daddr, saddr, th,sk->prot, opt, dev,sk->ip_tos,sk->ip_ttl);
		}
		kfree_skb(skb, FREE_READ);
		release_sock(sk);
		return 0;
	}
	
rfc_step6:		/* I'll clean this up later */

	/*
	 *	Process urgent data
	 */
	 	
	if(tcp_urg(sk, th, saddr, len))
	{
		kfree_skb(skb, FREE_READ);
		release_sock(sk);
		return 0;
	}
	
	
	/*
	 *	Process the encapsulated data
	 */
	
	if(tcp_data(skb,sk, saddr, len))
	{
		kfree_skb(skb, FREE_READ);
		release_sock(sk);
		return 0;
	}

	/*
	 *	And done
	 */	
	
	release_sock(sk);
	return 0;
}

4、應用層

當用戶需要接收資料時,首先根據檔案描述符inode得到socket結構和sock結構,然後從sock結構中指向的佇列recieve_queue中讀取資料包,將資料包COPY到使用者空間緩衝區。資料就完整的從硬體中傳輸到使用者空間。這樣也完成了一次完整的從下到上的傳輸。