
Adding gzip compression and decompression support when fetching web pages via urllib2.urlopen in Python

Fetching web page content in Python was already implemented earlier; the existing code is:

```python
import urllib
import urllib2

# gConst is a global config dict defined elsewhere in the original program

#------------------------------------------------------------------------------
# get response from url
# note: if you have already used cookiejar, then it will automatically be used
# here while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}) :
    # make sure url is str, not unicode, otherwise urllib2.urlopen will error
    url = str(url);

    if (postDict) :
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else :
        req = urllib2.Request(url);

    if (headerDict) :
        print "added header:", headerDict;
        for key in headerDict.keys() :
            req.add_header(key, headerDict[key]);

    req.add_header('User-Agent', gConst['userAgentIE9']);
    req.add_header('Cache-Control', 'no-cache');
    req.add_header('Accept', '*/*');
    #req.add_header('Accept-Encoding', 'gzip, deflate');
    req.add_header('Connection', 'Keep-Alive');

    resp = urllib2.urlopen(req);
    return resp;

#------------------------------------------------------------------------------
# get response html==body from url
def getUrlRespHtml(url, postDict={}, headerDict={}) :
    resp = getUrlResponse(url, postDict, headerDict);
    respHtml = resp.read();
    return respHtml;
```

This code, however, does not support compressed html or decompressing it.

Now I want to add support for this compression and decompression.

I had already implemented the corresponding functionality in C# before and understood the logic involved, so the work here is mainly figuring out how to implement it in Python; the internal mechanism was already basically clear.

【Solution process】

1. I had briefly searched for related posts before, but had no time to work it out then.

Now I know: first, a gzip header has to be added to the HTTP request. The Python code is:

req.add_header('Accept-Encoding', 'gzip, deflate');

The data obtained by read() from the returned HTTP response is then the gzip-compressed data.
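For illustration, a minimal sketch of just this step (the URL is a placeholder, not from the original post):

```python
import urllib2

req = urllib2.Request("http://www.example.com")  # placeholder URL
req.add_header('Accept-Encoding', 'gzip, deflate')  # tell the server we accept gzip
resp = urllib2.urlopen(req)
compressedHtml = resp.read()  # raw bytes; gzip-compressed if the server used gzip
```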

The next step was to figure out how to decompress it.

2. First I looked up gzip and found that the official Python documentation says:

12.2. gzip — Support for gzip files

This module provides a simple interface to compress and decompress files just like the GNU programs gzip and gunzip would.

The data compression is provided by the zlib module.

That is, the gzip module compresses and decompresses files, while compressing and decompressing data is done with zlib.
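For example, the gzip module's file-oriented API looks roughly like this (a sketch; the path and data are made up):

```python
import gzip

# write and then read back a .gz file, like the gzip/gunzip programs would
gzFile = gzip.open('/tmp/example.txt.gz', 'wb')  # placeholder path
gzFile.write('some example data')
gzFile.close()

gzFile = gzip.open('/tmp/example.txt.gz', 'rb')
print gzFile.read()  # -> 'some example data'
gzFile.close()
```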

So I then looked at zlib:

zlib.decompress(string[, wbits[, bufsize]])

Decompresses the data in string, returning a string containing the uncompressed data. The wbits parameter controls the size of the window buffer, and is discussed further below. If bufsize is given, it is used as the initial size of the output buffer. Raises the error exception if any error occurs.

The absolute value of wbits is the base two logarithm of the size of the history buffer (the “window size”) used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. When decompressing a stream, wbits must not be smaller than the size originally used to compress the stream; using a too-small value will result in an exception. The default value is therefore the highest value, 15. When wbits is negative, the standard gzip header is suppressed.

bufsize is the initial size of the buffer used to hold decompressed data. If more space is required, the buffer size will be increased as needed, so you don’t have to get this value exactly right; tuning it will only save a few calls to malloc(). The default size is 16384.

Then I called zlib.decompress directly in the program, which raised an error; I later solved it (the detailed process is written up in a separate post).

With that, the returned html could be decompressed.
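As far as I understand it, the key is the wbits argument: with the default value, zlib.decompress expects a zlib stream and fails on gzip data (with something like "Error -3 ... incorrect header check"), while 16+zlib.MAX_WBITS tells it to expect and skip the gzip header and trailer. A minimal sketch, with made-up sample data:

```python
import gzip
import zlib
import StringIO

# build some gzip-format data in memory, like a gzip-encoded response body
buf = StringIO.StringIO()
gzFile = gzip.GzipFile(fileobj=buf, mode="wb")
gzFile.write("some example html content")
gzFile.close()
gzippedData = buf.getvalue()

# default wbits expects a zlib stream -> fails on gzip-format data
try:
    zlib.decompress(gzippedData)
except zlib.error, e:
    print "plain decompress failed:", e

# 16+MAX_WBITS makes zlib expect (and skip) the gzip header and trailer
print zlib.decompress(gzippedData, 16 + zlib.MAX_WBITS)
```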

3. I also referred to another post,

from which I learned to check whether the returned HTTP response contains Content-Encoding: gzip before deciding whether to call zlib to decompress.
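A sketch of that check, wrapped in a helper (decodeRespBody is my own name, not from the referenced post); resp is the object returned by urllib2.urlopen and respHtml is the body already read from it:

```python
import zlib

def decodeRespBody(resp, respHtml):
    respInfo = resp.info()  # httplib.HTTPMessage holding the response headers
    # only decompress when the server really returned gzipped data
    if ("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip"):
        respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS)
    return respHtml
```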

4. Finally, the complete code was implemented, as follows:

```python
import urllib
import urllib2
import zlib

# gConst is a global config dict defined elsewhere in the original program

#------------------------------------------------------------------------------
# get response from url
# note: if you have already used cookiejar, then it will automatically be used
# here while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False) :
    # make sure url is str, not unicode, otherwise urllib2.urlopen will error
    url = str(url);

    if (postDict) :
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else :
        req = urllib2.Request(url);

    defHeaderDict = {
        'User-Agent'    : gConst['userAgentIE9'],
        'Cache-Control' : 'no-cache',
        'Accept'        : '*/*',
        'Connection'    : 'Keep-Alive',
    };

    # add default headers firstly
    for eachDefHd in defHeaderDict.keys() :
        #print "add default header: %s=%s"%(eachDefHd, defHeaderDict[eachDefHd]);
        req.add_header(eachDefHd, defHeaderDict[eachDefHd]);

    if (useGzip) :
        #print "use gzip for", url;
        req.add_header('Accept-Encoding', 'gzip, deflate');

    # add customized header later -> allow overwrite default header
    if (headerDict) :
        #print "added header:", headerDict;
        for key in headerDict.keys() :
            req.add_header(key, headerDict[key]);

    if (timeout > 0) :
        # set timeout value if necessary
        resp = urllib2.urlopen(req, timeout=timeout);
    else :
        resp = urllib2.urlopen(req);

    return resp;

#------------------------------------------------------------------------------
# get response html==body from url
def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True) :
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip);
    respHtml = resp.read();

    if (useGzip) :
        #print "---before unzip, len(respHtml)=", len(respHtml);
        respInfo = resp.info();

        # a typical response header looks like:
        # Server: nginx/1.0.8
        # Date: Sun, 08 Apr 2012 12:30:35 GMT
        # Content-Type: text/html
        # Transfer-Encoding: chunked
        # Connection: close
        # Vary: Accept-Encoding
        # ...
        # Content-Encoding: gzip

        # sometimes the request asks for gzip,deflate but what is actually returned
        # is un-gzipped html -> the response info then does not include the above
        # "Content-Encoding: gzip" -> so only decode when it is indeed gzipped data
        if ("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip") :
            respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS);
            #print "+++ after unzip, len(respHtml)=", len(respHtml);

    return respHtml;
```

【Summary】

The main logic of adding gzip support to urllib2.urlopen in Python is:

1. Add the corresponding gzip header to the request:

req.add_header('Accept-Encoding', 'gzip, deflate');

2. After obtaining the returned html, decompress it with zlib:

respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS);

Before decompressing, first check whether the returned content really is gzipped data, i.e. whether the response contains "Content-Encoding: gzip", because it can happen that your HTTP request declares gzip support yet the server returns the original, non-gzipped html.
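Putting the two steps and this check together, a compact end-to-end sketch (getGzippedHtml and the URL are mine, not from the code above):

```python
import urllib2
import zlib

def getGzippedHtml(url):
    req = urllib2.Request(url)
    # step 1: advertise gzip support in the request
    req.add_header('Accept-Encoding', 'gzip, deflate')
    resp = urllib2.urlopen(req)
    respHtml = resp.read()
    # step 2: decompress, but only if the server really returned gzip data
    respInfo = resp.info()
    if ("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip"):
        respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS)
    return respHtml

print getGzippedHtml("http://www.example.com")  # placeholder URL
```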