
Adding gzip compression and decompression support when fetching web pages via urllib2.urlopen in Python

Fetching web page content in Python was already implemented earlier; the existing code is:

```python
import urllib
import urllib2

# gConst is a global config dict defined elsewhere in the original program

#------------------------------------------------------------------------------
# get response from url
# note: if you have already used cookiejar, then it will automatically be used
# here while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}) :
    # make sure url is str, not unicode, otherwise urllib2.urlopen will error
    url = str(url);

    if (postDict) :
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else :
        req = urllib2.Request(url);

    if (headerDict) :
        print "added header:", headerDict;
        for key in headerDict.keys() :
            req.add_header(key, headerDict[key]);

    req.add_header('User-Agent', gConst['userAgentIE9']);
    req.add_header('Cache-Control', 'no-cache');
    req.add_header('Accept', '*/*');
    #req.add_header('Accept-Encoding', 'gzip, deflate');
    req.add_header('Connection', 'Keep-Alive');

    resp = urllib2.urlopen(req);
    return resp;

#------------------------------------------------------------------------------
# get response html==body from url
def getUrlRespHtml(url, postDict={}, headerDict={}) :
    resp = getUrlResponse(url, postDict, headerDict);
    respHtml = resp.read();
    return respHtml;
```

This code, however, does not support compressed html or decompressing it.

Now I want to add support for this compression and decompression.

I had already implemented the corresponding functionality in C# before and understood the logic involved, so the work here is mainly figuring out how to implement it in Python; the internal mechanism was already basically clear.

【Solution process】

1. I had briefly searched for related posts before, but had no time to work it out then.

Now I know: first, a gzip header has to be added to the HTTP request. The Python code is:

req.add_header('Accept-Encoding', 'gzip, deflate');

The data obtained by read() from the returned HTTP response is then the gzip-compressed data.
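For illustration, a minimal sketch of just this step (the URL is a placeholder, not from the original post):

```python
import urllib2

req = urllib2.Request("http://www.example.com")  # placeholder URL
req.add_header('Accept-Encoding', 'gzip, deflate')  # tell the server we accept gzip
resp = urllib2.urlopen(req)
compressedHtml = resp.read()  # raw bytes; gzip-compressed if the server used gzip
```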

The next step was to figure out how to decompress it.

2. First I looked up gzip and found that the official Python documentation says:

12.2. gzip — Support for gzip files

This module provides a simple interface to compress and decompress files just like the GNU programs gzip and gunzip would.

The data compression is provided by the zlib module.

That is, the gzip module compresses and decompresses files, while compressing and decompressing data is done with zlib.
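For example, the gzip module's file-oriented API looks roughly like this (a sketch; the path and data are made up):

```python
import gzip

# write and then read back a .gz file, like the gzip/gunzip programs would
gzFile = gzip.open('/tmp/example.txt.gz', 'wb')  # placeholder path
gzFile.write('some example data')
gzFile.close()

gzFile = gzip.open('/tmp/example.txt.gz', 'rb')
print gzFile.read()  # -> 'some example data'
gzFile.close()
```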

So I then looked at zlib:

zlib.decompress(string[, wbits[, bufsize]])

Decompresses the data in string, returning a string containing the uncompressed data. The wbits parameter controls the size of the window buffer, and is discussed further below. If bufsize is given, it is used as the initial size of the output buffer. Raises the error exception if any error occurs.

The absolute value of wbits is the base two logarithm of the size of the history buffer (the “window size”) used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. When decompressing a stream, wbits must not be smaller than the size originally used to compress the stream; using a too-small value will result in an exception. The default value is therefore the highest value, 15. When wbits is negative, the standard gzip header is suppressed.

bufsize is the initial size of the buffer used to hold decompressed data. If more space is required, the buffer size will be increased as needed, so you don’t have to get this value exactly right; tuning it will only save a few calls to malloc(). The default size is 16384.

Then I called zlib.decompress directly in the program, which raised an error; I later solved it (the detailed process is written up in a separate post).

With that, the returned html could be decompressed.
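As far as I understand it, the key is the wbits argument: with the default value, zlib.decompress expects a zlib stream and fails on gzip data (with something like "Error -3 ... incorrect header check"), while 16+zlib.MAX_WBITS tells it to expect and skip the gzip header and trailer. A minimal sketch, with made-up sample data:

```python
import gzip
import zlib
import StringIO

# build some gzip-format data in memory, like a gzip-encoded response body
buf = StringIO.StringIO()
gzFile = gzip.GzipFile(fileobj=buf, mode="wb")
gzFile.write("some example html content")
gzFile.close()
gzippedData = buf.getvalue()

# default wbits expects a zlib stream -> fails on gzip-format data
try:
    zlib.decompress(gzippedData)
except zlib.error, e:
    print "plain decompress failed:", e

# 16+MAX_WBITS makes zlib expect (and skip) the gzip header and trailer
print zlib.decompress(gzippedData, 16 + zlib.MAX_WBITS)
```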

3. I also referred to another post,

from which I learned to check whether the returned HTTP response contains Content-Encoding: gzip before deciding whether to call zlib to decompress.
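A sketch of that check, wrapped in a helper (decodeRespBody is my own name, not from the referenced post); resp is the object returned by urllib2.urlopen and respHtml is the body already read from it:

```python
import zlib

def decodeRespBody(resp, respHtml):
    respInfo = resp.info()  # httplib.HTTPMessage holding the response headers
    # only decompress when the server really returned gzipped data
    if ("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip"):
        respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS)
    return respHtml
```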

4. Finally, the complete code was implemented, as follows:

```python
import urllib
import urllib2
import zlib

# gConst is a global config dict defined elsewhere in the original program

#------------------------------------------------------------------------------
# get response from url
# note: if you have already used cookiejar, then it will automatically be used
# here while using urllib2.Request
def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False) :
    # make sure url is str, not unicode, otherwise urllib2.urlopen will error
    url = str(url);

    if (postDict) :
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else :
        req = urllib2.Request(url);

    defHeaderDict = {
        'User-Agent'    : gConst['userAgentIE9'],
        'Cache-Control' : 'no-cache',
        'Accept'        : '*/*',
        'Connection'    : 'Keep-Alive',
    };

    # add default headers firstly
    for eachDefHd in defHeaderDict.keys() :
        #print "add default header: %s=%s"%(eachDefHd, defHeaderDict[eachDefHd]);
        req.add_header(eachDefHd, defHeaderDict[eachDefHd]);

    if (useGzip) :
        #print "use gzip for", url;
        req.add_header('Accept-Encoding', 'gzip, deflate');

    # add customized header later -> allow overwrite default header
    if (headerDict) :
        #print "added header:", headerDict;
        for key in headerDict.keys() :
            req.add_header(key, headerDict[key]);

    if (timeout > 0) :
        # set timeout value if necessary
        resp = urllib2.urlopen(req, timeout=timeout);
    else :
        resp = urllib2.urlopen(req);

    return resp;

#------------------------------------------------------------------------------
# get response html==body from url
def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True) :
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip);
    respHtml = resp.read();

    if (useGzip) :
        #print "---before unzip, len(respHtml)=", len(respHtml);
        respInfo = resp.info();

        # a typical response header looks like:
        # Server: nginx/1.0.8
        # Date: Sun, 08 Apr 2012 12:30:35 GMT
        # Content-Type: text/html
        # Transfer-Encoding: chunked
        # Connection: close
        # Vary: Accept-Encoding
        # ...
        # Content-Encoding: gzip

        # sometimes the request asks for gzip,deflate but what is actually returned
        # is un-gzipped html -> the response info then does not include the above
        # "Content-Encoding: gzip" -> so only decode when it is indeed gzipped data
        if ("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip") :
            respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS);
            #print "+++ after unzip, len(respHtml)=", len(respHtml);

    return respHtml;
```

【Summary】

The main logic of adding gzip support to urllib2.urlopen in Python is:

1. Add the corresponding gzip header to the request:

req.add_header('Accept-Encoding', 'gzip, deflate');

2. After obtaining the returned html, decompress it with zlib:

respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS);

Before decompressing, first check whether the returned content really is gzipped data, i.e. whether the response contains "Content-Encoding: gzip", because it can happen that your HTTP request declares gzip support yet the server returns the original, non-gzipped html.
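Putting the two steps and this check together, a compact end-to-end sketch (getGzippedHtml and the URL are mine, not from the code above):

```python
import urllib2
import zlib

def getGzippedHtml(url):
    req = urllib2.Request(url)
    # step 1: advertise gzip support in the request
    req.add_header('Accept-Encoding', 'gzip, deflate')
    resp = urllib2.urlopen(req)
    respHtml = resp.read()
    # step 2: decompress, but only if the server really returned gzip data
    respInfo = resp.info()
    if ("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip"):
        respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS)
    return respHtml

print getGzippedHtml("http://www.example.com")  # placeholder URL
```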