A simple fix for httplib.IncompleteRead() in a Python 3 request crawler

阿新 • Published: 2019-01-19

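Cause

In a crawler that fetches pages in a loop, an httplib.IncompleteRead() error (raised as http.client.IncompleteRead in Python 3) shows up at random.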

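Analysis

After reading through a fair amount of material, I learned that the error is caused by an incomplete chunked transfer encoding. So how do we deal with it? By the time the error is raised we have in fact already received the data, but http_client believes the response has not finished, which is why it raises. The full analysis is laid out in detail in the linked blog post. (Link to the post)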
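
To make the failure mode concrete, here is a minimal sketch that reproduces it without a network. It feeds a hand-written chunked response, with the terminating zero-length chunk missing, to `http.client.HTTPResponse`. The `FakeSocket` class, the `read_truncated` helper, and the sample payload are made up for illustration and are not from the original post.

```python
import http.client
import io


class FakeSocket:
    """Hypothetical stand-in for a socket; HTTPResponse only calls makefile()."""

    def __init__(self, raw):
        self._raw = raw

    def makefile(self, *args, **kwargs):
        return io.BytesIO(self._raw)


# Two complete chunks, but the terminating "0\r\n\r\n" chunk never arrives.
RAW_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"6\r\nhello \r\n"
    b"5\r\nworld\r\n"
)


def read_truncated(raw=RAW_RESPONSE):
    """Parse raw bytes as an HTTP response and return (body, completed)."""
    resp = http.client.HTTPResponse(FakeSocket(raw))
    resp.begin()  # parse the status line and headers
    try:
        return resp.read(), True
    except http.client.IncompleteRead as err:
        # err.partial holds the bytes that were read before the stream ended
        return err.partial, False


if __name__ == "__main__":
    print(read_truncated())
```

In this sketch the payload has fully arrived, yet the read still fails because the end-of-body marker is missing. That is the situation described above.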

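Handling

The project uses Request everywhere, so switching libraries would be inconvenient, and the files involved are small. I therefore went with the simplest possible fix (source of the approach). Be aware that this approach can lose data, particularly with large files.
Catch the error yourself.
Read the content from the exception with e.partial.decode('utf-8').
Example code: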

# Needs: import gzip, json; from io import BytesIO; import http.client as http_client
def __init__(self, e, uri, format, uriparts):
        self.e = e
        self.uri = uri
        self.format = format
        self.uriparts = uriparts
        try:
            data = self.e.fp.read()
        except http_client.IncompleteRead as err:
            # The chunked body ended early: keep whatever was received
            data = err.partial
        # Decompress the body if the server sent it gzip-encoded
        if self.e.headers.get('Content-Encoding') == 'gzip':
            buf = BytesIO(data)
            f = gzip.GzipFile(fileobj=buf)
            data = f.read()
        if len(data) == 0:
            data = {}
        else:
            data = data.decode('utf-8')
            if self.format == 'json':
                try:
                    data = json.loads(data)
                except ValueError:
                    # Not valid JSON: leave the decoded text as-is
                    pass
        self.response_data = data
        super(FanfouHTTPError, self).__init__(str(self))
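
One caveat with the gzip step above: `gzip.GzipFile.read()` raises `EOFError` when the compressed stream is cut short, which is likely if the body came from `e.partial`. The following is a minimal sketch of a more tolerant variant, assuming you would rather keep whatever decompresses cleanly. The helper names `read_all` and `gunzip_tolerant` are hypothetical and not part of the original code.

```python
import http.client
import zlib


def read_all(resp):
    """Return (body, complete), falling back to the partial body on IncompleteRead."""
    try:
        return resp.read(), True
    except http.client.IncompleteRead as err:
        return err.partial, False


def gunzip_tolerant(data):
    """Decompress gzip bytes, returning the decodable prefix if the data is truncated."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    # Unlike GzipFile.read(), a streaming decompressor does not raise when the
    # input simply stops early; it returns what it has decoded so far.
    return decomp.decompress(data)
```

The `complete` flag lets the caller decide whether a truncated body is acceptable or whether the request should be repeated.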

The code used in this article

    # Needs: import json, urllib.parse; import urllib.request as req;
    # import http.client as http_client (aliases assumed from their usage below)
    def get_cartoon_json(self, categories):
        for category in categories:
            hasMore = True
            index = 1
            while hasMore:
                try:
                    link = self.__baseUrl + '/' + category['name'] + '/' + str(index)
                    # Percent-encode the Chinese characters in the URL; keeping ':'
                    # safe replaces the original quote-then-restore-%3A step
                    link = urllib.parse.quote(link, safe='/:')
                    print(link)
                    # Open the link
                    conn = req.urlopen(link)
                    # Read the page content and decode it as UTF-8
                    content = conn.read().decode('utf-8')

                except http_client.IncompleteRead as e:
                    # Handle the chunked read error; everything here is JSON,
                    # so the gzip check is skipped
                    content = e.partial
                    content = content.decode('utf-8')
                    if len(content) == 0:
                        content = '{}'

                print(content)
                jsonUtils = BaiDuCartoonUtils(self.__filePath)
                jsonUtils.write_json_to_jsonFile(content, self.__filePath + category['name'] + '/list/',
                                                 'data' + str(index) + '.json')

                # Stop paging once the API reports that there is no more data
                payload = json.loads(content)
                if payload['data']['hasMore'] != 1:
                    hasMore = False
                index = index + 1
                # jsonUtils.write_cartoon_to_json('/api/query/'+self.__categoryName, content)
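
Because the partial body may be cut off mid-document, `json.loads` can still fail on it. One way to guard against that, sketched below under the assumption that re-requesting the page is acceptable, is to accept the partial data only if it parses and to retry otherwise. The `fetch_json` function and its `fetch`, `retries`, and `delay` parameters are hypothetical names introduced for this example.

```python
import http.client
import json
import time


def fetch_json(fetch, retries=3, delay=1.0):
    """Call fetch() until its bytes parse as JSON or the retries run out.

    fetch is any zero-argument callable that returns the response body as bytes.
    """
    last_error = None
    for attempt in range(retries):
        try:
            raw = fetch()
        except http.client.IncompleteRead as err:
            # Keep what arrived; it may still be a complete JSON document
            raw = err.partial
        try:
            return json.loads(raw.decode('utf-8'))
        except ValueError as err:
            # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses
            last_error = err
            if attempt < retries - 1:
                time.sleep(delay)
    raise RuntimeError('no valid JSON after %d attempts' % retries) from last_error
```

In the loop above this could be called as `fetch_json(lambda: req.urlopen(link).read())`, which would replace the separate decode and `json.loads` steps.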

If you spot any problems, corrections are welcome.
