Character Encoding Problems When Scraping Chinese Websites with requests

阿新 • Published: 2019-01-31

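Over the past couple of days I was using requests to scrape data from a few portal sites and noticed that the Chinese text I printed or saved to a file came out as strange Unicode-looking codes rather than readable Chinese. That was annoying enough that I went looking online for an explanation. Once the cause is understood, the fix turns out to be simple.
Take scraping ifeng.com (鳳凰網) as an example: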

import requests

response = requests.get("http://www.ifeng.com/")

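As we know, response has two properties, text and content. Both hold the response body, but they are not the same thing. The docstrings spell out the difference.

The docstring for text reads: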

Content of the response, in unicode. If Response.encoding is None, encoding will be guessed using ``chardet``. The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set ``r.encoding`` appropriately before accessing this property.

And the docstring for content reads:

Content of the response, in bytes.

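In other words, text is the body decoded into a Unicode string, while content is the raw bytes. Note that the encoding requests uses to build text is taken only from the HTTP response headers, specifically the charset parameter of the Content-Type header. It is not taken from the page itself. When the header says text/html but gives no charset, requests falls back to ISO-8859-1, which is why Chinese pages so often come out garbled. The encoding the page actually uses is usually declared in a <meta> tag under <head> in the page source, for example: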

<meta http-equiv="content-type" content="text/html;charset=utf-8">

So when the text property returns garbled HTML, we can set response.encoding to the charset declared in the page source before reading text, and the printed output will no longer look strange.

import requests

response = requests.get("http://www.ifeng.com/")
response.encoding = "utf-8"  # manually set the character encoding to utf-8
print(response.text)
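
Instead of hard-coding "utf-8", the charset could also be read from the page's own <meta> tag. The following is only a minimal sketch using the standard library: the helper name charset_from_meta, the scan_limit parameter, and the sample page bytes are all made up for illustration and are not part of requests.

```python
import re

# Matches charset=... inside a <meta> tag, covering both the HTML5 form
# (<meta charset="utf-8">) and the older http-equiv form shown earlier.
META_CHARSET = re.compile(rb"""<meta[^>]+charset=["']?([\w-]+)""", re.IGNORECASE)


def charset_from_meta(html_bytes, scan_limit=2048):
    """Return the charset declared in a <meta> tag, or None if there is none.

    Hypothetical helper: only the first scan_limit bytes are searched,
    on the assumption that the declaration sits near the top of <head>.
    """
    match = META_CHARSET.search(html_bytes[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


# Sample bytes standing in for response.content (made up for this sketch).
page = (
    '<html><head><meta http-equiv="content-type" '
    'content="text/html;charset=utf-8"></head>'
    "<body>鳳凰網</body></html>"
).encode("utf-8")

declared = charset_from_meta(page)
print(declared)  # utf-8
text = page.decode(declared)  # the readable page text
```

With a real response, the same idea would be to assign the result to response.encoding before reading response.text. When no <meta> charset is found, response.apparent_encoding (the encoding requests guesses from the bytes) is a reasonable thing to try instead.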

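Interested readers can try leaving the encoding unset, or setting a different one, to see the effect. If anything is unclear, feel free to leave a comment!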
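
The effect of a wrong encoding can be reproduced without any network access. This small sketch assumes the server sent UTF-8 bytes and that they were decoded as ISO-8859-1; the sample string is made up and simply stands in for response.content.

```python
# Bytes as a server would send them (stand-in for response.content).
raw = "鳳凰網新聞".encode("utf-8")

wrong = raw.decode("iso-8859-1")  # decoded with the wrong charset: garbled
right = raw.decode("utf-8")       # decoded with the declared charset: readable

print(wrong == right)  # False

# ISO-8859-1 maps every byte to exactly one character, so the garbled
# string still holds all the original bytes and can be repaired.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired == right)  # True
```

The garbled string has one character per byte, which is why a short Chinese phrase turns into a much longer run of accented Latin characters.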

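In addition, when we open a text file (not a binary file; mind the difference) with Python's built-in open(), the file is encoded and decoded with a platform-dependent default encoding. On Windows this is typically the system's ANSI code page (for example cp936/GBK on Simplified Chinese systems) rather than ASCII, while on Linux it is usually utf-8. We can, of course, specify the encoding explicitly through the encoding argument of open(), for example: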

open(fileName, "r", encoding="utf-8")
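
To make this concrete, here is a short sketch that prints the platform default and then saves and reloads scraped text with an explicit encoding. The file name page.html and the sample HTML are placeholders chosen for the example.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when no encoding argument is given.
print(locale.getpreferredencoding(False))

html = "<title>鳳凰網</title>"  # stand-in for response.text
path = os.path.join(tempfile.mkdtemp(), "page.html")

# Write and read with the same explicit encoding so the result does not
# depend on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

with open(path, "r", encoding="utf-8") as f:
    restored = f.read()
```

Passing encoding on both the write and the read keeps the file portable between Windows and Linux.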

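Those are a few small problems I ran into when scraping Chinese web pages with Python. I am writing them down in the hope that they help both me and others.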
