[Python] Hands-On: A Baidu Tieba Crawler

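A web crawler (web spider) is a program that automatically fetches web pages. It downloads pages from the World Wide Web on behalf of a search engine and is one of a search engine's essential components. Six months ago I started working on search engine development with Lucene, which is how my web-crawling journey began. Back then I worked in a pure Java environment and ran a full crawl of several million posts on one Baidu Tieba forum. At that time I knew nothing about Python. Today I studied 《Python基礎教程：第2版·修訂版》 together with some related videos from 極客學院 (Jikexueyuan), and this article is the result.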

Keywords: web crawler, multithreading, requests, XPath, lxml

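1. Environment

Development tools and platform: Python 2.7, Windows 7, PyCharm

Third-party Python package downloads: http://www.lfd.uci.edu/~gohlke/pythonlibs/. This site hosts almost every Python package, including the requests and lxml packages used in this article. Installation is covered in the next section. I also recommend a JSON formatting tool: HiJson.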

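2. How do you install the requests and lxml packages on Windows?

From the site above, pick the build that matches your setup (use Ctrl+F to search). Once the download finishes:

① Change the file extension from ".whl" to ".zip";

② Unzip it and copy the folder whose name does not end in "-info" into the Lib directory of your Python installation (on my machine, "D:\AppInstall\Python27\Lib"). If pip is available, running pip install on the downloaded .whl file does the same job without any manual copying.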

3. A brief introduction to requests

HTTP for humans.
All the cool kids are doing it. Requests is one of the most downloaded Python packages of all time.

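Those are two lines quoted from the official site: ① HTTP for humans; ② one of the most downloaded third-party packages. requests really is that friendly. For example, fetching a page's source takes only a few lines:

GET: requests.get(url, params=None, **kwargs)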

import requests
html = requests.get('http://tieba.baidu.com/f?ie=utf-8&kw=python')
print (html.text)
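
The params argument lets requests assemble the query string for you. As a rough offline sketch of what ends up in the URL, the snippet below builds the same Tieba address with the standard library's urlencode. build_forum_url is a hypothetical helper (it is not part of requests), and the try/except import is only there so the sketch runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_forum_url(keyword, encoding='utf-8'):
    """Hypothetical helper: build a Tieba forum URL from a keyword."""
    # A list of pairs keeps the parameter order fixed.
    query = urlencode([('ie', encoding), ('kw', keyword)])
    return 'http://tieba.baidu.com/f?' + query


print(build_forum_url('python'))
```

With requests itself, the equivalent call would be requests.get('http://tieba.baidu.com/f', params={'ie': 'utf-8', 'kw': 'python'}).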

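Sometimes, of course, we also need a bit of disguise:

The User-Agent disguise: in Chrome, Inspect Element -> Network -> refresh the page -> select any request -> Headers, then copy the User-Agent value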

import requests
import re

url = 'http://tieba.baidu.com/f?ie=utf-8&kw=python'
ua = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
# Send the GET request with the custom header
html = requests.get(url,headers=ua)
html.encoding = 'utf-8'
# Extract the thread titles with a regular expression
title = re.findall('class="j_th_tit ">(.*?)</a>',html.text,re.S)

for each in title:
    print(each)
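
To see what that regular expression does without touching the network, here is a small sketch that runs the same pattern over a made-up HTML fragment. SAMPLE_HTML and extract_titles are assumptions for illustration only; the real Tieba markup may differ and can change at any time.

```python
import re

# Made-up fragment that imitates two thread links on a forum page.
SAMPLE_HTML = (
    '<a href="/p/1" class="j_th_tit ">First thread</a>\n'
    '<a href="/p/2" class="j_th_tit ">Second\nthread</a>'
)


def extract_titles(html_text):
    """Return the text between class="j_th_tit "> and the closing </a>."""
    # re.S lets "." match newlines, so titles that span lines are still captured.
    return re.findall('class="j_th_tit ">(.*?)</a>', html_text, re.S)


titles = extract_titles(SAMPLE_HTML)
for each in titles:
    print(each)
```

The non-greedy (.*?) matters here: a greedy (.*) would run from the first title to the last </a> in the page.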

GET is not a cure-all. Some pages only respond after we submit data to them:

POST: requests.post(url, data=None, json=None, **kwargs)

# Build the POST form data as a dict
node = {
    'enter':'true',
    'page':'3'
}
# Send the POST request
html = requests.post(url,data=node)
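
When data is a dict, requests form-encodes it into the request body. The sketch below reproduces that encoding offline with the standard library, just to show what the server receives. The variable form_body is an illustrative name, and sorting the pairs is my own choice to keep the output stable across Python versions.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

node = {
    'enter': 'true',
    'page': '3',
}

# Sort the pairs so the result does not depend on dict ordering.
form_body = urlencode(sorted(node.items()))
print(form_body)
```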

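4. XPath tutorial

XPath is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document.
For details, see the w3school tutorial: http://www.w3school.com.cn/xpath/index.asp

XPath frees us from complicated regular expressions, and Chrome provides the tool directly: right-click an element in the Elements panel and choose Copy XPath.

Building the Tieba XPath: start from the XPath that Chrome gives you and adapt it to match the Tieba page structure.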

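# Build the XPath that matches each floor (one post per floor)
content_field = selector.xpath('//div[@class="l_post j_l_post l_post_bright  "]')
# "each" is one element of content_field.
# Parse the JSON stored in its data-field attribute into a dict.
# lxml has already decoded the &quot; entities, so no quote stripping is needed.
data_field = json.loads(each.xpath('@data-field')[0])
# Reply content
content = each.xpath('div[@class="d_post_content_main"]/div/cc/div[@class="d_post_content j_d_post_content  clearfix"]/text()')[0]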
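
The same idea can be tried offline. The sketch below is an assumption-laden stand-in: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and ElementTree supports only a subset of XPath (no @attr or text() steps, so attributes and text are read with .get() and .text). SAMPLE_MARKUP, the shortened class names, and parse_posts are all made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment that imitates one Tieba post.
# The data-field attribute carries JSON with its quotes written as &quot; entities.
SAMPLE_MARKUP = (
    '<div id="thread">'
    '<div class="l_post" data-field="{&quot;author&quot;: {&quot;user_name&quot;: &quot;alice&quot;},'
    ' &quot;content&quot;: {&quot;date&quot;: &quot;2016-12-01 10:00&quot;}}">'
    '<div class="d_post_content">Hello Tieba</div>'
    '</div>'
    '</div>'
)


def parse_posts(markup):
    """Return one dict per post found in the markup."""
    root = ET.fromstring(markup)
    posts = []
    for each in root.findall(".//div[@class='l_post']"):
        # The parser has already turned &quot; back into ", so this is plain JSON.
        data_field = json.loads(each.get('data-field'))
        body = each.find("div[@class='d_post_content']")
        posts.append({
            'user_name': data_field['author']['user_name'],
            'time': data_field['content']['date'],
            'content': body.text.strip(),
        })
    return posts


posts = parse_posts(SAMPLE_MARKUP)
print(posts)
```

Real pages are rarely well-formed XML, which is why the article parses them with lxml's etree.HTML rather than a strict XML parser.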

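A few caveats. Python 3 is not backward compatible with Python 2, so you may run into the following problems:

from lxml import etree fails to import. Fixes: ① try a different lxml build; ② switch to Python 2.7.
If you skip requests and use from urllib import urlopen; urllib.urlopen(tempUrl), Python 3 raises an error. Fix: import urllib.request; urllib.request.urlopen(tempUrl)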
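
If a script has to run under both interpreters, one common pattern is to try the Python 3 import first and fall back to the Python 2 one. This is a minimal sketch; fetch is a hypothetical helper name, and it is defined here but not called, so nothing is downloaded.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2


def fetch(url):
    """Return the raw bytes of url; the caller decides how to decode them."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()
```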

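5. Multithreading in Python

This is especially easy to implement. The main steps are:

① Import: from multiprocessing.dummy import Pool as ThreadPool

② Create the thread pool (this machine has a 2-core CPU): pool = ThreadPool(2)

③ Run the work concurrently with map: pool.map(spider, page); pool.close(); pool.join()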
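
The three steps above can be sketched end to end with a stand-in worker. fake_spider is a hypothetical placeholder for the real spider function: it only parses the page number out of the URL, so the example runs without network access.

```python
from multiprocessing.dummy import Pool as ThreadPool


def fake_spider(url):
    """Stand-in for spider(): return the page number encoded in the URL."""
    return int(url.rsplit('=', 1)[1])


page = ['http://tieba.baidu.com/p/4880125218?pn=' + str(i) for i in range(1, 6)]

pool = ThreadPool(2)                    # two worker threads
results = pool.map(fake_spider, page)   # blocks until every URL is processed
pool.close()                            # no more work will be submitted
pool.join()                             # wait for the workers to exit

print(results)
```

pool.map returns results in the same order as the input list, even though the threads may finish in a different order.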

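6. The Baidu Tieba crawler

With the pieces above, a Tieba crawler is almost ready to write. Each post's data-field attribute holds JSON, and the program below reads author.user_name and content.date from it.

You can pull more fields out of it to suit your own needs.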

#-*-coding:utf8-*-
from multiprocessing.dummy import Pool as ThreadPool
from lxml import etree
import requests
import json
import sys
# Force the default encoding to UTF-8 (Python 2 only); note the coding declaration at the top
reload(sys)
sys.setdefaultencoding('utf-8')

# Print to the console
def showContent(item):
    print(u'Reply time: '+str(item['time'])+'\n')
    print(u'Reply content: '+unicode(item['content'])+'\n')
    print(u'Author: '+item['user_name']+'\n\n')

# Save to file
def saveFile(item):
    # The u'' prefix marks a Unicode string
    f.write(u'Reply time: '+str(item['time'])+'\n')
    f.write(u'Reply content: '+unicode(item['content'])+'\n')
    f.write(u'Author: '+item['user_name']+'\n\n')

# The crawler
def spider(url):
    # Fetch the page source
    html = requests.get(url)
    selector = etree.HTML(html.text)
    # Build the XPath that matches each floor (one post per floor)
    content_field = selector.xpath('//div[@class="l_post j_l_post l_post_bright  "]')
    item = {}
    for each in content_field:
        # Parse the JSON in data-field into a dict.
        # lxml has already decoded the &quot; entities, so no quote stripping is needed.
        data_field = json.loads(each.xpath('@data-field')[0])
        # User name
        user_name = data_field['author']['user_name']
        # Reply content
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/div[@class="d_post_content j_d_post_content  clearfix"]/text()')[0]
        # Reply time
        time = data_field['content']['date']
        # Pack the fields into item
        item['user_name'] = user_name
        item['content'] = content
        item['time'] = time

        showContent(item)
        saveFile(item)

if __name__ == '__main__':
    # Thread pool sized for a 2-core CPU
    pool = ThreadPool(2)
    # r'' avoids escape processing; open in append mode, creating the file if it does not exist
    f = open(r'D:\content.txt','a')
    page = []
    for i in range(1,21):
        newpage = 'http://tieba.baidu.com/p/4880125218?pn='+str(i)
        page.append(newpage)
    # Multithreading via map: map(crawl function, list of URLs)
    results = pool.map(spider,page)
    pool.close()
    pool.join()
    f.close()  # Close the file
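
The program above shares one file handle across the worker threads. One possible refinement, offered as a sketch rather than as part of the original program, is to write each record in a single call guarded by a lock so that records from different threads cannot interleave. format_record, save_item, run, and the sample items are all hypothetical names and data, and the output goes to a temporary directory instead of D:\content.txt.

```python
import io
import os
import tempfile
import threading
from multiprocessing.dummy import Pool as ThreadPool

write_lock = threading.Lock()


def format_record(item):
    """Render one post as a three-line text record followed by a blank line."""
    return (u'Reply time: {time}\n'
            u'Reply content: {content}\n'
            u'Author: {user_name}\n\n').format(**item)


def save_item(out, item):
    """Write one whole record while holding the lock."""
    record = format_record(item)
    with write_lock:
        out.write(record)


def run(path, items):
    """Save every item to path using two worker threads."""
    with io.open(path, 'a', encoding='utf-8') as out:
        pool = ThreadPool(2)
        try:
            pool.map(lambda item: save_item(out, item), items)
        finally:
            pool.close()
            pool.join()


# Made-up items standing in for what spider() would extract.
items = [{'time': u'2016-12-01 10:0%d' % i,
          'content': u'post %d' % i,
          'user_name': u'user%d' % i} for i in range(5)]

path = os.path.join(tempfile.mkdtemp(), 'content.txt')
run(path, items)

with io.open(path, encoding='utf-8') as saved:
    text = saved.read()
print(text)
```

Opening the file with io.open and an explicit encoding also avoids the need for sys.setdefaultencoding, and it behaves the same way on Python 2 and Python 3.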
