Python Crawler: Scraping Jokes from Qiushibaike (糗事百科)

阿新 • Published: 2017-05-19


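With some free time on my hands, I decided to learn how to write a web crawler in Python.

Before starting on crawlers proper, I briefly studied HTML and CSS. Understanding the basic structure of a web page made it much quicker to get going.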

1. Get the Qiushibaike URL

http://www.qiushibaike.com/hot/page/2/ (the trailing 2 selects page 2)
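As a small illustration of the URL pattern above, the sketch below builds the address for any page number. It is written for Python 3; the helper name hot_page_url and the rule that page numbers start at 1 are assumptions made for this example, not something the site documents.

```python
BASE_URL = 'http://www.qiushibaike.com/hot/page/'

def hot_page_url(page):
    """Return the URL of one page of the "hot" list (hypothetical helper)."""
    if page < 1:
        # Assumption: pages are numbered from 1.
        raise ValueError('page numbers start at 1')
    return BASE_URL + str(page) + '/'

# URLs for the first three pages.
urls = [hot_page_url(p) for p in range(1, 4)]
print(urls[1])  # http://www.qiushibaike.com/hot/page/2/
```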

2. Fetch the HTML page first

import urllib
import urllib2
import re
page = 2
url = 'http://www.qiushibaike.com/hot/page/' + str(page)  # URL of page 2
request = urllib2.Request(url)  # build the request
response = urllib2.urlopen(request)  # send it and receive the response
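The code above targets Python 2. In Python 3 the same Request / urlopen / read flow lives in urllib.request. The sketch below shows that flow under one loud assumption: instead of the real site it opens a data: URL, which carries its own body, so it runs without network access. The sample HTML is made up for the example.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption for this sketch: a data: URL stands in for the real page,
# e.g. 'http://www.qiushibaike.com/hot/page/2', so no network is needed.
url = 'data:text/html;charset=utf-8,' + quote('<h2>hello</h2>')

request = Request(url)                # build the request
with urlopen(request) as response:    # send it and receive the response
    html = response.read().decode('utf-8')

print(html)
```

Swapping the data: URL for the real page URL leaves the other lines unchanged.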

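Of course, this step can fail. The two main errors are HTTPError and URLError.

Possible causes of a URLError:

No network connection, i.e. the local machine cannot get online
The specific server cannot be reached
The server does not exist

Catching the exception: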

import urllib2

request = urllib2.Request('http://www.xxxxx.com')
try:
    urllib2.urlopen(request)
except urllib2.URLError, e:
    print e.reason  # why the request failed

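HTTPError is a subclass of URLError. Whenever you send a request with urlopen, the server answers with a response object that includes a numeric "status code". Some codes urllib2 handles for you: for example, if the response is a "redirect" that points to another address for the document, urllib2 follows it. Codes it cannot handle, such as 404 or 500, are raised as HTTPError. Common status codes:

200: The request succeeded. Handling: read the response body and process it.

202: The request was accepted, but processing has not finished. Handling: block and wait.

204: The server fulfilled the request but has no new information to return. A user agent need not update its document view. Handling: discard.

404: Not found. Handling: discard.

500: Internal server error. The server hit an unexpected condition that prevented it from fulfilling the request. This usually happens when the server-side code has a bug.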
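The list above can be sketched as a small lookup table. This is a Python 3 illustration only: the reason phrases come from the standard library's http.HTTPStatus, while the action labels ('process', 'wait', 'discard') and the helper name describe are invented here to mirror the handling suggested in the list.

```python
from http import HTTPStatus

# Suggested handling per status code, taken from the list above.
# The label strings are made up for this sketch.
ACTIONS = {
    200: 'process',
    202: 'wait',
    204: 'discard',
    404: 'discard',
}

def describe(code):
    """Return (standard reason phrase, suggested action) for a status code."""
    phrase = HTTPStatus(code).phrase
    # The list gives no handling for 500, so unknown codes map to 'unspecified'.
    return phrase, ACTIONS.get(code, 'unspecified')

for code in (200, 202, 204, 404, 500):
    print(code, *describe(code))
```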

Catching the exception:

import urllib2

req = urllib2.Request('http://blog.csdn.net/cqcre')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code    # numeric status code, e.g. 404
    print e.reason  # short description of the failure

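Note: because HTTPError is a subclass of URLError, an except clause for URLError also catches HTTPError. The HTTPError clause therefore has to come first, otherwise it would never run. The code above can be rewritten as: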

import urllib2

req = urllib2.Request('http://blog.csdn.net/cqcre')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:   # the more specific class first
    print e.code
except urllib2.URLError, e:    # any other URL-related failure
    print e.reason
else:
    print "OK"
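To see why the order of the except clauses matters, here is a Python 3 sketch (in Python 3 these classes live in urllib.error). It uses hand-built exception objects in place of real network failures, and the helper name classify is made up for the example.

```python
from urllib.error import HTTPError, URLError

def classify(exc):
    """Raise exc and report which except clause catches it."""
    try:
        raise exc
    except HTTPError as e:     # more specific class first
        return 'HTTPError %d' % e.code
    except URLError as e:      # every other URL-related failure
        return 'URLError %s' % e.reason

# Hand-built exceptions stand in for real failures (assumption of this sketch).
not_found = HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
no_network = URLError('network unreachable')

print(issubclass(HTTPError, URLError))  # True
print(classify(not_found))              # HTTPError 404
print(classify(no_network))             # URLError network unreachable
```

If the two clauses were swapped, the URLError clause would catch the 404 as well and the status code would never be reported.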

If you get no response, you may need to add a header so the request looks like it comes from a browser:

import urllib
import urllib2

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)   # attach the header
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    # HTTPError has both code and reason; a plain URLError has only reason
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
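The header can be checked before anything is sent. The Python 3 sketch below only builds the Request object and inspects it; it assumes nothing about the site's behavior and makes no connection.

```python
from urllib.request import Request

url = 'http://www.qiushibaike.com/hot/page/1'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# Building a Request does not open a connection; urlopen(request) would.
request = Request(url, headers={'User-Agent': user_agent})

print(request.full_url)
print(request.get_method())     # GET, because no data was attached
print(request.header_items())   # header names are stored capitalized
```

Request normalizes header names with str.capitalize(), so the header is stored under the key User-agent and has to be looked up as request.get_header('User-agent').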

3. Analyze the page to extract the jokes

[Figure: screenshot of the page's HTML source with each joke's <div> marked]

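As the figure above shows, the items marked with red check marks are individual jokes, each wrapped in <div class="article block untagged mb15" id="...">...</div>. Opening one of them, we want three pieces of information: the user name, the joke text, and the like count, underlined in red, blue, and black respectively. The parsing is done mainly with regular expressions.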

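Parse the user name. Regex: <div class="author clearfix">.*?<h2>(.*?)</h2> In the figure above the user name is 旖旎萌萌, which sits between <h2> and </h2>, so (.*?) stands in for it.
Parse the joke text. Regex: <div.*?span>(.*?)</span> Likewise, the text sits between <span> and </span>. Everything between <div and span> (including newlines) is covered by .*?.
Parse the like count. Regex: <div class="stats">.*?"number">(.*?)</i> Same idea: (.*?) stands in for 1520.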
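The three expressions above can be tried one at a time. The Python 3 sketch below runs them against a hypothetical HTML fragment written to match the description; the real page markup may differ, and the id value and the surrounding tags are invented for the example.

```python
import re

# Hypothetical fragment modelled on the description above.
SAMPLE = '''
<div class="article block untagged mb15" id="qiushi_tag_1">
<div class="author clearfix">
<a href="/users/1/"><h2>旖旎萌萌</h2></a>
</div>
<div class="content">
<span>First line of the joke.
Second line.</span>
</div>
<div class="stats">
<span class="stats-vote"><i class="number">1520</i> likes</span>
</div>
</div>
'''

# re.S lets . match newlines, which the multi-line fragment needs.
author = re.search(r'<div class="author clearfix">.*?<h2>(.*?)</h2>', SAMPLE, re.S).group(1)
text = re.search(r'<div.*?span>(.*?)</span>', SAMPLE, re.S).group(1)
likes = re.search(r'<div class="stats">.*?"number">(.*?)</i>', SAMPLE, re.S).group(1)

print(author, likes)
print(text)
```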

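Notes on the regular expressions (adapted from Cui Qingcai's blog):

1) .*? is a standard idiom. The . matches any character and the * repeats it any number of times; the trailing ? switches to non-greedy matching, so the match is as short as possible. We will use .*? a lot from here on.

2) (.*?) is a capture group. The pattern below contains three such groups. When iterating over the results, item[0] holds what the first (.*?) matched, item[1] what the second matched, and so on.

3) The re.S flag turns on dot-all mode, so the dot . also matches newline characters.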
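The three points above can be checked on tiny strings. This is a Python 3 sketch; the sample strings are made up for the demonstration.

```python
import re

html = '<h2>Alice</h2>\n<h2>Bob</h2>'

# 1) Greedy .* runs to the last </h2>; non-greedy .*? stops at the first one.
greedy = re.findall(r'<h2>(.*)</h2>', html, re.S)
lazy = re.findall(r'<h2>(.*?)</h2>', html, re.S)
print(greedy)  # ['Alice</h2>\n<h2>Bob']
print(lazy)    # ['Alice', 'Bob']

# 2) With several groups, findall returns one tuple per match.
pairs = re.findall(r'<b>(.*?)</b><i>(.*?)</i>', '<b>x</b><i>1</i><b>y</b><i>2</i>')
print(pairs)   # [('x', '1'), ('y', '2')]

# 3) Without re.S the dot stops at a newline, so a multi-line span is missed.
multi = '<span>line 1\nline 2</span>'
without_flag = re.findall(r'<span>(.*?)</span>', multi)
with_flag = re.findall(r'<span>(.*?)</span>', multi, re.S)
print(without_flag)  # []
print(with_flag)     # ['line 1\nline 2']
```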

content = response.read().decode('utf-8')
pattern = re.compile('<div class="author clearfix">.*?<h2>(.*?)</h2>.*?<div.*?span>(.*?)</span>.*?<div class="stats">.*?"number">(.*?)</i>', re.S)
items = re.findall(pattern, content)  # see Python's re module: finds every part of content that matches pattern, i.e. every joke

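There is one problem: the expression above picks up jokes both with and without images. The images will not show up in text output, so we drop the jokes that have images and keep only the text-only ones. This needs a small change to the regex.

The figure above shows the HTML of a joke without an image; the figure below shows the HTML of a joke with an image:

[Figure: screenshot of the HTML source of a joke that contains an image]

The <div class="thumb"> marked in red holds the image, and this element does not exist in the HTML of an image-free joke, so the "img" inside it (underlined in the figure above) can be used to filter jokes. Note also that this element sits between the joke text and the like count.

So we add one more (.*?) between the joke-text part and the like-count part of the regex. If the text captured by that group contains "img", the joke is filtered out.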

content = response.read().decode('utf-8')
# Note the extra (.*?) after </span>: it captures whatever lies between the joke text and the stats block
pattern = re.compile('<div class="author clearfix">.*?<h2>(.*?)</h2>.*?<div.*?span>(.*?)</span>(.*?)<div class="stats">.*?"number">(.*?)</i>',
                     re.S)
items = re.findall(pattern, content)  # list of tuples, one per joke matched by the regex
for item in items:
    # item[0..3]: user name, joke text, image block (if any), like count, so item[2] is the one to test
    haveImg = re.search("img", item[2])
    if not haveImg:
        print item[0], item[1], item[3]
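The filter can be exercised offline. The Python 3 sketch below runs the four-group pattern over a hypothetical page with two jokes, one of which has a thumb block; the markup, user names, and the helper name text_only_jokes are invented for the example, and the real page may differ.

```python
import re

PATTERN = re.compile(
    r'<div class="author clearfix">.*?<h2>(.*?)</h2>'
    r'.*?<div.*?span>(.*?)</span>'
    r'(.*?)'                                   # anything between text and stats
    r'<div class="stats">.*?"number">(.*?)</i>',
    re.S)

# Hypothetical page: the second joke carries an image block.
PAGE = '''
<div class="article block untagged mb15" id="a1">
<div class="author clearfix"><h2>user_a</h2></div>
<div class="content"><span>text only joke</span></div>
<div class="stats"><i class="number">12</i></div>
</div>
<div class="article block untagged mb15" id="a2">
<div class="author clearfix"><h2>user_b</h2></div>
<div class="content"><span>joke with picture</span></div>
<div class="thumb"><img src="pic.jpg" alt="pic"></div>
<div class="stats"><i class="number">34</i></div>
</div>
'''

def text_only_jokes(html):
    """Return (author, text, likes) for every joke without an image."""
    kept = []
    for author, text, between, likes in PATTERN.findall(html):
        if 'img' not in between:   # same check as re.search("img", ...)
            kept.append((author, text, likes))
    return kept

print(text_only_jokes(PAGE))  # [('user_a', 'text only joke', '12')]
```

One caveat of this approach: any "img" substring between the joke text and the stats block triggers the filter, not only an actual <img> tag.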

Putting the pieces together, the following code scrapes every image-free joke from one page:

import urllib
import urllib2
import re
page = 2
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
content = response.read().decode('utf-8')
pattern = re.compile('<div class="author clearfix">.*?<h2>(.*?)</h2>.*?<div.*?span>(.*?)</span>(.*?)<div class="stats">.*?"number">(.*?)</i>',
                     re.S)
items = re.findall(pattern, content)
for item in items:
    haveImg = re.search("img", item[2])  # item[2] is the block that would hold an image
    if not haveImg:
        print item[0], item[1], item[3]  # user name, joke text, like count

4. The code above is the core, but it is a bit crude. Here is a slightly tidied-up version:

# coding:utf-8

import urllib
import urllib2
import re

class Spider_QSBK:
    def __init__(self):
        self.page_index = 2
        self.enable = False
        self.stories = []
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        self.headers = {'User-Agent': self.user_agent}

    def getPage(self, page_index):
        # Download one page; return its HTML, or None if the request fails
        url = 'http://www.qiushibaike.com/hot/page/' + str(page_index)
        try:
            request = urllib2.Request(url, headers=self.headers)
            response = urllib2.urlopen(request)
            content = response.read().decode('utf-8')
            return content
        except urllib2.URLError, e:
            print e.reason
            return None

    def getStories(self, page_index):
        # Collect [user name, joke text, like count] for every image-free joke
        content = self.getPage(page_index)
        if content is None:   # download failed, nothing to parse
            return self.stories
        pattern = re.compile('<div class="author clearfix">.*?<h2>(.*?)</h2>.*?<div.*?span>(.*?)</span>(.*?)<div class="stats">.*?"number">(.*?)</i>',
                             re.S)
        items = re.findall(pattern, content)
        for item in items:
            haveImg = re.search("img", item[2])
            if not haveImg:
                self.stories.append([item[0], item[1], item[3]])
        return self.stories

    def ShowStories(self, page_index):
        self.getStories(page_index)
        for st in self.stories:
            print u"Page %d\tAuthor: %s\tLikes: %s\n%s" % (page_index, st[0], st[2], st[1])
        self.stories = []   # reset the list; "del self.stories" would remove the attribute and break the next call

    def start(self):
        self.enable = True
#        while self.enable:
        self.ShowStories(self.page_index)
        self.page_index += 1


spider = Spider_QSBK()
spider.start()

The result is the same:

[Figure: screenshot of the program output]
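For readers on Python 3, the spider could be ported roughly as follows. This is a sketch under stated assumptions, not the author's code: the class name SpiderSketch, the injectable fetch argument, and the canned HTML are all made up here so that the parsing can be tried without a live request, and the live site may no longer match the markup the pattern expects.

```python
import re
import urllib.request
from urllib.error import URLError

PATTERN = re.compile(
    r'<div class="author clearfix">.*?<h2>(.*?)</h2>'
    r'.*?<div.*?span>(.*?)</span>(.*?)'
    r'<div class="stats">.*?"number">(.*?)</i>',
    re.S)

class SpiderSketch:
    """Python 3 sketch of the spider; fetch can be replaced by a stub."""

    def __init__(self, fetch=None):
        self.headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        # Hypothetical hook: any callable taking a page index and returning HTML or None.
        self.fetch = fetch or self._download

    def _download(self, page_index):
        url = 'http://www.qiushibaike.com/hot/page/' + str(page_index)
        request = urllib.request.Request(url, headers=self.headers)
        try:
            with urllib.request.urlopen(request) as response:
                return response.read().decode('utf-8')
        except URLError as e:   # also covers HTTPError, its subclass
            print(e.reason)
            return None

    def get_stories(self, page_index):
        content = self.fetch(page_index)
        if content is None:     # download failed, nothing to parse
            return []
        return [(author, text, likes)
                for author, text, between, likes in PATTERN.findall(content)
                if 'img' not in between]

    def show_stories(self, page_index):
        for author, text, likes in self.get_stories(page_index):
            print('Page %d\tAuthor: %s\tLikes: %s\n%s' % (page_index, author, likes, text))


# Offline usage: canned HTML instead of a live request.
CANNED = ('<div class="author clearfix"><h2>user_a</h2></div>'
          '<div class="content"><span>hello</span></div>'
          '<div class="stats"><i class="number">7</i></div>')
spider = SpiderSketch(fetch=lambda page_index: CANNED)
spider.show_stories(2)
```

Returning a fresh list from get_stories, rather than accumulating into an attribute, avoids having to reset state between pages. Calling SpiderSketch() with no argument would use the real download path.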
