
Python Crawler in Practice (3): WeChat Article Crawler

Preface

I've had a lot on my mind lately, and what brings me joy is line after line of code that actually runs. So today I bring you a hands-on crawl of WeChat articles.

Goals for this article

  1. Search WeChat articles by keyword and extract the article links
  2. Automatically save the WeChat articles, storing them in HTML format
  3. Let the user set how many pages of articles to extract, with the relevant interactive prompts

Quick start

1. Determine the URL format

First open the Sogou WeChat search platform, search for any keyword that interests you, and look at the address bar:

http://weixin.sogou.com/weixin?type=2&query=%E5%B0%8F%E7%B1%B3&ie=utf8&s_from=input&_sug_=n&_sug_type_=1&w=01015002&oq=&ri=0&sourceid=sugg&sut=0&sst0=1520605349656&lkt=0%2C0%2C0&p=40040108


Here I searched for 小米 (Xiaomi). Notice that the type field in the link controls the type of results returned, while the query field corresponds to our keyword; it is URL-encoded automatically, which is why we see something like %E5…. The remaining parameters are irrelevant, so we won't discuss them here. Now click through to the next page:

http://weixin.sogou.com/weixin?oq=&query=%E5%B0%8F%E7%B1%B3&_sug_type_=1&sut=0&lkt=0%2C0%2C0&s_from=input&ri=0&_sug_=n&type=2&sst0=1520605349656&page=2&ie=utf8&p=40040108&dp=1&w=01015002&dr=1

The link now contains an extra page field with the value 2. We guess this is the page number, and sure enough, testing confirms the guess. Next, let's crawl the articles themselves.
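
Putting these observations together, here is a minimal sketch that rebuilds the search URL from just the fields that matter (dropping the other parameters is an assumption based on the tests above; urllib.parse.urlencode takes care of encoding the keyword):

from urllib.parse import urlencode

def build_search_url(keyword, page=1):
    # type=2 selects article results; query and page are the fields observed above.
    # Omitting the remaining parameters is an assumption that Sogou tolerates.
    params = {"type": 2, "query": keyword, "page": page}
    return "http://weixin.sogou.com/weixin?" + urlencode(params)

print(build_search_url("小米", 2))
# http://weixin.sogou.com/weixin?type=2&query=%E5%B0%8F%E7%B1%B3&page=2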

2. Extract the article links

Press F12 and inspect the page in the developer console; we find that each article's URL sits inside a txt-box element, as shown in the figure.

From this, it is easy to write our matching pattern:

'<div class="txt-box">.*?(http://.*?)"'
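
As a quick sanity check, here is the pattern applied to a made-up HTML fragment (the fragment is fabricated for illustration; the real Sogou markup carries more attributes, which the non-greedy .*? simply skips over):

import re

sample = '<div class="txt-box"><h3><a target="_blank" href="http://mp.weixin.qq.com/s?src=11&amp;ts=1">'
listurlpat = '<div class="txt-box">.*?(http://.*?)"'
print(re.compile(listurlpat, re.S).findall(sample))
# prints: ['http://mp.weixin.qq.com/s?src=11&amp;ts=1']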

The corresponding complete function is as follows:

def getlisturl(key, pagestart, pageend):
    try:
        keycode = urllib.request.quote(key)  # URL-encode the keyword
        for page in range(pagestart, pageend+1):
            # keep "&page=" literal: quote() would encode "&" and break the parameter
            url = "http://weixin.sogou.com/weixin?type=2&query=" + keycode + "&page=" + str(page)
            data1 = downloader(url)
            listurlpat = '<div class="txt-box">.*?(http://.*?)"'
            data2 = re.compile(listurlpat, re.S)
            result = data2.findall(data1)
            listurl.append(result)  # one list of article links per result page
            #print(listurl)
        print("Fetched " + str(len(listurl)) + " result pages in total")
        return listurl
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        time.sleep(10)
    except Exception as e:
        print("exception:" + str(e))
        time.sleep(1)

Here, key is the search keyword, pagestart is the starting page number (page 1 by default), and pageend is the final page number. listurl is a global list, defined in the main block further below, that collects one list of links per result page.
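
Assuming downloader and the global listurl are already in place (see the complete code below), a call looks like this:

listurl = list()
pages = getlisturl("小米", 1, 2)  # fetch result pages 1 and 2
# pages[0] holds the links found on result page 1, pages[1] those on page 2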

The download module

We need to download the content behind each link before we can extract anything from it. For this, I provide two versions of the download module: one that goes through a proxy server and one that does not. Here is the version without a proxy:

def downloader(url):
    # Spoof a desktop-browser User-Agent so the request looks like a normal visit
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                             "(KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 "
                             "Core/1.63.4793.400 QQBrowser/10.0.702.400")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    urllib.request.install_opener(opener)
    try:
        data = urllib.request.urlopen(url).read()
        data = data.decode('utf-8')
        return data
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        time.sleep(10)
    except Exception as e:
        print("exception:" + str(e))
        time.sleep(1)

This applies some simple User-Agent spoofing so that WeChat is less likely to block the crawler, but it is not a reliable safeguard, so the complete code also includes a proxy-server module. Since free proxy servers are almost all unusable, I won't go into it in detail here.
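
If you still want to experiment with proxies, here is a minimal sketch for checking that a proxy address is alive before crawling with it (the address below is the placeholder that appears commented out in the main block of the complete code):

import urllib.request

def proxy_alive(proxy_addr, timeout=5):
    # Route a cheap request through the proxy; any exception means it is unusable.
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy)
    try:
        opener.open("http://weixin.sogou.com", timeout=timeout)
        return True
    except Exception:
        return False

print(proxy_alive("115.223.209.169:9000"))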

The complete code is as follows:

import re
import urllib.request
import time
import urllib.error

'''
def use_proxy(proxy_addr, url):
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                             "(KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 "
                             "Core/1.63.4793.400 QQBrowser/10.0.702.400")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    urllib.request.install_opener(opener)
    try:
        proxy = urllib.request.ProxyHandler({'http':proxy_addr})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        data = urllib.request.urlopen(url).read().decode('utf-8')
        return data
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        time.sleep(10)
    except Exception as e:
        print("exception:" + str(e))
        time.sleep(1)
'''


def downloader(url):
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                             "(KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 "
                             "Core/1.63.4793.400 QQBrowser/10.0.702.400")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    urllib.request.install_opener(opener)
    try:
        data = urllib.request.urlopen(url).read()
        data = data.decode('utf-8')
        return data
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        time.sleep(10)
    except Exception as e:
        print("exception:" + str(e))
        time.sleep(1)


def getlisturl(key, pagestart, pageend):
    try:
        keycode = urllib.request.quote(key)  # URL-encode the keyword
        for page in range(pagestart, pageend+1):
            # keep "&page=" literal: quote() would encode "&" and break the parameter
            url = "http://weixin.sogou.com/weixin?type=2&query=" + keycode + "&page=" + str(page)
            data1 = downloader(url)
            listurlpat = '<div class="txt-box">.*?(http://.*?)"'
            data2 = re.compile(listurlpat, re.S)
            result = data2.findall(data1)
            listurl.append(result)
            #print(listurl)
        print("共獲取到" + str(len(listurl)) + "頁")
        return listurl
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        time.sleep(10)
    except Exception as e:
        print("exception:" + str(e))
        time.sleep(1)


def getcontent(listurl):
    i = 0
    html1 = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>WeChat Articles</title>
    </head>
    <body>'''
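    # write the HTML skeleton header to the output file first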
    fh = open("C://Users/59574/Desktop/python/test/WeChatSpider/1.html", "wb")
    fh.write(html1.encode("utf-8"))
    fh.close()
    fh = open("C://Users/59574/Desktop/python/test/WeChatSpider/1.html", "ab")
    for i in range(0, len(listurl)):
        for j in range(0, len(listurl[i])):
            try:
                url = listurl[i][j]
                url = url.replace("amp;", "")  # turn "&amp;" back into "&" in the link
                data = downloader(url)
                titlepat = "<title>(.*?)</title>"
                contentpat = 'id="js_content">(.*?)id="js_sg_bar"'
                title = re.compile(titlepat).findall(data)
                content = re.compile(contentpat, re.S).findall(data)
                # default placeholders so both variables are always defined,
                # even when a page yields no match
                thistitle = "Not retrieved this time"
                thiscontent = "Not retrieved this time"
                if (title != []):
                    thistitle = title[0]
                if (content != []):
                    thiscontent = content[0]
                dataall = "<p>Title: " + thistitle + "</p><p>Content: " + thiscontent + "</p><br>"
                fh.write(dataall.encode("utf-8"))
                print("Saved item " + str(j+1) + " of result page " + str(i+1))
            except urllib.error.URLError as e:
                if hasattr(e, "code"):
                    print(e.code)
                if hasattr(e, "reason"):
                    print(e.reason)
                time.sleep(10)
            except Exception as e:
                print("exception:" + str(e))
                time.sleep(1)
    fh.close()
    html2 = '''</body>
    </html>
    '''
    fh = open("C://Users/59574/Desktop/python/test/WeChatSpider/1.html", "ab")
    fh.write(html2.encode("utf-8"))
    fh.close()


if __name__ == '__main__':
    listurl = list()
    key = str(input('Enter the keyword to search for: '))
    #proxy = "115.223.209.169:9000"
    pagestart = 1
    pageend = int(input('Enter the last page to fetch (10 items are saved per page): '))
    listurl = getlisturl(key, pagestart, pageend)
    getcontent(listurl)

GitHub version

Run result

Wrap-up

The new semester has started and things are busy, and some personal matters have been getting me down, but I will do my best to keep updating. Next time I will bring you a multithreaded version of this code.
Finally, I hope everyone respects the work I put in; please credit the source when reposting my articles.
I also hope you will follow my personal blog, HD Blog

Thanks~