
Python Crawler Example: Downloading Multiple Pages of Topics from Baidu Tieba

In last week's web-crawling class we were given an exercise: download multiple pages of topics from Baidu Tieba. What I actually implemented was crawling multiple pages of a single post within a tieba, which is not quite what the assignment asked for, namely crawling multiple pages of topics from a tieba. Moreover, after the instructor's review I immediately saw the gap between my code and the instructor's: mine was sloppy and non-standard in many places, and it did not reflect any object-oriented thinking. So I am redoing the exercise to learn how an expert writes it.

Example: downloading multiple pages of topics from Baidu Tieba

  • loadPage(url): fetches a web page
  • writePage(html, filename): saves the fetched page to a local file
  • tiebaCrawler(url, beginPage, endPage, keyword): the scheduler, supplying the URLs to be crawled
  • main: the program's entry point, providing a basic command-line interface
import urllib.request
import urllib.parse

def loadPage(url):
    '''
        Function: Fetching url and accessing the webpage content
        url: the wanted webpage url
    '''
    headers = {
            'Accept': 'text/html',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    print('To send HTTP request to %s ' % url)
    request = urllib.request.Request(url, headers=headers)
    
    return urllib.request.urlopen(request).read().decode('utf-8')

def writePage(html, filename):
    '''
        Function: Write the fetched HTML content into a local file
        html: the response content
        filename: the local file in which to store the response
    '''
    print('To write html into a local file %s ...' % filename)
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)  # html is already a str after decode('utf-8')
    print('Work done!')
    print('---'*10)

def tiebaCrawler(url, beginPage, endPage, keyword):
    '''
        Function: The scheduler of the crawler; visits every wanted url in turn
        url: the url of tieba's webpage
        beginPage: initial page
        endPage: end page
        keyword: the wanted keyword
    '''
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50  # offset into the topic list: pages step by 50
        queryurl = url + '&pn=' + str(pn)
        filename = keyword + str(page) + '_tieba.html'
        writePage(loadPage(queryurl), filename)
        

if __name__ == '__main__':
    kw = input('Please input the wanted tieba\'s name:')
    beginPage = int(input('The beginning page number:'))
    endPage = int(input('The ending page number:'))
    
    # Example Baidu Tieba query URL: http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&red_tag=m2239217474
    url = 'http://tieba.baidu.com/f?ie=utf-8&'
    key = urllib.parse.urlencode({'kw': kw})
    queryurl = url + key
    
    tiebaCrawler(queryurl, beginPage, endPage, kw)
Run results:

Please input the wanted tieba's name:趙麗穎
The beginning page number:1
The ending page number:3
To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=0 
To write html into a local file 趙麗穎1_tieba.html ...
Work done!
------------------------------
To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=50 
To write html into a local file 趙麗穎2_tieba.html ...
Work done!
------------------------------
To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=100 
To write html into a local file 趙麗穎3_tieba.html ...
Work done!
------------------------------
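The script builds the query URL in two places: main percent-encodes the keyword with urllib.parse.urlencode, and tiebaCrawler appends the pn offset. As a side note, both steps could be centralized in one small helper; the function below is my own sketch (the name build_tieba_url is not from the original), reusing the same URL scheme and the same 50-per-page offset rule:

```python
import urllib.parse

def build_tieba_url(keyword, page):
    """Build the Tieba list URL for a 1-indexed page (pn steps by 50)."""
    params = {'ie': 'utf-8', 'kw': keyword, 'pn': (page - 1) * 50}
    # urlencode percent-encodes the keyword and joins all params at once
    return 'http://tieba.baidu.com/f?' + urllib.parse.urlencode(params)
```

For example, build_tieba_url('python', 2) yields a URL containing kw=python and pn=50, matching the URLs printed in the run above.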

Lessons learned

  1. Split the functionality apart as much as possible: each function does one thing, with well-controlled inputs and outputs, reflecting object-oriented thinking.
  2. Give every function a docstring (formatted as in the code above) that makes two things clear: a. what the function does; b. what each parameter means. Spelling these out not only greatly improves readability, it also pushes you to write more disciplined code.
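Point 1 mentions object-oriented thinking, which the procedural script above does not yet show. A minimal class-based sketch of the same crawler follows; this is my own refactoring (the class and method names are mine), keeping the original URL scheme and file-naming convention:

```python
import urllib.parse
import urllib.request

class TiebaCrawler:
    """Bundles the URL building, fetching, and saving steps into one object."""

    BASE_URL = 'http://tieba.baidu.com/f?'

    def __init__(self, keyword):
        self.keyword = keyword

    def page_url(self, page):
        # Same offset rule as the script: pn steps by 50 per list page
        params = {'ie': 'utf-8', 'kw': self.keyword, 'pn': (page - 1) * 50}
        return self.BASE_URL + urllib.parse.urlencode(params)

    def fetch(self, url):
        headers = {'User-Agent': 'Mozilla/5.0'}
        request = urllib.request.Request(url, headers=headers)
        return urllib.request.urlopen(request).read().decode('utf-8')

    def save(self, html, filename):
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(html)

    def run(self, begin_page, end_page):
        # Fetch and save each page, mirroring tiebaCrawler's loop
        for page in range(begin_page, end_page + 1):
            filename = '%s%d_tieba.html' % (self.keyword, page)
            self.save(self.fetch(self.page_url(page)), filename)
```

With this layout, TiebaCrawler('python').run(1, 3) would replace the whole main block, and each step can be tested in isolation.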