Python Crawler Example: Downloading Multiple Pages of Topics from Baidu Tieba
阿新 • Published: 2018-12-15
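Last week's web crawler course set a practical exercise: download multiple pages of topic content from Baidu Tieba. What I built crawled multiple pages of a single thread, which is not quite the same as the assignment, which asked for multiple pages of a tieba's topic list. Once the instructor reviewed the solutions, I saw at once how far my code was from the instructor's: mine was sloppy and non-standard in many places, and it showed no object-oriented thinking. So I am redoing the exercise here to learn how an expert writes it.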
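Example: downloading multiple pages of topic content from Baidu Tieba
- loadPage(url): fetches a web page
- writePage(html, filename): saves a fetched page to a local file
- tiebaCrawler(url, beginPage, endPage, keyword): the scheduler; it builds the URL of each page to be fetched
- main: the program's main control block, providing a basic command-line interface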
    import urllib.request
    import urllib.parse


    def loadPage(url):
        '''
        Function: fetch a URL and return the page content as text
        url: the URL of the wanted page
        '''
        headers = {
            'Accept': 'text/html',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
        }
        print('To send HTTP request to %s ' % url)
        request = urllib.request.Request(url, headers=headers)
        return urllib.request.urlopen(request).read().decode('utf-8')


    def writePage(html, filename):
        '''
        Function: write the html content into a local file
        html: the response content
        filename: the name of the local file that stores the response
        '''
        print('To write html into a local file %s ...' % filename)
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(str(html))
        print('Work done!')
        print('---' * 10)


    def tiebaCrawler(url, beginPage, endPage, keyword):
        '''
        Function: the crawler's scheduler; visits every wanted URL in turn
        url: the URL of the tieba's page
        beginPage: first page number
        endPage: last page number (inclusive)
        keyword: the wanted keyword
        '''
        for page in range(beginPage, endPage + 1):
            # Each list page advances the pn parameter by 50
            pn = (page - 1) * 50
            queryurl = url + '&pn=' + str(pn)
            filename = keyword + str(page) + '_tieba.html'
            writePage(loadPage(queryurl), filename)


    if __name__ == '__main__':
        kw = input('Please input the wanted tieba\'s name:')
        beginPage = int(input('The beginning page number:'))
        endPage = int(input('The ending page number:'))
        # Example Baidu Tieba query URL:
        # http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&red_tag=m2239217474
        url = 'http://tieba.baidu.com/f?ie=utf-8&'
        key = urllib.parse.urlencode({'kw': kw})
        queryurl = url + key
        tiebaCrawler(queryurl, beginPage, endPage, kw)
Output:

    Please input the wanted tieba's name:趙麗穎
    The beginning page number:1
    The ending page number:3
    To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=0
    To write html into a local file 趙麗穎1_tieba.html ...
    Work done!
    ------------------------------
    To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=50
    To write html into a local file 趙麗穎2_tieba.html ...
    Work done!
    ------------------------------
    To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=100
    To write html into a local file 趙麗穎3_tieba.html ...
    Work done!
    ------------------------------
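The paging rule is easiest to see in isolation: page 1 maps to pn=0, page 2 to pn=50, and so on. The sketch below only builds the URLs and sends no requests. The function name `build_page_urls` and the constants `BASE_URL` and `POSTS_PER_PAGE` are my own names for illustration and are not part of the script above.

```python
from urllib.parse import urlencode

# Assumed names for illustration; the values are taken from the script above.
BASE_URL = 'http://tieba.baidu.com/f?ie=utf-8&'
POSTS_PER_PAGE = 50  # step of the pn parameter between list pages


def build_page_urls(keyword, begin_page, end_page):
    '''
    Function: build the list-page URLs of one tieba without sending requests
    keyword: the tieba's name
    begin_page: first page number (1-based)
    end_page: last page number (inclusive)
    '''
    # urlencode percent-encodes non-ASCII keywords as UTF-8
    query = urlencode({'kw': keyword})
    urls = []
    for page in range(begin_page, end_page + 1):
        pn = (page - 1) * POSTS_PER_PAGE
        urls.append(BASE_URL + query + '&pn=' + str(pn))
    return urls


for url in build_page_urls('python', 1, 3):
    print(url)
```

Keeping the URL construction in a function with no side effects means it can be checked without touching the network, which is handy before running a real crawl.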
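Takeaways
- Split the functionality up as much as possible: each function does exactly one thing, with its inputs and outputs kept under control. This reflects an object-oriented way of thinking.
- Give every function a docstring (in the format used in the code above) that makes two things clear: (a) what the function does, and (b) what each parameter means. Spelling these out greatly improves readability and also pushes you to keep your own code disciplined.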
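One way to take both points a step further is to group the crawl settings into a small class. This is only a sketch of how that might look, not the instructor's code: the class name `TiebaCrawler`, the helper `fetch_html`, and the `fetch` and `out_dir` parameters are all assumptions introduced for illustration. Passing the fetch function in as a parameter lets the class be exercised with a fake fetcher, so no request is sent while testing.

```python
import os
import urllib.request


def fetch_html(url):
    '''
    Function: download one page and return it as text
    url: the page URL
    '''
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8')


class TiebaCrawler:
    '''
    Function: hold the settings of one crawl and save each page to disk
    base_url: query URL that already contains the kw parameter
    keyword: the tieba's name, used as the filename prefix
    fetch: callable that takes a URL and returns HTML text (hypothetical hook)
    out_dir: directory in which the pages are saved
    '''

    def __init__(self, base_url, keyword, fetch=fetch_html, out_dir='.'):
        self.base_url = base_url
        self.keyword = keyword
        self.fetch = fetch
        self.out_dir = out_dir

    def page_url(self, page):
        '''
        Function: build the URL of one list page
        page: page number (1-based)
        '''
        return self.base_url + '&pn=' + str((page - 1) * 50)

    def filename(self, page):
        '''
        Function: build the local path for one saved page
        page: page number (1-based)
        '''
        name = self.keyword + str(page) + '_tieba.html'
        return os.path.join(self.out_dir, name)

    def crawl(self, begin_page, end_page):
        '''
        Function: fetch and save every page in the range, return the saved paths
        begin_page: first page number
        end_page: last page number (inclusive)
        '''
        saved = []
        for page in range(begin_page, end_page + 1):
            html = self.fetch(self.page_url(page))
            path = self.filename(page)
            with open(path, 'w', encoding='utf-8') as f:
                f.write(html)
            saved.append(path)
        return saved
```

A real run would look like `TiebaCrawler('http://tieba.baidu.com/f?ie=utf-8&kw=python', 'python').crawl(1, 3)`, which sends three requests and writes three files to the current directory.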