Learning Notes: Writing a Spider to Crawl Baidu Tieba Posts
阿新 • Published: 2019-02-02
Building on my earlier work, I again used Python 3 to write a program that crawls the posts in a Baidu Tieba thread. Without further ado, here is the key code.
The full code has also been uploaded to GitHub: https://github.com/callMeBin2217/python3_Spider . Anyone interested is welcome to download it and take a look, or to discuss it with me. I'm a complete beginner, so please go easy on me.

```python
# Crawl the content of one Tieba thread (a single page)
import re
import urllib.error
import urllib.request


class Tool:
    # Minimal stand-in for the Tool helper the snippet relies on (the full
    # version is assumed to live in the GitHub repo above): it converts
    # <br> tags to newlines and strips any remaining HTML tags.
    replaceBR = re.compile(r'<br\s*/?>')
    removeTag = re.compile(r'<.*?>')

    def replace(self, x):
        x = re.sub(self.replaceBR, '\n', x)
        x = re.sub(self.removeTag, '', x)
        return x.strip()


page = 1
baseUrl = r'https://tieba.baidu.com/p/2687476192'
seeLZ = 0  # 0 = show all replies, 1 = original poster only

try:
    url = baseUrl + '?see_lz=' + str(seeLZ) + '&pn=' + str(page)
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')

    # Get the thread title
    patternTitle = re.compile(r'<h\d class="core_title_txt.*?>(.*?)</h\d>', re.S)
    resultTitle = re.search(patternTitle, content)
    print(resultTitle.group(1).strip())

    # Get the reply count and the total number of pages
    patternNum = re.compile(r'<li class="l_reply_num".*?><span.*?>(.*?)</span.*?'
                            r'<span.*?>(.*?)</span>', re.S)
    resultNum = re.search(patternNum, content)
    print(resultNum.group(1).strip(), resultNum.group(2).strip())

    # Get the content of each floor (reply)
    patternContent = re.compile(r'<div id="post_content_.*?">(.*?)</div>', re.S)
    items = re.findall(patternContent, content)
    tool = Tool()
    for item in items:
        print('\n', tool.replace(item), '\n')
except urllib.error.URLError as e:  # was urllib.request.URLError, which does not exist
    if hasattr(e, 'reason'):
        print(e.reason)
```
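As a side note, instead of concatenating `'?see_lz=' + str(seeLZ) + '&pn=' + str(page)` by hand, the standard library's `urllib.parse.urlencode` can build the query string and handle escaping automatically. The parameter names `see_lz` and `pn` come from the snippet above; the rest of this sketch is just illustration:

```python
from urllib.parse import urlencode

baseUrl = 'https://tieba.baidu.com/p/2687476192'
params = {'see_lz': 0, 'pn': 1}  # 0 = show all replies, page 1

# urlencode joins the dict into "see_lz=0&pn=1" and percent-escapes
# anything unsafe, so the URL stays valid even for odd parameter values.
url = baseUrl + '?' + urlencode(params)
print(url)  # https://tieba.baidu.com/p/2687476192?see_lz=0&pn=1
```

This becomes more useful once the spider takes user input (e.g. a search keyword), where manual concatenation would break on spaces or Chinese characters.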
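The reply-count regex above captures two numbers: the number of replies and the total number of pages, and the page count is exactly what you need to crawl the whole thread rather than a single page. Here is a small self-contained demo of that regex run against a hand-written HTML fragment (the fragment is made up, but mirrors the `l_reply_num` markup the pattern targets), followed by the loop over `pn` it enables:

```python
import re

# Hand-written sample of Tieba's reply-count markup (illustrative only)
sample = ('<li class="l_reply_num" style="margin-left:8px">'
          '<span class="red" style="margin-right:3px">3251</span>回复贴,共'
          '<span class="red">109</span>页</li>')

# Same pattern as in the snippet above: group 1 = replies, group 2 = pages
patternNum = re.compile(r'<li class="l_reply_num".*?><span.*?>(.*?)</span.*?'
                        r'<span.*?>(.*?)</span>', re.S)
m = re.search(patternNum, sample)
replies, totalPages = m.group(1), m.group(2)
print(replies, totalPages)  # 3251 109

# With the page count in hand, crawling every page is just a loop over pn:
for page in range(1, int(totalPages) + 1):
    url = 'https://tieba.baidu.com/p/2687476192?see_lz=0&pn=' + str(page)
    # urllib.request.urlopen(url) ... then parse each page as above
```

Extracting the total first and then looping avoids hard-coding `page = 1` and stops cleanly at the last page.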