
A Jianshu crawler with a configurable page count that scrapes article titles, abstracts, and links

#coding=utf-8
import requests
from bs4 import BeautifulSoup

m = input("Enter the number of pages to scrape: ")
# range() excludes its stop value, so add 1 to fetch pages 1..m
for i in range(1, int(m) + 1):
    url = "https://www.jianshu.com/?page=" + str(i)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
        'Accept': 'text/html, */*; q=0.01',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'https://www.jianshu.com/',
        'X-INFINITESCROLL': 'true',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'close',
    }
    html = requests.get(url=url, headers=headers)
    # Decode the body as UTF-8; html.encoding can be None, so do not encode/decode by hand
    html.encoding = 'utf-8'
    soup = BeautifulSoup(html.text, 'html.parser')
    # Pretty-print the HTML for debugging
    # print(soup.prettify())
    titles = soup.find_all('a', 'title')       # <a class="title"> elements
    titlesp = soup.find_all('p', 'abstract')   # <p class="abstract"> elements
    with open(r"./文章簡介.txt", "a", encoding='utf-8') as file:
        for (title, titlep) in zip(titles, titlesp):
            # get_text() returns '' where .string would return None for nested tags
            file.write(title.get_text().strip() + '\n')
            file.write(titlep.get_text().strip() + '\n')
            file.write("https://www.jianshu.com" + title.get('href') + '\n\n')

print("Done. Results saved to ./文章簡介.txt")

Environment: Python 3

Modules: requests, bs4
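The page-number handling and the per-article output in the script above can be pulled out into two small functions that are testable without any network access. This is only a sketch using the standard library: the names `BASE_URL`, `page_urls`, and `format_entry` are hypothetical helpers, not part of the original script, and the example assumes that article links are relative paths such as `/p/abc123`.

```python
from urllib.parse import urljoin

# Assumption: the same site root the script concatenates links against.
BASE_URL = "https://www.jianshu.com"


def page_urls(page_count):
    """Return the listing URLs for pages 1..page_count, inclusive."""
    # range() excludes its stop value, so add 1 to include the last page.
    return [f"{BASE_URL}/?page={page}" for page in range(1, page_count + 1)]


def format_entry(title, abstract, href):
    """Build the text record for one article: title, abstract, link, blank line."""
    # urljoin resolves a relative href against BASE_URL.
    link = urljoin(BASE_URL, href)
    return f"{title.strip()}\n{abstract.strip()}\n{link}\n\n"


if __name__ == "__main__":
    for url in page_urls(2):
        print(url)
    print(format_entry("  Sample title ", "Sample abstract", "/p/abc123"), end="")
```

Keeping these two steps separate from the HTTP request makes it easy to check that the page count is inclusive and that each record has the expected layout.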