
[Python Web Scraping, Part 2] Looking Up Book Information by Scraping Douban Books

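I've been learning web scraping for about half a month now. Honestly, all I've really learned in that time is how to use requests and BeautifulSoup. It's a bit embarrassing: not much progress, and I haven't touched the Scrapy framework yet. Still, BeautifulSoup alone can already do a lot. I've also used pandas a little to tidy up the scraped data, though I'm not yet at the point of analyzing it.
Being naive at first, I didn't realize that scraping too fast gets your IP banned, so one day Douban simply stopped returning anything. I searched Baidu and learned that I needed to add request headers and a wait between requests. (To be honest, I really didn't want to add the wait: it's slow, and since I don't know how to do distributed scraping, it makes everything very inefficient. But I added it anyway to avoid getting banned.) I'm not sure whether the headers I added actually help. I've also seen cookies mentioned online but still haven't figured out how to use them, and I don't know how to set up the IP pool that a senior classmate suggested either. So I'm still very much a beginner. Anyway, I wrote a simple scraper-based book lookup. It's far from perfect, but I'm putting it out there as is.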
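The measures mentioned above (request headers, a wait between requests, cookies, and a proxy from an IP pool) can all be attached to a `requests.Session`. The following is only a minimal sketch and has not been tested against Douban: the helper names `build_session` and `polite_get`, the default delays, and any cookie or proxy values passed in are assumptions made up for illustration.

```python
import random
import time

import requests


def build_session(user_agent, cookies=None, proxy=None):
    """Return a Session that reuses the same headers, cookies and proxy for every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    if cookies:
        # For example, cookies copied from a logged-in browser session.
        session.cookies.update(cookies)
    if proxy:
        # A single proxy; an "IP pool" would mean building sessions with different proxies.
        session.proxies.update({'http': proxy, 'https': proxy})
    return session


def polite_get(session, url, min_delay=1.0, max_delay=3.0, **kwargs):
    """Sleep for a random interval, then fetch the URL with the shared session."""
    time.sleep(random.uniform(min_delay, max_delay))
    kwargs.setdefault('timeout', 10)  # avoid hanging forever on a stalled connection
    response = session.get(url, **kwargs)
    response.raise_for_status()  # raise on 4xx/5xx, which is often how a ban shows up
    return response
```

A call would look like `polite_get(build_session('Mozilla/5.0 ...'), some_url)`. Randomizing the delay instead of sleeping a fixed number of seconds is a design choice rather than a requirement; whether it actually prevents a ban depends on the site.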

# Look up a book's Douban rating and related details from a title entered by the user
import time

import pandas
import requests
from bs4 import BeautifulSoup

url = 'https://read.douban.com/search?q={}&start={}'
url2 = 'https://read.douban.com'

# Set the request headers
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
headers['Host'] = 'read.douban.com'
headers['Connection'] = 'keep-alive'
# 'br' is left out: requests only decodes Brotli when a Brotli package is installed
headers['Accept-Encoding'] = 'gzip, deflate'
headers['Accept-Language'] = 'zh-CN,zh;q=0.8'


def get_bookcon(url):
    # Send the same headers as the search request
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Get the book's introduction
    print('Introduction')
    for soup3 in soup.find_all('div', class_='article-profile-intro article-abstract collapse-context'):
        for soup4 in soup3.find_all('p'):
            print(soup4.text)
    print('\n\n')
    # Get the book's most popular highlights
    print('Popular highlights')
    for soup3 in soup.find_all('div', class_='popular-annotations'):
        for soup4 in soup3.find_all('li'):
            print('“', soup4.text, '”', '\n')
    print('\n')


def get_message(soup2, bookname):
    result = {}
    titles = soup2.find_all('div', class_='title')
    authors = soup2.find_all('a', class_='author-item')
    # Print the details of every book found, including its rating
    if len(titles) > 0 and len(authors) > 0:
        # Use a separate name so the user's input in bookname is not overwritten
        title = titles[0].text.strip()
        booklink = url2 + titles[0].find_all('a')[0]['href']  # link to the book's detail page
        bookauthor = authors[0].text.strip()
        ratings = soup2.find_all('span', class_='rating-average')
        if len(ratings) > 0:
            bookscore = float(ratings[0].text)
        else:
            print('This book has no rating yet; defaulting to 0')
            bookscore = 0
        print('Title: 《', title, '》', '\tAuthor:', bookauthor, '\tRating:', bookscore, '\n\n')
        print(booklink)
        get_bookcon(booklink)
        # Store only the books whose title equals what the user typed
        if title == bookname:
            result['Title'] = title
            result['Author'] = bookauthor
            result['Rating'] = bookscore
    return result


bookname = input('Enter the book title/author to search for: ').strip()
booklist2 = []


def search_book(bookname):
    session = requests.Session()
    i = 1
    for num in range(0, 50, 10):  # by default, search the first five pages of results
        print('Book results, page', i)
        newurl = url.format(str(bookname), num)
        bookhtml = session.get(newurl, headers=headers)
        soup = BeautifulSoup(bookhtml.text, 'lxml')
        for soup2 in soup.find_all('li', class_='item store-item'):
            booklist2.append(get_message(soup2, bookname))
            time.sleep(2)  # pause after each detail-page request to reduce the risk of an IP ban
        i = i + 1


search_book(bookname)
df2 = pandas.DataFrame(booklist2, columns=['Title', 'Author', 'Rating'])
df2 = df2.dropna(how='any')  # drop rows with missing values; dropna returns a new DataFrame
print(df2)
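The repeated `len(soup2.find_all(...)) > 0` checks in the script can be shortened with `find`, which returns `None` when nothing matches. Below is a sketch on a hand-written HTML fragment. The markup is invented to mirror the class names used in the script and is not copied from Douban, and `parse_item` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the real page structure may differ.
SAMPLE_HTML = '''
<ul>
  <li class="item store-item">
    <div class="title"><a href="/ebook/1/">Sample Book</a></div>
    <a class="author-item">Sample Author</a>
    <span class="rating-average">8.5</span>
  </li>
  <li class="item store-item">
    <div class="title"><a href="/ebook/2/">Unrated Book</a></div>
    <a class="author-item">Another Author</a>
  </li>
</ul>
'''


def parse_item(item):
    """Extract title, author and rating from one result <li>; rating is None if absent."""
    title_tag = item.find('div', class_='title')
    author_tag = item.find('a', class_='author-item')
    if title_tag is None or author_tag is None:
        return None
    rating_tag = item.find('span', class_='rating-average')
    rating = float(rating_tag.get_text(strip=True)) if rating_tag is not None else None
    return {
        'title': title_tag.get_text(strip=True),
        'author': author_tag.get_text(strip=True),
        'rating': rating,
    }


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
books = [parse_item(li) for li in soup.find_all('li', class_='item store-item')]
for book in books:
    print(book)
```

Returning `None` for a missing rating, instead of the script's default of 0, keeps "no rating yet" distinct from a genuinely low score. That is a design choice for this sketch, not something the original script does.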

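Query results
The book information found is printed as a table
Testing and suggestions for improvement from more experienced folks are welcome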
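One detail worth knowing when building that table: `DataFrame.dropna` returns a new DataFrame and leaves the original unchanged unless the result is assigned back. A small sketch with made-up rows, where the empty dict stands in for a search result that produced no data:

```python
import pandas

# Made-up rows for illustration; the empty dict becomes a row of NaN values.
rows = [
    {'Title': 'Sample Book', 'Author': 'Sample Author', 'Rating': 8.5},
    {},
]
df = pandas.DataFrame(rows, columns=['Title', 'Author', 'Rating'])

cleaned = df.dropna(how='any')  # new frame; df itself still has both rows
print(cleaned)
```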