[Python Web Scraping, Part 2] A Book Lookup Tool Built by Scraping Douban
阿新 • Published: 2019-01-23
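I have been studying web scraping for about half a month. Honestly, all I have really learned in that time is how to use requests and BeautifulSoup, which is a little embarrassing. I have not touched the Scrapy framework yet, but BeautifulSoup alone already covers a lot. I have also used pandas in a basic way to tidy up the scraped data, though I am not yet at the point of analyzing it.
Because I did not know any better, I had not realized that scraping too fast can get your IP banned. One day Douban simply stopped returning anything. I searched Baidu and learned that I needed to add request headers and a wait between requests. (To be honest, I really did not want to add the wait: it is slow, and I do not know how to do distributed crawling, so efficiency is poor. But to avoid another ban, I added it.) I am not sure whether the headers I added actually help. I have also seen cookies mentioned online, but I still have not figured out how to use them, and I do not know how to set up the IP pool a senior classmate suggested. So I am still very much a beginner. I then wrote a simple book lookup tool built on a scraper. It is not polished yet, but I am posting it anyway.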
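Before the full script, the header-plus-delay idea can be sketched on its own. This is only a minimal sketch under a few assumptions: the names `DEFAULT_HEADERS`, `make_session`, and `polite_get`, the delay bounds, and the 10-second timeout are all made up for illustration and are not part of the original script. It relies on the fact that a `requests.Session` keeps any cookies the server sets and sends them back on later requests, which is the simplest way to get cookie handling without managing them by hand.

```python
import random
import time

import requests

# Hypothetical defaults for illustration; adjust to the site being scraped.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/59.0.3071.115 Safari/537.36"
    ),
    "Accept-Language": "zh-CN,zh;q=0.8",
}


def make_session(headers=None):
    """Create a Session that sends the same headers on every request.

    The Session also stores cookies set by the server and returns them
    automatically on subsequent requests.
    """
    session = requests.Session()
    session.headers.update(headers or DEFAULT_HEADERS)
    return session


def polite_get(session, url, min_delay=1.0, max_delay=3.0,
               sleep=time.sleep, **kwargs):
    """Wait a random interval, then fetch the URL with a timeout."""
    delay = random.uniform(min_delay, max_delay)
    sleep(delay)  # randomized pause so requests are not evenly spaced
    kwargs.setdefault("timeout", 10)  # assumed value; avoids hanging forever
    return session.get(url, **kwargs)
```

With this in place, one session can be created once and passed to `polite_get` for every page, so the headers, cookies, and delay are applied consistently. It does not address IP pools, which would need proxies and are outside this sketch.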
# Look up a book's Douban rating and related information from a title entered by the user
import time
import pandas
import requests
from bs4 import BeautifulSoup
url = 'https://read.douban.com/search?q={}&start={}'
url2 = 'https://read.douban.com'
# Set the request headers
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
headers['Host'] = 'read.douban.com'
headers['Connection'] = 'keep-alive'
headers['Accept-Encoding'] = 'gzip, deflate'  # no 'br': requests decodes Brotli only if a brotli package is installed
headers['Accept-Language'] = 'zh-CN,zh;q=0.8'
def get_bookcon(url):
    """Print the introduction and popular highlights from a book's detail page."""
    res = requests.get(url, headers=headers)  # send the same headers as the search request
    soup = BeautifulSoup(res.text, 'html.parser')
    # Introduction
    print('Introduction')
    for soup3 in soup.find_all('div', class_='article-profile-intro article-abstract collapse-context'):
        for soup4 in soup3.find_all('p'):
            print(soup4.text)
    print('\n\n')
    # Popular highlights
    print('Popular highlights')
    for soup3 in soup.find_all('div', class_='popular-annotations'):
        for soup4 in soup3.find_all('li'):
            print('“', soup4.text, '”', '\n')
    print('\n')
def get_message(soup2, bookname):
    """Print one search hit; return its fields only if its title equals bookname."""
    result = {}
    titles = soup2.find_all('div', class_='title')
    authors = soup2.find_all('a', class_='author-item')
    # Only handle entries that have both a title and an author
    if len(titles) > 0 and len(authors) > 0:
        # Keep the scraped title in its own variable so it can be compared with the user's input
        title = titles[0].text.strip()
        booklink = url2 + titles[0].find_all('a')[0]['href']  # link to the book's detail page
        bookauthor = authors[0].text.strip()
        scores = soup2.find_all('span', class_='rating-average')
        if len(scores) > 0:
            bookscore = float(scores[0].text)
        else:
            print('This book has no rating yet; defaulting to 0')
            bookscore = 0
        print('Title: 《', title, '》', '\tAuthor:', bookauthor, '\tRating:', bookscore, '\n\n')
        print(booklink)
        get_bookcon(booklink)
        # Store the record only when the title is the same as the name the user entered
        if title == bookname:
            result['書名'] = title        # title
            result['作者'] = bookauthor   # author
            result['評分'] = bookscore    # rating
    return result
bookname = input('Enter the book title/author to look up: ')
booklist2 = []
def search_book(bookname):
    session = requests.Session()
    i = 1
    for num in range(0, 50, 10):  # by default, scan the first five pages of results
        print('Page', i, 'of results')
        newurl = url.format(str(bookname), num)
        bookhtml = session.get(newurl, headers=headers)
        soup = BeautifulSoup(bookhtml.text, 'lxml')  # the 'lxml' parser needs the lxml package
        for soup2 in soup.find_all('li', class_='item store-item'):
            booklist2.append(get_message(soup2, bookname))
        time.sleep(2)  # pause between pages to lower the risk of an IP ban
        i = i + 1
search_book(bookname)
# Column names: 書名 = title, 作者 = author, 評分 = rating
df2 = pandas.DataFrame(booklist2, columns=['書名', '作者', '評分'])
df2 = df2.dropna(how='any')  # drop rows with missing values; dropna returns a new DataFrame
print(df2)
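One detail about the last step is easy to miss: `DataFrame.dropna` returns a new frame and leaves the original unchanged unless the result is assigned back. The snippet below is a small offline sketch of that behavior. The records are made-up sample data, and the empty dict stands in for a search hit for which `get_message` returned nothing.

```python
import pandas

# Made-up sample records, not real Douban data.
# The empty dict mimics a search hit that was not kept.
records = [
    {'書名': 'Book A', '作者': 'Author A', '評分': 8.5},
    {},
    {'書名': 'Book B', '作者': 'Author B', '評分': 0},
]

df = pandas.DataFrame(records, columns=['書名', '作者', '評分'])

# dropna() returns a new DataFrame; df itself still holds the all-NaN row.
cleaned = df.dropna(how='any')

print(len(df))       # 3
print(len(cleaned))  # 2
```

Note that a rating of 0 is a real value, not a missing one, so the unrated "Book B" row survives the cleanup.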
Testing and corrections from more experienced readers are welcome.