
Scraping Douban Book Information with Python

Following the Maoyan Movies TOP100 scraper, let's scrape Douban's book information (mainly the book details and the rating distribution; reviews are not scraped). Original work; please contact me before reposting.


Goal: scrape the details and ratings of every book under a given Douban tag

Language: Python

Supporting libraries:

  • Regex, HTTP requests, and HTML parsing: re, requests, bs4, lxml (the last three need to be installed)
  • Delays and random choice: time, random

Steps (three in total):

  1. Visit the tag page and collect the links to every book under that tag
  2. Visit each book link in turn and scrape the book's details and ratings
  3. Persist the book info (Excel here; a database would also work)

1. Visit the tag page and collect the links to all books under the tag

As usual, check Douban's robots.txt first; we must not scrape anything it disallows.
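This check can be automated with the standard library's urllib.robotparser. A minimal sketch — the rules below are illustrative only, not a copy of Douban's live file (fetch https://www.douban.com/robots.txt for the real rules, e.g. via rp.set_url(...) and rp.read()):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Illustrative rules only, not Douban's actual robots.txt
rp.parse([
    "User-agent: *",
    "Disallow: /search",
])
print(rp.can_fetch("*", "https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4"))  # True
print(rp.can_fetch("*", "https://book.douban.com/search"))                  # False
```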

The tag page we want to scrape in this step, using the novel tag as an example: https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4

Let's look at its HTML structure first.

Each book sits inside an <li> tag, and all we need is the link on the cover image (which is the link to the book's page).

With that, we can either write a regex or use bs4 (BeautifulSoup) to extract the book links.
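As a quick sketch of the regex route, here is how an href can be pulled out of one such <li>. The HTML snippet is a trimmed, assumed version of the list markup, not a live copy of the page:

```python
import re

# Trimmed sample of the tag page's list markup (assumed shape, not a live copy)
li_html = '''
<li class="subject-item">
  <div class="info">
    <h2><a href="https://book.douban.com/subject/1770782/">A Book</a></h2>
  </div>
</li>
'''
# Grab the href of the <a> directly inside each <h2>
pattern = re.compile(r'<h2>\s*<a href="(.*?)"', re.S)
print(pattern.findall(li_html))  # ['https://book.douban.com/subject/1770782/']
```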

Each page shows only 20 books, so we need to iterate over all the pages; fortunately the page URLs follow a pattern.

Page 2: https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T

Page 3: https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=40&type=T

That is, start simply increases by 20 per page.
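So the full set of page URLs for a tag can be generated up front, for example:

```python
# Generate the paged URLs for the novel tag (小說); swap the tag for other categories
base = 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4'
urls = [base + '?start=' + str(page * 20) + '&type=T' for page in range(50)]
print(urls[0])  # page 1: ...?start=0&type=T
print(urls[1])  # page 2: ...?start=20&type=T
```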

Here's the code:

# -*- coding: utf-8 -*-
# @Author  : yocichen
# @Email   : [email protected]
# @File    : labelListBooks.py
# @Software: PyCharm
# @Time    : 2019/11/11 20:10

import re
import openpyxl
import requests
from requests import RequestException
from bs4 import BeautifulSoup
import lxml  # not used directly; confirms the lxml parser for bs4 is installed
import time
import random

def get_one_page(url):
    '''
    Get the html of a page by requests module
    :param url: page url
    :return: html / None
    '''
    try:
        head = ['Mozilla/5.0', 'Chrome/78.0.3904.97', 'Safari/537.36']
        headers = {
            'user-agent': random.choice(head)
        }
        # The proxy is optional; if it stops working, drop it or swap in another one
        response = requests.get(url, headers=headers, proxies={'http': '171.15.65.195:9999'})
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

def get_page_src(html, selector):
    '''
    Get book src links from a label page
    :param html: label page's html text
    :param selector: selector for the <h2> holding each book link
    :return: list of src links
    '''
    if html is not None:
        soup = BeautifulSoup(html, 'lxml')
        res = soup.select(selector)
        pattern = re.compile('href="(.*?)"', re.S)
        src = re.findall(pattern, str(res))
        return src
    else:
        return []

def write_excel_xlsx(items, file):
    '''
    Write the useful info into an excel (*.xlsx) file
    :param items: book info items
    :param file: excel file to store them
    :return: the num of items written
    '''
    wb = openpyxl.load_workbook(file)
    ws = wb.worksheets[0]
    sheet_row = ws.max_row
    item_num = len(items)
    # Write each item below the last used row
    for i in range(0, item_num):
        ws.cell(sheet_row + i + 1, 1).value = items[i]
    # Save the workbook as *.xlsx
    wb.save(file)
    return item_num

if __name__ == '__main__':
    total = 0
    # Why 50 pages? Douban appears to have many more, but later pages return
    # no data; only 50 pages are accessible at present.
    for page_index in range(0, 50):
        # novel label src : https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=
        # program label src : https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=
        # computer label src : https://book.douban.com/tag/%E8%AE%A1%E7%AE%97%E6%9C%BA?start=
        # masterpiece label src : https://book.douban.com/tag/%E5%90%8D%E8%91%97?start=
        # To scrape another tag, replace the percent-encoded tag in the URL below
        # with the one you want (see the examples above).
        url = 'https://book.douban.com/tag/%E5%90%8D%E8%91%97?start=' + str(page_index * 20) + '&type=T'
        one_loop_done = 0
        # only fetch the html once per page of 20 books
        html = get_one_page(url)
        for book_index in range(1, 21):
            selector = '#subject_list > ul > li:nth-child(' + str(book_index) + ') > div.info > h2'
            src = get_page_src(html, selector)
            row = write_excel_xlsx(src, 'masterpiece_books_src.xlsx')  # output file; create it beforehand
            one_loop_done += row
        total += one_loop_done
        print(one_loop_done, 'done')
    print('Total', total, 'done')

The comments should make it clear enough: fetch each page's HTML once, then use a regex or bs4 to collect the book links on that page, and store them in an Excel file.

Note: to use my code directly, you only need to look at the URL of your tag page and substitute its percent-encoded Chinese label into the URL, and create an Excel file in advance to store the scraped book links.
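Creating that empty workbook takes a single openpyxl call; a minimal sketch, using the same filename as the script above:

```python
import openpyxl

# Create the (initially empty) workbook the scraper will append rows to
wb = openpyxl.Workbook()
wb.save('masterpiece_books_src.xlsx')
```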


2. Visit each book link in turn and scrape its details and ratings

In the previous step we collected the src of every book under the novel tag; this step visits each src in turn and scrapes the book's detailed information.

First, look at the HTML structure of the info we want to scrape.

Below is the structure of the book-info section,

and here is the structure of the rating section.

With those, we can use regexes together with the bs4 library to match the data we need. (I tried pure regex alone; it was hard to write and didn't work out.)

Here's the code:

# -*- coding: utf-8 -*-
# @Author  : yocichen
# @Email   : [email protected]
# @File    : doubanBooks.py
# @Software: PyCharm
# @Time    : 2019/11/9 11:38

import re
import openpyxl
import requests
from requests import RequestException
from bs4 import BeautifulSoup
import lxml  # not used directly; confirms the lxml parser for bs4 is installed
import time
import random

def get_one_page(url):
    '''
    Get the html of a page by requests module
    :param url: page url
    :return: html / None
    '''
    try:
        head = ['Mozilla/5.0', 'Chrome/78.0.3904.97', 'Safari/537.36']
        headers = {
            'user-agent': random.choice(head)
        }
        response = requests.get(url, headers=headers)  # optionally add proxies={'http': '171.15.65.195:9999'}
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

def get_request_res(pattern_text, html):
    '''
    Get one piece of book info by re module
    :param pattern_text: re pattern
    :param html: page's html text
    :return: book's info
    '''
    pattern = re.compile(pattern_text, re.S)
    res = re.findall(pattern, html)
    if len(res) > 0:
        # drop the leading space and anything after the next tag
        return res[0].split('<', 1)[0][1:]
    else:
        return 'NULL'

def get_bs_res(selector, html):
    '''
    Get one piece of book info by bs4 module
    :param selector: info selector
    :param html: page's html text
    :return: book's info
    '''
    soup = BeautifulSoup(html, 'lxml')
    res = soup.select(selector)
    if res is None or len(res) == 0:
        return 'NULL'
    return res[0].string

# Get other info (e.g. the cover <img> tag) by bs4 module
def get_bs_img_res(selector, html):
    soup = BeautifulSoup(html, 'lxml')
    res = soup.select(selector)
    if len(res) != 0:
        return str(res[0])
    else:
        return 'NULL'

def parse_one_page(html):
    '''
    Parse the useful info out of a book page's html
    :param html: page's html text
    :return: all of the book's info (dict)
    '''
    book_info = {}
    book_name = get_bs_res('div > h1 > span', html)
    book_info['Book_name'] = book_name

    author = get_bs_res('div > span:nth-child(1) > a', html)
    if author is None:
        author = get_bs_res('#info > a:nth-child(2)', html)
    author = author.replace(" ", "").replace("\n", "")
    book_info['Author'] = author

    publisher = get_request_res(u'出版社:</span>(.*?)<br/>', html)
    book_info['publisher'] = publisher

    publish_time = get_request_res(u'出版年:</span>(.*?)<br/>', html)
    book_info['publish_time'] = publish_time

    ISBN = get_request_res(u'ISBN:</span>(.*?)<br/>', html)
    book_info['ISBN'] = ISBN

    img_label = get_bs_img_res('#mainpic > a > img', html)
    pattern = re.compile('src="(.*?)"', re.S)
    img = re.findall(pattern, img_label)
    if len(img) != 0:
        book_info['img_src'] = img[0]
    else:
        book_info['img_src'] = 'NULL'

    book_intro = get_bs_res('#link-report > div:nth-child(1) > div > p', html)
    book_info['book_intro'] = book_intro

    author_intro = get_bs_res('#content > div > div.article > div.related_info > div:nth-child(4) > div > div > p', html)
    book_info['author_intro'] = author_intro

    grade = get_bs_res('div > div.rating_self.clearfix > strong', html)
    if grade is None or len(grade) == 1:
        # no score on the page
        book_info['Score'] = 'NULL'
    else:
        # strip the leading whitespace
        book_info['Score'] = grade[1:]

    comment_num = get_bs_res('#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span', html)
    book_info['comments'] = comment_num

    five_stars = get_bs_res('#interest_sectl > div > span:nth-child(5)', html)
    book_info['5_stars'] = five_stars

    four_stars = get_bs_res('#interest_sectl > div > span:nth-child(9)', html)
    book_info['4_stars'] = four_stars

    three_stars = get_bs_res('#interest_sectl > div > span:nth-child(13)', html)
    book_info['3_stars'] = three_stars

    two_stars = get_bs_res('#interest_sectl > div > span:nth-child(17)', html)
    book_info['2_stars'] = two_stars

    one_stars = get_bs_res('#interest_sectl > div > span:nth-child(21)', html)
    book_info['1_stars'] = one_stars

    return book_info

def write_bookinfo_excel(book_info, file):
    '''
    Write one book's info into an excel file
    :param book_info: a dict of one book's info
    :param file: excel file to store it
    :return: the num of rows written (0 or 1)
    '''
    wb = openpyxl.load_workbook(file)
    ws = wb.worksheets[0]
    sheet_row = ws.max_row
    # one book per row, one field per column, appended below the last used row
    j = 1
    for key in book_info:
        ws.cell(sheet_row + 1, j).value = book_info[key]
        j += 1
    done = ws.max_row - sheet_row
    wb.save(file)
    return done

def read_booksrc_get_info(src_file, info_file):
    '''
    Read the src file, access each src, parse the html and write the info into a file
    :param src_file: src file from step one
    :param info_file: excel file to store the book info
    :return: the num of successful items
    '''
    wb = openpyxl.load_workbook(src_file)
    ws = wb.worksheets[0]
    row = ws.max_row
    done = 0
    # NOTE: 868 resumes a previous interrupted run; use range(1, row + 1) for a fresh run
    for i in range(868, row + 1):
        src = ws.cell(i, 1).value
        if src is None:
            continue
        html = get_one_page(str(src))
        book_info = parse_one_page(html)
        done += write_bookinfo_excel(book_info, info_file)
        if done % 10 == 0:
            print(done, 'done')
    return done

if __name__ == '__main__':
    # To test a single book page:
    # html = get_one_page('https://book.douban.com/subject/1770782/')
    # print(parse_one_page(html))

    # args: the src file from step one, and the output file for book info (create it beforehand)
    res = read_booksrc_get_info('masterpiece_books_src.xlsx', 'masterpiece_books_info.xlsx')
    print(res, 'done')

Note: to use this directly, all you need to do is supply the arguments: the first is the src file produced in the previous step, the second is the file to store the book info (create it beforehand).


3. Persist the book info (Excel)

Excel files store both the book src list and the detailed book info; reading and writing them uses the openpyxl library. The code is in the write_*/read_* functions above.
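The append pattern used by write_bookinfo_excel boils down to: find the last used row, then write one field per column on the next row. A self-contained sketch with a hypothetical record and a throwaway file name:

```python
import openpyxl

wb = openpyxl.Workbook()
ws = wb.worksheets[0]
book_info = {'Book_name': 'Example', 'Author': 'Somebody', 'Score': '8.8'}  # hypothetical record

sheet_row = ws.max_row  # last used row (1 for a fresh sheet)
for col, key in enumerate(book_info, start=1):
    ws.cell(sheet_row + 1, col).value = book_info[key]
wb.save('demo_books_info.xlsx')

print(ws.max_row - sheet_row)  # 1 row appended
```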


Results

The scraped src list for novel-tag books:

And the scraped book details:

Afterword

This took about two full days start to finish. Crawling involves fairly detailed work: you have to analyze the HTML pages and write regular expressions. That said, bs4 is really easy to use: just copy the selector and you're done, which is far more efficient than regex. On the other hand, a single-threaded crawler like this is fairly dumb. There are still plenty of shortcomings (the code is messy and not very robust); corrections are welcome.

References

[1] Douban robots.txt https://www.douban.com/robots.txt

[2] https://blog.csdn.net/jerrygaoling/article/details/81051447

[3] https://blog.csdn.net/zhangfn2011/article/details/7821642

[4] https://www.kuaidaili.com/free