requests + re: Scraping Book Information from a Website (Python)

阿新 • Published: 2019-01-18

# -*- coding: utf-8 -*-

import requests
import re, json

if __name__ == '__main__':
    # Download the Douban Books home page as text
    content = requests.get('https://book.douban.com/').text
    # Stage 1: capture the inner HTML of every matching <ul> book list
    reg_base = '<ul.*?list-col list-col5 list-express slide-item">(.*?)</ul>'
    base_pattern = re.compile(reg_base, re.S)
    base_html = re.findall(base_pattern, content)

    # Stage 2: within each <li>, capture the link, the title and the author
    href = '<li.*?cover.*?href="(.*?)".*?'
    title = '<div.*?title.*?title.*?>(.*?)</a>.*?'
    author = '<div.*?more-meta.*?author.*?>(.*?)</span>.*?</li>'
    regex = href + title + author
    pattern = re.compile(regex, re.S)
    results = []
    # Match the book information in the child tags of each list
    for html in base_html:
        results += re.findall(pattern, html)

    # Write one JSON object per line, keeping non-ASCII characters readable
    with open('touban.txt', 'w', encoding='utf-8') as f:
        for result in results:
            d = {
                'href': result[0].strip(),
                'title': result[1].strip(),
                'author': result[2].replace(' ', '').strip()
            }

            f.write(json.dumps(d, ensure_ascii=False) + '\n')
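
The two-stage matching in the script (first isolate each book list, then pull the link, title and author out of every list item) can be tried without touching the network. The sketch below reuses the same regular expressions on a small inline page. SAMPLE_HTML, the example.com links, the book names and the helper extract_books are all hypothetical, and the markup is only an assumption about the structure the patterns expect, not a copy of the real page. Unlike the script, it only trims surrounding whitespace from the author.

```python
import json
import re

# Hypothetical markup shaped like the structure the patterns expect.
SAMPLE_HTML = '''
<ul class="list-col list-col5 list-express slide-item">
  <li class="">
    <div class="cover">
      <a href="https://example.com/subject/1/" title="Book One"><img src="1.jpg"></a>
    </div>
    <div class="info">
      <div class="title">
        <a class="" href="https://example.com/subject/1/" title="Book One">Book One</a>
      </div>
      <div class="more-meta">
        <h4 class="title">Book One</h4>
        <p><span class="author">
          Author A
        </span></p>
      </div>
    </div>
  </li>
  <li class="">
    <div class="cover">
      <a href="https://example.com/subject/2/" title="Book Two"><img src="2.jpg"></a>
    </div>
    <div class="info">
      <div class="title">
        <a class="" href="https://example.com/subject/2/" title="Book Two">Book Two</a>
      </div>
      <div class="more-meta">
        <h4 class="title">Book Two</h4>
        <p><span class="author">
          Author B
        </span></p>
      </div>
    </div>
  </li>
</ul>
'''

# Stage 1: the inner HTML of each book list (same pattern as the script).
BLOCK_PATTERN = re.compile(
    '<ul.*?list-col list-col5 list-express slide-item">(.*?)</ul>', re.S)

# Stage 2: link, title and author of one <li> (same pattern as the script).
BOOK_PATTERN = re.compile(
    '<li.*?cover.*?href="(.*?)".*?'
    '<div.*?title.*?title.*?>(.*?)</a>.*?'
    '<div.*?more-meta.*?author.*?>(.*?)</span>.*?</li>', re.S)


def extract_books(page):
    """Return one dict per book found in the given HTML text."""
    books = []
    for block in BLOCK_PATTERN.findall(page):
        for href, title, author in BOOK_PATTERN.findall(block):
            books.append({
                'href': href.strip(),
                'title': title.strip(),
                'author': author.strip(),
            })
    return books


books = extract_books(SAMPLE_HTML)
for book in books:
    print(json.dumps(book, ensure_ascii=False))
```

Because every quantifier is lazy and re.S lets the dot cross line breaks, each group stops at the first closing delimiter that still allows the rest of the pattern to match. That also makes the patterns sensitive to small markup changes, so they may need adjusting if the page layout differs.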

Contents of touban.txt after running the scraper script (one JSON object per line):

{"author": "[法] 米歇爾·普西", "href": "https://book.douban.com/subject/30180673/?icn=index-editionrecommend", "title": "她不是我媽媽"}
{"author": "[意]馬西米利亞諾·威爾吉利奧", "href": "https://book.douban.com/subject/30180821/?icn=index-editionrecommend", "title": "那不勒斯的螢火"}
{"author": "於蕾，呂逸濤", "href": "https://book.douban.com/subject/30206904/?icn=index-editionrecommend", "title": "國家寶藏"}
{"author": "張立民", "href": "https://book.douban.com/subject/30235899/?icn=index-editionrecommend", "title": "最後一公里的哲學：電商物流全鏈條運營管理"}
{"author": "【英】詹姆斯•霍尼伯內（James Honeyborne）/【英】馬克•布朗羅（Mark Brownlow）", "href": "https://book.douban.com/subject/30183403/?icn=index-editionrecommend", "title": "BBC全新4K海洋百科：藍色星球II"}
{"author": "[葡] 若澤·薩拉馬戈", "href": "https://book.douban.com/subject/27598520/?icn=index-latestbook-subject", "title": "裡卡爾多·雷耶斯離世那年"}
{"author": "[美] 史蒂芬·平克", "href": "https://book.douban.com/subject/30186025/?icn=index-latestbook-subject", "title": "風格感覺"}
{"author": "趙壘", "href": "https://book.douban.com/subject/30204837/?icn=index-latestbook-subject", "title": "傀儡城之荊軻刺秦"}
{"author": "梅貽琦/黃延復/王小寧", "href": "https://book.douban.com/subject/30197575/?icn=index-latestbook-subject", "title": "梅貽琦西南聯大日記"}
{"author": "[日] 永井荷風", "href": "https://book.douban.com/subject/30171301/?icn=index-latestbook-subject", "title": "濹東綺譚"}
{"author": "[波蘭] 安傑伊·瓦伊達/Andrzej Wajda", "href": "https://book.douban.com/subject/30211002/?icn=index-latestbook-subject", "title": "我們一起拍片！"}
{"author": "[德] 弗蘭克·施茨廷", "href": "https://book.douban.com/subject/27604676/?icn=index-latestbook-subject", "title": "群"}
{"author": "[美] 克麗絲特爾·潘恩/Crystal Paine", "href": "https://book.douban.com/subject/30206819/?icn=index-latestbook-subject", "title": "會賺錢的媽媽"}
{"author": "[日] 石田衣良", "href": "https://book.douban.com/subject/27622428/?icn=index-latestbook-subject", "title": "美麗的孩子"}
{"author": "楊時暘", "href": "https://book.douban.com/subject/30218577/?icn=index-latestbook-subject", "title": "孤獨的影獵人"}
{"author": "[德]沃爾夫岡·赫倫多夫", "href": "https://book.douban.com/subject/27598521/?icn=index-latestbook-subject", "title": "小心，沙漠有人"}
{"author": "[英] 珍妮特·溫特森", "href": "https://book.douban.com/subject/27663541/?icn=index-latestbook-subject", "title": "我要快樂，不必正常"}
{"author": "朱一葉", "href": "https://book.douban.com/subject/30198364/?icn=index-latestbook-subject", "title": "死於象蹄"}
{"author": "[荷] 伊恩·布魯瑪", "href": "https://book.douban.com/subject/27662697/?icn=index-latestbook-subject", "title": "日本之鏡"}
{"author": "[美] 威廉·莫爾頓·馬斯頓", "href": "https://book.douban.com/subject/30210732/?icn=index-latestbook-subject", "title": "神奇女俠"}
{"author": "[美] 特德·焦亞", "href": "https://book.douban.com/subject/30203912/?icn=index-latestbook-subject", "title": "如何聽爵士"}
{"author": "鄧安慶", "href": "https://book.douban.com/subject/30221630/?icn=index-latestbook-subject", "title": "紙上王國"}
{"author": "朱偉", "href": "https://book.douban.com/subject/30205589/?icn=index-latestbook-subject", "title": "重讀八十年代"}
{"author": "鄧安慶", "href": "https://book.douban.com/subject/30190319/?icn=index-latestbook-subject", "title": "望花"}
{"author": "[美]沃爾特·李普曼", "href": "https://book.douban.com/subject/27662713/?icn=index-latestbook-subject", "title": "輿論"}
{"author": "[英] P•D•詹姆斯", "href": "https://book.douban.com/subject/27111572/?icn=index-latestbook-subject", "title": "人類之子"}
{"author": "駱儀", "href": "https://book.douban.com/subject/30198500/?icn=index-latestbook-subject", "title": "京都好物"}
{"author": "(美) 比爾·克林頓 (Bill Clinton)/[美] 詹姆斯·帕特森", "href": "https://book.douban.com/subject/30218923/?icn=index-latestbook-subject", "title": "失蹤的總統"}
{"author": "劉冰/林秦文/李敏", "href": "https://book.douban.com/subject/30203973/?icn=index-latestbook-subject", "title": "中國常見植物野外識別手冊（北京冊）"}
{"author": "冶文彪", "href": "https://book.douban.com/subject/30205286/?icn=index-latestbook-subject", "title": "清明上河圖密碼 5"}
{"author": "[英] 勞拉·卡琳/Laura Carlin", "href": "https://book.douban.com/subject/30181220/?icn=index-latestbook-subject", "title": "創造自己的世界"}
{"author": "郭強生", "href": "https://book.douban.com/subject/30217599/?icn=index-latestbook-subject", "title": "斷代"}
{"author": "史傑鵬", "href": "https://book.douban.com/subject/30183948/?icn=index-latestbook-subject", "title": "悠悠我心"}
{"author": "[俄] 柳德米拉·烏利茨卡婭", "href": "https://book.douban.com/subject/30205823/?icn=index-latestbook-subject", "title": "庫科茨基醫生的病案"}
{"author": "[美] 蘭德爾·柯林斯", "href": "https://book.douban.com/subject/30143236/?icn=index-latestbook-subject", "title": "文憑社會"}
{"author": "[法] 讓-皮埃爾·吉布拉", "href": "https://book.douban.com/subject/30205166/?icn=index-latestbook-subject", "title": "愛的緩刑"}
{"author": "[美]麗貝卡·特雷斯特", "href": "https://book.douban.com/subject/30128172/?icn=index-latestbook-subject", "title": "單身女性的時代"}
{"author": "[俄] 弗拉基米爾·索羅金", "href": "https://book.douban.com/subject/27200259/?icn=index-latestbook-subject", "title": "碲釘國"}
{"author": "蘇精", "href": "https://book.douban.com/subject/30218894/?icn=index-latestbook-subject", "title": "鑄以代刻"}
{"author": "[英] 石黑一雄", "href": "https://book.douban.com/subject/30181685/?icn=index-latestbook-subject", "title": "莫失莫忘"}
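
Since each line of the output is a standalone JSON object, the file can be read back one line at a time. The following is a minimal sketch of that; the helper name load_records is hypothetical, and an in-memory io.StringIO holding two of the lines shown above stands in for the real file.

```python
import io
import json


def load_records(lines):
    """Parse JSON Lines input: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


# Stand-in for the real file; with the actual output one would pass
# open('touban.txt', encoding='utf-8') instead.
sample = io.StringIO(
    '{"author": "[德] 弗蘭克·施茨廷", "href": "https://book.douban.com/subject/27604676/?icn=index-latestbook-subject", "title": "群"}\n'
    '{"author": "朱一葉", "href": "https://book.douban.com/subject/30198364/?icn=index-latestbook-subject", "title": "死於象蹄"}\n'
)

records = load_records(sample)
for record in records:
    print(record['title'], '|', record['author'])
```

Reading line by line keeps memory use low for large files and lets a single malformed line be isolated without discarding the rest.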
