Scraping 《三國演義》 (Romance of the Three Kingdoms) with BeautifulSoup4 in Python 3

阿新 • Published: 2017-05-30


#!/usr/bin/python
# coding=utf-8
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.shicimingju.com/book/sanguoyanyi.html"  # table-of-contents page to scrape
menuCode = urllib.request.urlopen(url).read()  # raw HTML of the page
soup = BeautifulSoup(menuCode, 'html.parser')  # parse with the built-in HTML parser
menu = soup.find_all(id="mulu")  # nodes whose id is "mulu" (the chapter list)
values = ','.join(str(v) for v in menu)  # serialize the matched nodes back to a str
soup2 = BeautifulSoup(values, 'html.parser')
soup2 = soup2.ul  # keep only the <ul> child node
print("-------------------soup2.contents----------------------------")
print(soup2.contents)
bookName = soup.h1.string  # book title
print("----------------------'Title'------------------------------")
print("Title: " + bookName)
f = open('D://' + bookName + '.doc', 'a', encoding='utf8')
f.write(bookName + "\n")  # write the title
Desc = soup.p.get_text()  # synopsis
f.write(Desc + "\n")  # write the synopsis
print("---------------------'Synopsis'------------------------------")
print(Desc)
bookMenu = []  # chapter titles
bookMenuUrl = []  # chapter URLs
# Iterate over the links inside the <ul> instead of indexing soup2.contents:
# contents can also hold whitespace text nodes, which have no <a> child.
print("----------------------------Chapters and their URLs----------------------------")
for a in soup2.find_all('a', href=True):  # collect each chapter in order
    title = a.get_text()
    bookMenu.append(title)
    bookMenuUrl.append(a['href'])
    con = 'Chapter: %s, URL: %s' % (title, a['href'])
    print(con)
    f.write(con + "\n")  # write the chapter title and its URL
# Fetch the chapter text:
"""
Walk the chapter URLs and fetch the page content behind each one.
"""
urlBegin = "http://www.shicimingju.com"  # site root, prepended to each chapter href
for i in range(0, len(bookMenuUrl)):  # visit each chapter page in turn
    chapterCode = urllib.request.urlopen(urlBegin + bookMenuUrl[i]).read()  # build the full URL and download the page
    chapterSoup = BeautifulSoup(chapterCode, 'html.parser')  # parse the chapter page
    chapterResult = chapterSoup.find_all(id='con2')  # nodes whose id is "con2" (the chapter body)
    chapterResult = ','.join(str(v) for v in chapterResult)  # serialize the matched nodes to a str
    chapterSoup2 = BeautifulSoup(chapterResult, 'html.parser')  # re-parse just that fragment
    # print(chapterSoup2.contents)
    chapterText = chapterSoup2.get_text()  # plain text of the chapter
    print(chapterText)
    f.write(bookMenu[i])  # write the chapter title
    f.write(chapterText)
f.close()  # flush and close the output file
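
The script builds each chapter URL by gluing the site root onto the href string. That works only while every href is root-relative (starts with "/"). A slightly more forgiving approach, sketched here with the standard library's urllib.parse.urljoin, resolves each href against the page it was found on. The helper name resolve_chapter_urls and the sample href values are made up for illustration; the real page may use a different link layout.

```python
from urllib.parse import urljoin

# Table-of-contents page used by the script above.
BASE_URL = "http://www.shicimingju.com/book/sanguoyanyi.html"


def resolve_chapter_urls(base_url, hrefs):
    """Resolve each href from the chapter list against the page it came from."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical href values, one per link style the helper should handle.
sample_hrefs = [
    "/book/sanguoyanyi/1.html",                            # root-relative
    "sanguoyanyi/2.html",                                  # relative to the base page
    "http://www.shicimingju.com/book/sanguoyanyi/3.html",  # already absolute
]

chapter_urls = resolve_chapter_urls(BASE_URL, sample_hrefs)
for chapter_url in chapter_urls:
    print(chapter_url)
```

All three sample hrefs resolve to a URL under http://www.shicimingju.com/book/sanguoyanyi/, whereas plain string concatenation would only get the first one right.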

