
Crawler Example: Scraping Tang Poems and Song Ci


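Every year I wait for summer to hurry into autumn. In a September without wooden horsetails, in a South without miracles of color, all I can do is talk my moods over with the old poems: count the clouds and mist, and mark the seasons from here.

Measure the area of autumn by wandering: beyond autumn there is still autumn.

Sound the depth of autumn by climbing: in the depths of autumn there is only me.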

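Basic analysis

1. The page markup of the poetry site shows that poem bodies come in two layouts: one separates lines with <p> tags, the other with <br> tags.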

from lxml import etree

s = """
<div id="contson47919" class="contson">
<p>aaa<br>a</p>
<p>bb</p>
<p>c</p>
</div>
"""
selector = etree.HTML(s)
# s = selector.xpath('//*[@class="contson"]/p')          # returns the <p> elements, not their text
# s = selector.xpath('string(//*[@class="contson"]/p)')  # aaaa (string value of the first <p> only)
s = selector.xpath('string(//*[@class="contson"])')      # aaaa \n bb \n c
# print(list(s))  # ['\n', 'a', 'a', 'a', 'a', '\n', 'b', 'b', '\n', 'c', '\n']
s = [i for i in s if i != '\n']  # ['a', 'a', 'a', 'a', 'b', 'b', 'c']
s = ''.join(s)
print(type(s))  # <class 'str'>
print(s)        # aaaabbc

import re

s = """
<div id="contson47919" class="contson">
<p>  環滁皆山也。其西南諸峰,林壑尤美,望之蔚然而深秀者,瑯琊也。山行六七裏,漸聞水聲潺潺而瀉出於兩峰之間者,釀泉也。峰回路轉,有亭翼然臨於泉上者,醉翁亭也。作亭者誰?山之僧智仙也。名之者誰?太守自謂也。太守與客來飲於此,飲少輒醉,而年又最高,故自號曰醉翁也。醉翁之意不在酒,在乎山水之間也。山水之樂,得之心而寓之酒也。</p>
<p>  若夫日出而林霏開,雲歸而巖穴暝,晦明變化者,山間之朝暮也。野芳發而幽香,佳木秀而繁陰,風霜高潔,水落而石出者,山間之四時也。朝而往,暮而歸,四時之景不同,而樂亦無窮也。</p>
<p>  至於負者歌於途,行者休於樹,前者呼,後者應,傴僂提攜,往來而不絕者,滁人遊也。臨溪而漁,溪深而魚肥。釀泉為酒,泉香而酒洌;山肴野蔌,雜然而前陳者,太守宴也。宴酣之樂,非絲非竹,射者中,弈者勝,觥籌交錯,起坐而喧嘩者,眾賓歡也。蒼顏白發,頹然乎其間者,太守醉也。</p>
<p>  已而夕陽在山,人影散亂,太守歸而賓客從也。樹林陰翳,鳴聲上下,遊人去而禽鳥樂也。然而禽鳥知山林之樂,而不知人之樂;人知從太守遊而樂,而不知太守之樂其樂也。醉能同其樂,醒能述以文者,太守也。太守謂誰?廬陵歐陽修也。</p>
</div>
<div id="contson49394" class="contson">
大江東去,浪淘盡,千古風流人物。 <br>
故壘西邊,人道是,三國周郎赤壁。 <br>
亂石穿空,驚濤拍岸,卷起千堆雪。 <br>
江山如畫,一時多少豪傑。 <br>
遙想公瑾當年,小喬初嫁了,雄姿英發。 <br>
羽扇綸巾,談笑間,檣櫓灰飛煙滅。(檣櫓 一作:強虜) <br>
故國神遊,多情應笑我,早生華發。 <br>
人生如夢,一尊還酹江月。(人生 一作:人間;尊 通:樽)
</div>
<div id="contson52821" class="contson">
<p> 尋尋覓覓,冷冷清清,淒淒慘慘戚戚。乍暖還寒時候,最難將息。三杯兩盞淡酒,怎敵他、晚來風急?雁過也,正傷心,卻是舊時相識。 <br>
滿地黃花堆積。憔悴損,如今有誰堪摘?守著窗兒,獨自怎生得黑?梧桐更兼細雨,到黃昏、點點滴滴。這次第,怎一個愁字了得!(守著窗兒 一作:守著窗兒) </p>
</div>
"""
# SFiltered = re.sub("<br/>", " ", s)
# SFiltered = re.sub("<p/>", " ", s)
SFiltered = re.sub(r'<.*?>', " ", s).strip()  # replace every tag with a space
print(SFiltered)  # the poem text only, as a str
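The two layouts can be handled by one code path, because XPath's string() concatenates every text node under an element regardless of whether the lines are split by <p> or <br>. The following is a minimal sketch under that assumption; the helper name poem_texts, the SAMPLE markup, and the ids contson1 and contson2 are made up for illustration and do not come from the site.

```python
from lxml import etree

# Hypothetical sample: one <p>-separated poem and one <br>-separated poem.
SAMPLE = """
<div id="contson1" class="contson"><p>aaa<br>a</p><p>bb</p></div>
<div id="contson2" class="contson">cc<br>dd</div>
"""


def poem_texts(html):
    """Map each contson div id to its text with all whitespace removed."""
    tree = etree.HTML(html)
    result = {}
    for div in tree.xpath('//div[@class="contson"]'):
        # string(.) joins all descendant text nodes, whatever tags separate them
        text = div.xpath('string(.)')
        result[div.get('id')] = ''.join(text.split())
    return result


print(poem_texts(SAMPLE))
```

Working per div keeps each poem as a single string, which avoids having to stitch text fragments back together later.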

Data scraping

Version 1: regex extraction

import re

links = '<a href="/type.aspx?p=1&c=%e5%94%90%e4%bb%a3">唐代</a><a href="/search.aspx?value=%e6%9d%8e%e7%99%bd">李白</a>'
author = re.findall(r'<a href="/search.aspx?value=% .*? >(.*?)</a>', links)
print(author)  # []

text = '<div id="contson71138" class="contson"> 山不在高,有仙則名。水不在深,有龍則靈。斯是陋室,惟吾德馨。苔痕上階綠,草色入簾青。談笑有鴻儒,往來無白丁。可以調素琴,閱金經。無絲竹之亂耳,無案牘之勞形。南陽諸葛廬,西蜀子雲亭。孔子雲:何陋之有? </div>'
patterns = [
    r'?<=<div id="contson\d+" class="contson"?<=>(.?*)?<=</div>',  # error: nothing to repeat
    r'<div id="contson\d+" class="contson">(.?*)</div>',            # error: multiple repeat
]
for pattern in patterns:
    try:
        song = re.findall(pattern, text)
        print(song)
    except re.error as err:
        # both attempts end up here; the pattern never compiles
        print(err)

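Note: extraction failed. In the author pattern the unescaped ? after .aspx is read as a quantifier and the stray space after % matches nothing in the link, so the result is empty. In the poem patterns, .?* should be the non-greedy .*?, and a lookbehind has to be written as (?<=...).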
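For comparison, here is a sketch of how the same two extractions could be written so that they compile and match. It is one possible repair, not the approach the post goes on to use; the names AUTHOR_RE and POEM_RE are made up for illustration, and the poem text is shortened.

```python
import re

links = ('<a href="/type.aspx?p=1&c=%e5%94%90%e4%bb%a3">唐代</a>'
         '<a href="/search.aspx?value=%e6%9d%8e%e7%99%bd">李白</a>')
text = '<div id="contson71138" class="contson"> 山不在高,有仙則名。 </div>'

# Escape the literal "." and "?" and let [^"]* consume the percent-encoded name.
AUTHOR_RE = re.compile(r'<a href="/search\.aspx\?value=[^"]*">(.*?)</a>')
# Non-greedy .*? stops at the first </div>; re.S lets "." span line breaks.
POEM_RE = re.compile(r'<div id="contson\d+" class="contson">(.*?)</div>', re.S)

authors = AUTHOR_RE.findall(links)
poems = [p.strip() for p in POEM_RE.findall(text)]

print(authors)
print(poems)
```

Regexes like these still break as soon as attribute order or nesting changes, which is a reasonable argument for moving to XPath.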

Version 2: XPath extraction

# coding=utf-8
import requests
from lxml import etree
import re
import pymongo

# conn = pymongo.MongoClient(host='localhost', port=27017)  # start MongoDB first, then create database poetry and collection poem
# poetry = conn['poetry']
# newdata = poetry['poem']

urllist = ['http://so.gushiwen.org/type.aspx?p={}&c=%e5%ae%8b%e4%bb%a3'.format(i) for i in range(1, 2)]
# print(urllist)


poemL = []

for url in urllist:
    print(url)
    html = requests.get(url).content
    # HtmlFiltered = html.replace('\n', '').replace('\r', '')
    # HtmlFiltered = re.sub(r'<.*?>', " ", HtmlFiltered).strip()
    tree = etree.HTML(html)
    rez = tree.xpath('//*[@class="sons"]')
    print('aaaeee')  # debug marker
    for i in rez:
        # A path that starts with // is absolute: it searches the whole page, not just node i
        poem1 = i.xpath('//*[@class="contson"]/p/text()')
        poem2 = i.xpath('//*[@class="contson"]/text()')

        title = i.xpath('//*[@class="cont"]/p[1]/a/b/text()')
        dynasty = i.xpath('//*[@class="source"]/a[1]/text()')  # XPath indexes start at 1
        author = i.xpath('//*[@class="source"]/a[2]/text()')
    poemL.extend(poem1)
    poemL.extend(poem2)
    poemL = [i for i in poemL if i != '\n']
    # poemL = ''.join(poemL)
    # print(poemL)


print('Poems:')
count = 0
for i in poemL:
    count += 1
    print(count, i)

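Output: each page holds ten poems, yet len(poemL) is 32, because text() returns one list item per text node (each <p> or <br>-separated line) rather than one per poem.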
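The mismatch between ten poems and 32 list items is consistent with two XPath behaviors: a path beginning with // searches the whole document even when called on a sub-element, and text() yields every text node separately. The sketch below shows both effects on a made-up two-poem page (the PAGE markup and ids are assumptions for illustration) and contrasts them with a relative path combined with string().

```python
from lxml import etree

# Hypothetical page: two poem blocks, one <br>-separated and one <p>-separated.
PAGE = """
<div class="sons"><div class="contson" id="contson1">a<br>b</div></div>
<div class="sons"><div class="contson" id="contson2"><p>c</p><p>d</p></div></div>
"""

tree = etree.HTML(PAGE)
blocks = tree.xpath('//*[@class="sons"]')

# Absolute path called on the first block: still collects text from BOTH poems,
# one item per text node.
absolute = blocks[0].xpath('//*[@class="contson"]//text()')

# Relative path (leading dot) plus string(): one string per block.
relative = [b.xpath('string(.//*[@class="contson"])') for b in blocks]

print(len(blocks), len(absolute), len(relative))
print(absolute)
print(relative)
```

With the leading dot, the number of extracted poems matches the number of blocks, so titles, authors and bodies can be paired up safely.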


Improved version

# coding=utf-8
import requests
from lxml import etree
import pymongo

# Start MongoDB first, then create database poetry and its collection
conn = pymongo.MongoClient(host='localhost', port=27017)
poetry = conn['poetry']
newdata = poetry['poemSong']  # collection name as in the original; rename it to match the dynasty being crawled

urllist = ['http://so.gushiwen.org/type.aspx?p={}&c=%e5%94%90%e4%bb%a3'.format(i) for i in range(1, 501)]
# print(urllist)


for url in urllist:
    print(url)
    html = requests.get(url).content
    tree = etree.HTML(html)
    rez = tree.xpath('//*[@class="left"]')
    title, dynasty, author = [], [], []
    for i in rez:
        title = i.xpath('//*[@class="cont"]/p[1]/a/b/text()')
        dynasty = i.xpath('//*[@class="source"]/a[1]/text()')
        author = i.xpath('//*[@class="source"]/a[2]/text()')

    poem = []
    # Iterate over the matches instead of assuming exactly ten poems per page
    for result in tree.xpath('//div[@class="contson"][starts-with(@id,"contson")]'):
        info = result.xpath('string(.)')
        content = info.replace('\n', '').replace(' ', '')
        # print(content)
        # print(type(content))  # <class 'str'>
        print(len(content))
        poem.append(content)

    # Print the first record of the page as a spot check
    for i, j, m, n in zip(title, dynasty, author, poem):
        print(i, '|', j, '|', m, '|', n)
        print('\n')
        break

    # Write to file: open once per page, one line per poem
    with open(r"C:\Users\HP\Desktop\codes\DATA\poemTang.csv", 'a+', encoding='utf-8') as file:
        for row in zip(title, dynasty, author, poem):
            file.write(','.join(row) + '\n')

    # Store in the database (one document per page, each field a list)
    data = {'title': title,
            'author': author,
            'dynasty': dynasty,
            'poem': poem}
    newdata.insert_one(data)

print('succeed')

Note: the poem bodies are selected by their contson id, and the id is matched with the XPath function starts-with().

Output:
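As a small illustration of the starts-with() predicate, the sketch below filters divs by id prefix on a made-up fragment. The id yizhu71138 is a hypothetical non-poem block invented for the example; only the contson prefix comes from the post.

```python
from lxml import etree

# Hypothetical fragment: two poem bodies and one block with a different id prefix.
PAGE = """
<div id="contson71138" class="contson">first</div>
<div id="yizhu71138" class="contson">skip</div>
<div id="contson49394" class="contson">second</div>
"""

tree = etree.HTML(PAGE)

# Keep only divs whose id begins with "contson", even though all share the class.
nodes = tree.xpath('//div[@class="contson"][starts-with(@id, "contson")]')
poems = {n.get('id'): n.xpath('string(.)').strip() for n in nodes}

print(poems)
```

Note that the XPath 1.0 function is spelled starts-with, with a hyphen; there is no matching ends-with in XPath 1.0.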


5,000 Tang poems and 5,000 Song ci, all carried home. A very satisfying haul, hahaha.
