Crawler Example: Scraping Tang Poems and Song Ci
阿新 • Published: 2017-10-09
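Every year I wait for summer to hurry into autumn. With no September of wood and horsetails and no south of miraculous colors, I can only talk my moods over with classical poetry and, counting the clouds and mist, learn the seasons from here.
Measure autumn's area by wandering: outside autumn there is still autumn.
Probe autumn's depth by climbing: in the depths of autumn there is only me.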
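Basic analysis
1. The page structure of the poetry site shows that the body of a poem comes in two layouts: one separated by p tags and one separated by br tags.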
from lxml import etree

s = """
<div id="contson47919" class="contson">
<p>aaa<br>a</p>
<p>bb</p>
<p>c</p>
</div>"""
selector = etree.HTML(s)
#s = selector.xpath('//*[@class="contson"]/p')          # aaa
#s = selector.xpath('string(//*[@class="contson"]/p)')  # aaaa
s = selector.xpath('string(//*[@class="contson"])')     # aaaa \n bb \n c
#print list(s)  # ['\n', 'a', 'a', 'a', 'a', '\n', 'b', 'b', '\n', 'c', '\n']
s = [i for i in s if i != '\n']  # ['a', 'a', 'a', 'a', 'b', 'b', 'c']
s = ''.join(s)
print type(s)  # <type 'str'>
print s        # aaaabbc

import re

s = """
<div id="contson47919" class="contson">
<p> 環滁皆山也。其西南諸峰,林壑尤美,望之蔚然而深秀者,瑯琊也。山行六七裏,漸聞水聲潺潺而瀉出於兩峰之間者,釀泉也。峰回路轉,有亭翼然臨於泉上者,醉翁亭也。作亭者誰?山之僧智仙也。名之者誰?太守自謂也。太守與客來飲於此,飲少輒醉,而年又最高,故自號曰醉翁也。醉翁之意不在酒,在乎山水之間也。山水之樂,得之心而寓之酒也。</p>
<p> 若夫日出而林霏開,雲歸而巖穴暝,晦明變化者,山間之朝暮也。野芳發而幽香,佳木秀而繁陰,風霜高潔,水落而石出者,山間之四時也。朝而往,暮而歸,四時之景不同,而樂亦無窮也。</p>
<p> 至於負者歌於途,行者休於樹,前者呼,後者應,傴僂提攜,往來而不絕者,滁人遊也。臨溪而漁,溪深而魚肥。釀泉為酒,泉香而酒洌;山肴野蔌,雜然而前陳者,太守宴也。宴酣之樂,非絲非竹,射者中,弈者勝,觥籌交錯,起坐而喧嘩者,眾賓歡也。蒼顏白發,頹然乎其間者,太守醉也。</p>
<p> 已而夕陽在山,人影散亂,太守歸而賓客從也。樹林陰翳,鳴聲上下,遊人去而禽鳥樂也。然而禽鳥知山林之樂,而不知人之樂;人知從太守遊而樂,而不知太守之樂其樂也。醉能同其樂,醒能述以文者,太守也。太守謂誰?廬陵歐陽修也。</p>
</div>
<div id="contson49394" class="contson">
大江東去,浪淘盡,千古風流人物。 <br>
故壘西邊,人道是,三國周郎赤壁。 <br>
亂石穿空,驚濤拍岸,卷起千堆雪。 <br>
江山如畫,一時多少豪傑。 <br>
遙想公瑾當年,小喬初嫁了,雄姿英發。 <br>
羽扇綸巾,談笑間,檣櫓灰飛煙滅。(檣櫓 一作:強虜) <br>
故國神遊,多情應笑我,早生華發。 <br>
人生如夢,一尊還酹江月。(人生 一作:人間;尊 通:樽)
</div>
<div id="contson52821" class="contson">
<p> 尋尋覓覓,冷冷清清,淒淒慘慘戚戚。乍暖還寒時候,最難將息。三杯兩盞淡酒,怎敵他、晚來風急?雁過也,正傷心,卻是舊時相識。 <br>
滿地黃花堆積。憔悴損,如今有誰堪摘?守著窗兒,獨自怎生得黑?梧桐更兼細雨,到黃昏、點點滴滴。這次第,怎一個愁字了得!(守著窗兒 一作:守著窗兒) </p>
</div>"""
#SFiltered = re.sub("<br/>", " ",s)
#SFiltered = re.sub("<p/>", " ",s)
# Replace every tag with a space; the result is a plain str
SFiltered = re.sub(r'\<.*?\>'," ",s).strip()
print SFiltered
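The tag-stripping regex above flattens both layouts into one run of text. As a minimal sketch (written for Python 3, whereas the article's own listings are Python 2), the helper below keeps the line structure by turning br tags and closing p tags into newlines before removing the remaining tags. The function name contson_to_text and the two sample fragments are made up for illustration and are not part of the original script.

```python
import re


def contson_to_text(fragment):
    """Convert the HTML of one poem block to plain text, one line per <p>/<br>."""
    # Treat <br>, <br/> and closing </p> as line breaks.
    text = re.sub(r'<br\s*/?>|</p\s*>', '\n', fragment, flags=re.IGNORECASE)
    # Drop every remaining tag.
    text = re.sub(r'<[^>]+>', '', text)
    # Trim each line and discard the empty ones.
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)


# Hypothetical fragments, one per layout described in the analysis.
p_layout = '<div class="contson"><p>aaa<br>a</p> <p>bb</p> <p>c</p></div>'
br_layout = '<div class="contson"> first line <br> second line </div>'

print(contson_to_text(p_layout))
print(contson_to_text(br_layout))
```

Unlike string() in XPath, this treats a br inside a p as a line break, which may or may not be what you want for prose pieces.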
Data scraping
Version 1: regex extraction
import re

links = '<a href="/type.aspx?p=1&c=%e5%94%90%e4%bb%a3">唐代</a><a href="/search.aspx?value=%e6%9d%8e%e7%99%bd">李白</a>'
author = re.findall(r'<a href="/search.aspx?value=% .*? >(.*?)</a>',links)
print author  # []

text = '<div id="contson71138" class="contson"> 山不在高,有仙則名。水不在深,有龍則靈。斯是陋室,惟吾德馨。苔痕上階綠,草色入簾青。談笑有鴻儒,往來無白丁。可以調素琴,閱金經。無絲竹之亂耳,無案牘之勞形。南陽諸葛廬,西蜀子雲亭。孔子雲:何陋之有? </div>'
song = re.findall(r'?<=<div id="contson\d+" class="contson"?<=>(.?*)?<=</div>',text)  # raises an error: nothing to repeat
song = re.findall(r'<div id="contson\d+" class="contson">(.?*)</div>',text)  # raises an error: multiple repeat
print song
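Note: extraction failed. The first pattern leaves the ? in search.aspx?value unescaped and contains literal spaces that never occur in the HTML, so it matches nothing. The other two patterns are invalid: ?<= is used without its enclosing parentheses, and .?* is written where the non-greedy .*? was intended.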
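For comparison, here is a sketch of patterns that do match the same sample strings, written for Python 3 (the article's own listings are Python 2). The sample poem text is shortened, and the variable names author_re and poem_re are illustrative only. It assumes the attribute order and quoting seen in the samples above.

```python
import re

links = ('<a href="/type.aspx?p=1&c=%e5%94%90%e4%bb%a3">唐代</a>'
         '<a href="/search.aspx?value=%e6%9d%8e%e7%99%bd">李白</a>')

# Escape the literal "." and "?", and let [^"]* consume the URL-encoded value.
author_re = re.compile(r'<a href="/search\.aspx\?value=[^"]*">(.*?)</a>')
authors = author_re.findall(links)

text = '<div id="contson71138" class="contson"> 山不在高,有仙則名。 </div>'

# ".*?" is the non-greedy form; DOTALL lets "." span line breaks in the poem.
poem_re = re.compile(r'<div id="contson\d+" class="contson">(.*?)</div>', re.DOTALL)
poems = [p.strip() for p in poem_re.findall(text)]

print(authors)
print(poems)
```

Any p or br tags inside the div stay in the captured text, so this only covers the simplest case; that limitation is one reason to prefer an HTML parser.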
Version 2: XPath extraction
#coding=utf-8
import requests
from lxml import etree
import sys
import re
import pymongo

reload(sys)
sys.setdefaultencoding('utf-8')

#conn = pymongo.MongoClient(host='localhost',port=27017)
# Start MongoDB first, then create the database "poetry" and the collection "poem"
#poetry = conn['poetry']
#newdata = poetry['poem']

urllist = ['http://so.gushiwen.org/type.aspx?p={}&c=%e5%ae%8b%e4%bb%a3'.format(i) for i in range(1,2)]
#print urllist
poemL = []
for url in urllist:
    print url
    html = requests.get(url).content
    #HtmlFiltered = html.replace('\n','').replace('\r','')
    #HtmlFiltered = re.sub(r'\<.*?\>'," ",HtmlFiltered).strip()
    tree = etree.HTML(html)
    rez = tree.xpath('//*[@class="sons"]')
    print 'aaaeee'
    for i in rez:
        poem1 = i.xpath('//*[@class="contson"]/p/text()')
        poem2 = i.xpath('//*[@class="contson"]/text()')
        title = i.xpath('//*[@class="cont"]/p[1]/a/b/text()')
        dynasty = i.xpath('//*[@class="source"]/a[1]/text()')  # XPath indices start at 1
        author = i.xpath('//*[@class="source"]/a[2]/text()')
        poemL.extend(poem1)
        poemL.extend(poem2)
poemL = [i for i in poemL if i != '\n']
#poemL = ''.join(poemL)
#print poemL
print u'Poems:'
count = 0
for i in poemL:
    count +=1
    print count, i
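Output: each page holds ten poems, yet len(poemL) comes out as 32. The likely cause is that text() returns one string per text node, so a poem split by p or br tags contributes several list items, and the paths beginning with // search the whole document rather than the current element.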
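The mismatch between poems and list items can be reproduced on a small scale. The sketch below is Python 3 with lxml and uses a made-up two-poem fragment that only mimics the class names used in the listing. It contrasts the document-wide text() queries with relative paths (note the leading dot) combined with string(), which yields one string per poem.

```python
from lxml import etree

# Hypothetical page fragment: two poems, one in each layout.
html = """
<div class="sons">
  <div class="cont"><p><a><b>Poem A</b></a></p></div>
  <div id="contson1" class="contson"><p>line one</p><p>line two</p></div>
</div>
<div class="sons">
  <div class="cont"><p><a><b>Poem B</b></a></p></div>
  <div id="contson2" class="contson">line one<br>line two<br>line three</div>
</div>
"""
tree = etree.HTML(html)

# Document-wide text() queries: one item per text node, not per poem.
fragments = (tree.xpath('//*[@class="contson"]/p/text()')
             + tree.xpath('//*[@class="contson"]/text()'))

poems = []
for block in tree.xpath('//*[@class="sons"]'):
    # The leading "." keeps each query inside the current block.
    title = block.xpath('string(.//*[@class="cont"]/p[1]/a/b)')
    body = block.xpath('string(.//*[@class="contson"])')
    poems.append((title, body))

print(len(fragments))  # more items than poems
print(len(poems))      # one entry per poem
```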
Improved version:
#coding=utf-8
import requests
from lxml import etree
import sys
import re
import pymongo

reload(sys)
sys.setdefaultencoding('utf-8')

conn = pymongo.MongoClient(host='localhost',port=27017)
# Start MongoDB first, then create the database "poetry" and the collection "poem"
poetry = conn['poetry']
newdata = poetry['poemSong']

urllist = ['http://so.gushiwen.org/type.aspx?p={}&c=%e5%94%90%e4%bb%a3'.format(i) for i in range(1,501)]
#print urllist
for url in urllist:
    print url
    html = requests.get(url).content
    tree = etree.HTML(html)
    rez = tree.xpath('//*[@class="left"]')
    for i in rez:
        title = i.xpath('//*[@class="cont"]/p[1]/a/b/text()')
        dynasty = i.xpath('//*[@class="source"]/a[1]/text()')
        author = i.xpath('//*[@class="source"]/a[2]/text()')
        poem = []
        for i in range(10):
            result = tree.xpath('//div[@class="contson"][starts-with(@id,"contson")]')[i]
            info = result.xpath('string(.)')
            content = info.replace('\n','').replace(' ','')
            #print content
            #print type(content)  # <type 'unicode'>
            print len(content)
            poem.append(content)
            #break
        # Print the first record as a check
        for i,j,m,n in zip (title,dynasty,author,poem):
            print i,'|',j,'|',m,'|',n
            print '\n'
            break
        # Write to file
        for i in list(range(0,len(title))):
            text =','.join((title[i],dynasty[i],author[i],poem[i]))+'\n'
            with open(r"C:\Users\HP\Desktop\codes\DATA\poemTang.csv",'a+') as file:
                file.write(text+' ')
        # Store in the database
        data = {'title':title,
                'author':author,
                'dynasty':dynasty,
                'poem':poem}
        newdata.insert_one(data)
print 'succeed'
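The listing above joins fields with bare commas and stores the four lists of a page as a single MongoDB document. One possible alternative, sketched here in Python 3 with the standard csv module, is to build one record per poem and let csv.writer handle quoting. The sample lists are stand-ins for title, dynasty, author, and poem, and the in-memory buffer replaces the real output file; neither is taken from the original script.

```python
import csv
import io

# Hypothetical parallel lists, shaped like the ones the scraper collects per page.
title = ['靜夜思', '春曉']
dynasty = ['唐代', '唐代']
author = ['李白', '孟浩然']
poem = ['床前明月光,疑是地上霜。', '春眠不覺曉,處處聞啼鳥。']

# One dict per poem; a list like this could be passed to pymongo's insert_many.
records = [
    {'title': t, 'dynasty': d, 'author': a, 'poem': p}
    for t, d, a, p in zip(title, dynasty, author, poem)
]

# csv.writer quotes any field that contains a comma or a line break.
buffer = io.StringIO()  # stand-in for open(path, 'a', newline='', encoding='utf-8')
writer = csv.writer(buffer)
for r in records:
    writer.writerow([r['title'], r['dynasty'], r['author'], r['poem']])

# Read the data back to confirm each poem stays in a single column.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(len(records), len(rows))
```

With one document per poem, queries by author or title become straightforward, and the CSV can be reopened without poems splitting across columns.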
Note: the poem elements are selected by their contson id, and the id is matched with the XPath function starts-with().
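A minimal Python 3 sketch of that selection with lxml follows. The fragment is made up (the third div exists only to show an element that shares the class but not the id prefix), and the query is run once and iterated rather than re-run and indexed from 0 to 9.

```python
from lxml import etree

# Hypothetical fragment: two poem bodies plus an unrelated div with the same class.
html = """
<div id="contson47919" class="contson"><p>aaa<br>a</p><p>bb</p></div>
<div id="contson49394" class="contson">cc<br>dd</div>
<div id="sidebar" class="contson">not a poem</div>
"""
tree = etree.HTML(html)

# starts-with(@id, "contson") keeps only elements whose id begins with that prefix.
blocks = tree.xpath('//div[@class="contson"][starts-with(@id, "contson")]')

# string(.) concatenates all descendant text, whichever layout the poem uses.
poems = [b.xpath('string(.)').replace('\n', '').replace(' ', '') for b in blocks]
ids = [b.get('id') for b in blocks]

print(ids)
print(poems)
```

Iterating over the result list also avoids an IndexError on pages that hold fewer than ten poems.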
Output:
5,000 Tang poems and 5,000 Song ci, all brought home. A real sense of accomplishment, ha ha.