
Scraping Qiushibaike with a Python Crawler (xpath + re)

Scrape Qiushibaike and extract the data with xpath and re.

===================================================


'''
Scrape Qiushibaike; find the page yourself.
Analysis:
1. Use requests to fetch the page, and xpath and re to extract the data
2. Information available: user avatar link, joke content, number of likes, number of positive ratings
3. Save to a JSON file

Roughly three parts:
1. Download the page
2. Extract the information with xpath
3. Save it to a file
'''

import requests
from lxml import etree

url = "https://www.qiushibaike.com/8hr/page/1/"
headers = {
    "User-Agent": 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36',
    "Accept": 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    "Accept-Language": 'zh-CN,zh;q=0.9'
}

# Download the page
rsp = requests.get(url, headers=headers)
html = rsp.text

# Parse the page text into an HTML element tree
html = etree.HTML(html)
print(html.text)
# Each post sits in a div whose id contains "qiushi_tag"
rst = html.xpath('//div[contains(@id, "qiushi_tag")]')

for r in rst:
    print(r.text)
    item = {}
    print(r)

    # The leading "." keeps the query relative to r; a bare "//" would
    # search the whole document and return the first post every time
    content = r.xpath('.//div[@class="content"]/span')[0].text
    print(content)
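
The plan in the docstring also lists extracting numbers with re and saving to a JSON file, which the script above does not reach yet. Below is a minimal sketch of those two steps, using only the standard library. The sample strings in `raw_items`, the helper names `parse_counts`, `build_item` and `save_items`, the field names, and the output file name `qiushi.json` are all assumptions made for illustration; the real page text may be laid out differently.

```python
import json
import re

# Hypothetical raw strings, standing in for text already pulled out with xpath
raw_items = [
    {"content": "\n\nfirst joke\n\n", "stats": "1234 funny, 56 comments"},
    {"content": "\nsecond joke\n", "stats": "7 funny"},
]


def parse_counts(stats):
    """Return every integer found in a stats string, in order of appearance."""
    return [int(n) for n in re.findall(r"\d+", stats)]


def build_item(raw):
    """Turn one raw record into a clean dict; missing counts default to 0."""
    counts = parse_counts(raw["stats"])
    return {
        "content": raw["content"].strip(),
        "votes": counts[0] if counts else 0,
        "comments": counts[1] if len(counts) > 1 else 0,
    }


def save_items(items, path):
    """Write the items to a JSON file."""
    # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)


items = [build_item(raw) for raw in raw_items]
save_items(items, "qiushi.json")
```

In the scraping loop, the same idea would apply per post: fill `item` with the cleaned fields, append it to a list, and call `save_items` once after the loop finishes.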