
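Python Study Notes (24): Using XPath to Parse Pages and Scrape Model Photos

XPath was originally designed for navigating XML, and it works equally well on HTML documents. It is usually more convenient than regular expressions.

Basic XPath rules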

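nodename   selects the child elements of the current node that are named nodename
/          selects direct children of the current node
//         selects descendants of the current node at any depth
.          the current node
..         the parent of the current node
@          selects an attribute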
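The rules above can be tried out on a small document. The following is a minimal sketch; the markup is a hypothetical miniature page made up for illustration, not part of the site scraped later.

```python
from lxml import etree

# Hypothetical miniature document, used only to illustrate the rules.
doc = etree.HTML('''
<div id="outer">
  <p class="intro">first</p>
  <div class="inner">
    <p class="deep">second</p>
  </div>
</div>
''')

outer = doc.xpath('//div[@id="outer"]')[0]

# "/"  : direct children only, so just the first <p>
direct = outer.xpath('./p')
# "//" : descendants at any depth, so both <p> elements
nested = outer.xpath('.//p')
# ".." : step up to the parent of the nested <p>, then read its class
parent_class = doc.xpath('//p[@class="deep"]/../@class')
# "@"  : read an attribute value from every <p>
classes = doc.xpath('//p/@class')

print(len(direct), len(nested), parent_class, classes)
```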

 

Here is an example:

text = '''
<div class="bus_vtem">
        <a href="https://www.aisinei.org/thread-17826-1-1.html" title="XINGYAN星顏社 2018.11.09 VOL.096 唐思琪 [47+1P]" class="preview"  target="_blank">
        <img src="https://i.asnpic.win/block/74/74eab64cfa4229d58c19a64970368178.jpg" width="250" height="375" alt="XINGYAN星顏社 2018.11.09 VOL.096 唐思琪 [47+1P]"/>
                <span class="bus_listag">XINGYAN星顏社</span>
        </a>
        <a href="https://www.aisinei.org/thread-17826-1-1.html" title="XINGYAN星顏社 2018.11.09 VOL.096 唐思琪 [47+1P]"  target="_blank">
            <div class="lv-face"><img src="https://www.aisinei.org/uc_server/avatar.php?uid=2&size=small" alt="釋出組小樂"/></div>
            <div class="t">XINGYAN星顏社 2018.11.09 VOL.096 唐思琪 </div>
            <div class="i"><span><i class="bus_showicon bus_showicon_v"></i>5401</span><span><i class="bus_showicon bus_showicon_r"></i>1</span></div>
        </a>
    </div>
'''
from lxml import etree

html = etree.HTML(text)
result = etree.tostring(html)
# Print the serialized tree; lxml fills in missing tags if the HTML is incomplete
print(result.decode('utf-8'))
# Select every element in the document
result2 = html.xpath('//*')
print(result2)
result3 = html.xpath('//a[@class="preview"]')
print(result3)

 

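etree.HTML(text) parses the string and fills in any missing HTML structure; etree.tostring(html) serializes the repaired tree to bytes, and result.decode('utf-8') turns those bytes into a str.
html.xpath('//*') selects every element in the document.
html.xpath('//a[@class="preview"]') selects, among all elements in the document, the a nodes whose class attribute is preview.

lxml can also read from a file: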
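To see where the repair actually happens, here is a small sketch with a deliberately incomplete fragment (the fragment is made up for illustration). The parse step adds the missing structure; decoding only converts bytes to text.

```python
from lxml import etree

# Deliberately incomplete: no <html>/<body>, and the <div> is never closed.
fragment = '<div class="box"><a href="/x">link</a>'

root = etree.HTML(fragment)      # parsing is where the structure gets repaired
raw = etree.tostring(root)       # serialized tree, as bytes
markup = raw.decode('utf-8')     # bytes -> str, no structural change

print(type(raw).__name__, type(markup).__name__)
print(markup)
```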

from lxml import etree
html = etree.parse('./test.html',etree.HTMLParser())
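As a rough, self-contained illustration of the file-based variant, the sketch below writes a throwaway HTML file first so that it does not depend on ./test.html existing. The file content is an assumption made up for the example.

```python
import os
import tempfile
from lxml import etree

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<div class="bus_vtem"><a class="preview" href="/t1">t1</a></div>')

    # etree.parse returns an ElementTree; xpath() works on it directly.
    tree = etree.parse(path, etree.HTMLParser())
    root_tag = tree.getroot().tag
    hrefs = tree.xpath('//a[@class="preview"]/@href')

print(root_tag, hrefs)
```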

 

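Selecting child nodes with lxml

from lxml import etree
html = etree.HTML(text)
# Select the a elements that are direct children of the div with class bus_vtem
result = html.xpath('//div[@class="bus_vtem"]/a')
print(result)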

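Selecting parent nodes

from lxml import etree
html = etree.HTML(text)
result = html.xpath('//a[@class="preview"]/../@class')
print(result)

This first finds the a node whose class attribute is preview, then moves up to its parent, and finally selects the parent's class attribute. The printed result is ['bus_vtem'].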
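The parent step can be written in more than one way. The sketch below, using a cut-down version of the sample markup, compares the .. shorthand with the explicit parent:: axis and with lxml's getparent() method.

```python
from lxml import etree

# Cut-down version of the sample markup, for illustration only.
snippet = '<div class="bus_vtem"><a class="preview" href="/t1">t1</a></div>'
html = etree.HTML(snippet)

# ".." is shorthand for the parent:: axis.
via_dots = html.xpath('//a[@class="preview"]/../@class')
via_axis = html.xpath('//a[@class="preview"]/parent::*/@class')

# The same navigation through the element API instead of XPath.
a = html.xpath('//a[@class="preview"]')[0]
via_api = a.getparent().get('class')

print(via_dots, via_axis, via_api)
```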

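Attribute matching

As shown above, the format is: nodename[@attribute="value"]

Getting attributes

As shown above, the format is: nodename/@attribute. Note that there are no square brackets here.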
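The difference between the two forms shows up in what xpath() returns: matching yields elements, while getting yields strings. A minimal sketch on made-up markup:

```python
from lxml import etree

html = etree.HTML(
    '<div><a class="preview" href="/t1" title="one">1</a>'
    '<a href="/t2" title="two">2</a></div>'
)

# Matching: the predicate filters elements, and the result is a list of elements.
matched = html.xpath('//a[@class="preview"]')
# Getting: the trailing /@href step returns the attribute values as strings.
hrefs = html.xpath('//a/@href')
# @* returns every attribute value of the matched element.
all_values = html.xpath('//a[@class="preview"]/@*')

# Elements returned by a match also expose attributes through the lxml API.
first = matched[0]
title = first.get('title')
attrs = dict(first.attrib)

print(len(matched), hrefs, title, attrs)
```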

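Matching an attribute with multiple values

In the example above, the class attribute of the div holds a single value, bus_vtem. If a second value such as mtest is added, an exact match like [@class="mtest"] no longer matches and returns an empty list (it does not raise an error), so the predicate has to use contains() instead: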

from lxml import etree
text2 = '''
        <div class="bus_vtem  mtest"> hurricane!
        </div>
    '''
html2 = etree.HTML(text2)    
result5 = html2.xpath('//*[contains(@class, "mtest")]')
# Wrong: an exact match compares against the whole attribute value and returns []
#result5 = html2.xpath('//*[@class="mtest"]')
print(result5)
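One caveat worth knowing: contains() is a plain substring test, so it would also match a class such as mtest2. A common XPath 1.0 idiom for matching a whole class token pads the attribute with spaces first. The sketch below compares the two on made-up markup; the mtest2 class is hypothetical.

```python
from lxml import etree

html = etree.HTML('''
<div class="bus_vtem  mtest">a</div>
<div class="bus_vtem mtest2">b</div>
''')

# Substring test: matches both, because "mtest2" contains "mtest".
loose = html.xpath('//div[contains(@class, "mtest")]/text()')

# Token test: normalize whitespace, pad with spaces, then look for " mtest ".
strict = html.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " mtest ")]/text()'
)

# Exact comparison against the whole attribute value: matches neither.
exact = html.xpath('//div[@class="mtest"]/text()')

print(loose, strict, exact)
```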

 

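Matching on multiple attributes

Matching on several attributes at once is handy for narrowing down to a single node. The individual conditions can be combined with and and or, and compared with = and != (XPath uses a single =, not ==).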

from lxml import etree
text3 = '''
        <div class="bus_vtem mtest" name="hurricane"> hurricane!
        </div>
        <div class="bus_vtem mtest" name = "tornado"> tornado!
        </div>
    '''
html3 = etree.HTML(text3)    
result6 = html3.xpath('//*[contains(@class, "mtest") and @name="hurricane"]/text()')
print(result6)
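For comparison, here is a small sketch that exercises and, or and != side by side, and also checks what happens with ==. The third div is a hypothetical addition so that or and != have something to distinguish.

```python
from lxml import etree

html = etree.HTML('''
<div class="bus_vtem mtest" name="hurricane">hurricane!</div>
<div class="bus_vtem mtest" name="tornado">tornado!</div>
<div class="other" name="breeze">breeze!</div>
''')

both = html.xpath('//div[contains(@class, "mtest") and @name="hurricane"]/text()')
either = html.xpath('//div[@name="hurricane" or @name="breeze"]/text()')
not_hurricane = html.xpath('//div[@name!="hurricane"]/text()')

# "==" is not valid XPath; lxml is expected to reject the expression.
try:
    html.xpath('//div[@name=="hurricane"]')
    double_equals_ok = True
except etree.XPathError:
    double_equals_ok = False

print(both, either, not_hurricane, double_equals_ok)
```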

 

Getting text

Append /text() to the node expression, for example:
result6 = html3.xpath('//*[contains(@class, "mtest") and @name="hurricane"]/text()')
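Note that /text() only returns the text nodes that are direct children of the element. When the element contains nested tags, //text() or string() may be closer to what is wanted. A minimal sketch, on markup adapted from the sample (the b tag is added for illustration):

```python
from lxml import etree

html = etree.HTML('<div class="t">XINGYAN <b>VOL.096</b> tail</div>')

# /text(): only text directly inside the div, skipping the <b> content.
direct = html.xpath('//div[@class="t"]/text()')
# //text(): text nodes of the div and all of its descendants.
all_text = html.xpath('//div[@class="t"]//text()')
# string(): the concatenated text of the first matching node, as one string.
joined = html.xpath('string(//div[@class="t"])')

print(direct, all_text, joined)
```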

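Putting together the requests and cookie handling covered in earlier posts with today's lxml material, the following code scrapes the links to the newest photo sets published on Aisinei (艾絲):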

import requests
import re
import time
from lxml import etree

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
COOKIES = r'__cfduid=d78f862232687ba4aae00f617c0fd1ca81537854419; bg5D_2132_saltkey=jh7xllgK; bg5D_2132_lastvisit=1540536781; bg5D_2132_auth=479fTpQgthFjwwD6V1Xq8ky8wI2dzxJkPeJHEZyv3eqJqdTQOQWE74ttW1HchIUZpgsyN5Y9r1jtby9AwfRN1R89; bg5D_2132_lastcheckfeed=7469%7C1541145866; bg5D_2132_st_p=7469%7C1541642338%7Cda8e3f530a609251e7b04bfc94edecec; bg5D_2132_visitedfid=52; bg5D_2132_viewid=tid_14993; bg5D_2132_ulastactivity=caf0lAnOBNI8j%2FNwQJnPGXdw6EH%2Fj6DrvJqB%2Fvv6bVWR7kjOuehk; bg5D_2132_smile=1D1; bg5D_2132_seccode=22485.58c095bd095d57b101; bg5D_2132_lip=36.102.208.214%2C1541659184; bg5D_2132_sid=mElHBZ; Hm_lvt_b8d70b1e8d60fba1e9c8bd5d6b035f4c=1540540375,1540955353,1541145834,1541562930; Hm_lpvt_b8d70b1e8d60fba1e9c8bd5d6b035f4c=1541659189; bg5D_2132_sendmail=1; bg5D_2132_checkpm=1; bg5D_2132_lastact=1541659204%09misc.php%09patch'
class AsScrapy(object):
    def __init__(self,pages=1):
        try:
            self.m_session = requests.Session()
            self.m_headers = {'User-Agent':USER_AGENT,
                        #'referer':'https://www.aisinei.org/',
                        }
           
            self.m_cookiejar = requests.cookies.RequestsCookieJar()
            for cookie in COOKIES.split(';'):
                # Strip the space that follows each ';' so cookie names are clean
                key,value = cookie.strip().split('=',1)
                self.m_cookiejar.set(key,value)
        except Exception as e:
            print('init error!!!', e)
    def getOverView(self):
        try:
            req = self.m_session.get('https://www.aisinei.org/portal.php',headers=self.m_headers, cookies=self.m_cookiejar, timeout=5)
            html = etree.HTML(req.content.decode('utf-8'))
            #result=html.xpath('//div[@class="bus_vtem"]/a[@title!="緊急通知!緊急通知!緊急通知!"]/attribute::*')
            #print(result)
            htmllist = html.xpath('//div[@class="bus_vtem"]/a[@title!="緊急通知!緊急通知!緊急通知!" and @class="preview"]/@href')
            titlelist = html.xpath('//div[@class="bus_vtem"]/a[@title!="緊急通知!緊急通知!緊急通知!" and @class="preview"]/@title')
            print(htmllist)
            print(titlelist)
            print(len(htmllist))
            print(len(titlelist))            
            time.sleep(1)
        except Exception as e:
            print('get over view error', e)

if __name__ == "__main__":
    asscrapy = AsScrapy()
    asscrapy.getOverView()

 

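With lxml we can pick out the resource URLs from the page.
[Screenshot: 1.png]
Downloading the images themselves only takes sending a request to each URL; that is left as an exercise for the reader.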
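As a starting point for that exercise, here is one possible sketch, not the author's solution. It assumes the thumbnails sit in the same div/a/img structure as the sample markup earlier in this post; extract_image_urls and save_image are hypothetical helper names, and session is expected to behave like a requests.Session.

```python
import os
from lxml import etree


def extract_image_urls(page_html):
    """Return the src of each <img> inside a preview link (assumed page layout)."""
    tree = etree.HTML(page_html)
    return tree.xpath('//div[@class="bus_vtem"]/a[@class="preview"]/img/@src')


def save_image(session, url, dest_dir, headers=None):
    """Fetch url with a requests.Session-like object and write it into dest_dir."""
    resp = session.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    os.makedirs(dest_dir, exist_ok=True)
    # Use the last path segment as the file name; URLs with query strings
    # would need extra cleaning.
    filename = os.path.join(dest_dir, url.rsplit('/', 1)[-1])
    with open(filename, 'wb') as f:
        f.write(resp.content)
    return filename
```

Inside AsScrapy, one might call save_image(self.m_session, url, './images', headers=self.m_headers) for each URL returned by extract_image_urls(req.content.decode('utf-8')), with a short time.sleep between requests.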
Source code download:
https://github.com/secondtonone1/python-