Python Web Scraping Series: PyQuery Explained

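PyQuery is a powerful and flexible library for parsing web pages. If regular expressions feel too tedious to write, if BeautifulSoup's syntax is too hard to remember, and if you already know jQuery's syntax, then PyQuery is likely your best choice.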

Installation

pip3 install pyquery

Usage

Initializing from a string

html='''
 <div>
   <ul>
     <li class="item-0">first item</li>
     <li class="item-1"><a href="link2.html">second item</a></li>
     <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
     <li class="item-1 active"><a href="link4.html">fourth item</a></li>
     <li class="item-0"><a href="link5.html">fifth item</a></li>
   </ul>
 </div>
 '''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))

The output looks like this:

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

Initializing from a URL

from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com',encoding='utf-8')
print(doc('head'))

Pass the URL directly. The output looks like this:

<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head> 

Initializing from a file

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))

Basic CSS selectors

html = '''<div id="container">
  <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))

Output:

    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>

Finding elements

Child elements

html = '''
<div id="container">
  <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)

The output is:
<class 'pyquery.pyquery.PyQuery'>

  <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>

find() searches all descendants of the current selection, not only its direct children:

lis = items.find('li')
print(type(lis))
print(lis)

Output:
<class 'pyquery.pyquery.PyQuery'>

    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>