Python Web Scraping: Parsing and Extracting Page Content with the lxml Module

Using CSS selectors:

# -*- coding: utf-8 -*-
from lxml import html

page_html = '''
<html><body>
<input id="input_id" value="input value" name="input_a">
</body></html>
'''

# Parse the HTML string into an element tree
page_tree = html.fromstring(page_html)

# Select by id with a CSS selector; cssselect() returns a list of matching elements
# (recent lxml versions need the separate cssselect package for this call)
ele = page_tree.cssselect('#input_id')

print(html.tostring(ele[0], encoding='unicode'))  # <input id="input_id" value="input value" name="input_a">
print(ele)                  # [<InputElement 30133f0 name='input_a' type='text'>] (the hex address varies)
print(ele[0])               # <InputElement 30133f0 name='input_a' type='text'>
print(ele[0].get('value'))  # input value
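
When only an id lookup is needed, the same element can probably be reached without a CSS selector. The sketch below assumes lxml.html's get_element_by_id() method and reuses the sample markup from above; the variable names are illustrative only.

```python
from lxml import html

page_html = '''
<html><body>
<input id="input_id" value="input value" name="input_a">
</body></html>
'''

page_tree = html.fromstring(page_html)

# Look the element up by its id attribute (no cssselect package involved)
input_ele = page_tree.get_element_by_id('input_id')

# Read a single attribute, or copy all attributes into a plain dict
value = input_ele.get('value')
attrs = dict(input_ele.attrib)

# Passing a default avoids a KeyError when the id does not exist
missing = page_tree.get_element_by_id('no_such_id', None)

print(value)
print(attrs)
print(missing)
```

Supplying a default as the second argument is a design choice for scraping code, where a missing element is often expected rather than exceptional.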

Getting the content inside tags:

# -*- coding: utf-8 -*-
from lxml import html

page_html = '''
<html><body>
<div class="cl">DIV1</div>
<div class="cl">DIV2</div>
</body></html>
'''

page_tree = html.fromstring(page_html)

# findall() returns all direct child tags that match the given name
ele = page_tree.cssselect('body')[0].findall('div')

print(ele[0].text_content().strip())  # DIV1
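
To collect the text of every matching tag rather than only the first, a loop over the findall() result is one option. The sketch below is a minimal illustration, not the author's code: the nested <b> tag is added to the sample markup purely to contrast text_content() with the .text attribute.

```python
from lxml import html

page_html = '''
<html><body>
<div class="cl">DIV1 <b>bold</b> tail</div>
<div class="cl">DIV2</div>
</body></html>
'''

page_tree = html.fromstring(page_html)

# find() returns the first matching direct child, findall() returns all of them
body = page_tree.find('body')
divs = body.findall('div')

# text_content() joins the text of the element and all of its descendants
texts = [div.text_content().strip() for div in divs]

# .text only holds the text that comes before the first child element
leading_text = divs[0].text

print(texts)
print(repr(leading_text))
```

For tags that contain nested markup, text_content() is usually the safer choice, since .text stops at the first child element.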

If you see the following error:
from lxml import html
ImportError: DLL load failed: %1 is not a valid Win32 application.
try reinstalling the lxml module:

python -m pip uninstall lxml
python -m pip install lxml==3.6.0
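
This error commonly points to a 32-bit/64-bit mismatch between the Python interpreter and the installed lxml binary, although that is an assumption about the cause rather than something the message states. Before reinstalling, a standard-library check such as the sketch below shows which architecture the interpreter uses; the helper name interpreter_bits is hypothetical.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 based on the pointer size of the running Python."""
    return struct.calcsize("P") * 8


bits = interpreter_bits()

# Report the interpreter version, bitness, and platform in one line
print("Python %s, %d-bit, %s" % (platform.python_version(), bits, sys.platform))
```

Running pip through the same interpreter (python -m pip, as in the commands above) helps ensure the package that gets installed matches that architecture.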