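[Python Scraping 6] Form Interaction
Strictly speaking, the form interaction in this post and the CAPTCHA handling in the next one are not web crawling but web bots in a broader sense. A web bot removes one hurdle: data that can only be extracted after interacting with a form.
1. Manually sending a POST request to submit the login form
First we manually register an account on the example site. Registration requires a CAPTCHA; the next post covers how to handle CAPTCHAs.
1.1 Analyzing the form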
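The login page at http://127.0.0.1:8000/places/default/user/login returns the form shown below. The form has several important parts:
- the action attribute of the form tag: sets the URL the form data is submitted to. Here it is #, i.e. the same URL as the login form;
- the enctype attribute of the form tag: sets the encoding used for the submitted data. Here it is application/x-www-form-urlencoded, which percent-encodes non-alphanumeric characters. That is inefficient for large binary uploads, which is what the multipart/form-data encoding type is for: it does not encode the input, so efficiency is not affected, and instead uses the MIME protocol to send the data as multiple parts, the same standard used for email. Documentation: http://www.w3.org/TR/html5/forms.html#selecting-a-form-submission-encoding
- the method attribute of the form tag: here post means the form data is submitted to the server in the request body;
- the name attribute of the input tags: sets the name of each field when it is submitted to the server.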
<form action="#" enctype= "application/x-www-form-urlencoded" method="post">
<table>
<tr id="auth_user_email__row">
<td class="w2p_fl"><label class="" for="auth_user_email" id="auth_user_email__label">E-mail: </label></td>
<td class="w2p_fw"><input class="string" id="auth_user_email" name="email" type="text" value="" /></td>
<td class="w2p_fc"></td>
</tr>
<tr id="auth_user_password__row">
<td class="w2p_fl"><label class="" for="auth_user_password" id="auth_user_password__label">Password: </label></td>
<td class="w2p_fw"><input class="password" id="auth_user_password" name="password" type="password" value="" /></td>
<td class="w2p_fc"></td>
</tr>
<tr id="auth_user_remember_me__row">
<td class="w2p_fl"><label class="" for="auth_user_remember_me" id="auth_user_remember_me__label">Remember me (for 30 days): </label></td>
<td class="w2p_fw"><input class="boolean" id="auth_user_remember_me" name="remember_me" type="checkbox" value="on" /></td>
<td class="w2p_fc"></td>
</tr>
<tr id="submit_record__row">
<td class="w2p_fl"></td><td class="w2p_fw">
<input type="submit" value="Log In" />
<button class="btn w2p-form-button" onclick="window.location='/places/default/user/register';return false">Register</button>
</td>
<td class="w2p_fc"></td>
</tr>
</table>
<div style="display:none;">
<input name="_next" type="hidden" value="/places/default/index" />
<input name="_formkey" type="hidden" value="7b1add4b-fa91-4301-975e-b6fbf7def3ac" />
<input name="_formname" type="hidden" value="login" />
</div>
</form>
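To make the enctype point concrete, here is a minimal sketch of what application/x-www-form-urlencoded does to field values. It uses Python 3's urllib.parse (the Python 2 counterpart used in the rest of this post is urllib.urlencode), and the field values are made up for illustration so that they contain characters that need escaping.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical field values (not from the example site), chosen because
# '@', '&' and the space all have to be escaped in a urlencoded body.
fields = {'email': 'user@example.com', 'password': 'p&ss word'}

# Build the request body the way a browser would for this enctype.
body = urlencode(fields)
print(body)

# Decoding the body gives back the original values (each as a list).
decoded = parse_qs(body)
print(decoded)
```

Every escaped character grows to three bytes, which is why this encoding is a poor fit for binary file uploads.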
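1.2 Manually testing the POST form submission
A successful login redirects to the home page; a failed one returns to the login page. Below is a first attempt at automating the login. It clearly fails: the final URL is still the login page.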
>>> import urllib,urllib2
>>> LOGIN_URL='http://127.0.0.1:8000/places/default/user/login'
>>> LOGIN_EMAIL='[email protected]'
>>> LOGIN_PASSWORD='wu.com'
>>> data={'email':LOGIN_EMAIL,'password':LOGIN_PASSWORD}
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>>
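The login also needs the hidden _formkey field, a unique ID used to prevent the form from being submitted more than once. A different ID is generated each time the page is loaded, and the server uses the submitted ID to tell whether the form has already been submitted. The following code extracts its value: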
>>>
>>> import lxml.html
>>> def parse_form(html):
...     tree=lxml.html.fromstring(html)
...     data={}
...     for e in tree.cssselect('form input'):
...         if e.get('name'):
...             data[e.get('name')]=e.get('value')
...     return data
...
>>> import pprint
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> form=parse_form(html)
>>> pprint.pprint(form)
{'_formkey': '437e4660-0c44-4187-af8d-36487c62ffce',
'_formname': 'login',
'_next': '/places/default/index',
'email': '',
'password': '',
'remember_me': 'on'}
>>>
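If lxml is not available, the same extraction can be approximated with the standard library. The following is only a sketch under that assumption, written for Python 3; the names FormInputParser and parse_form_stdlib are hypothetical and not part of the original code, and the sample HTML is a shortened copy of the login form above.

```python
from html.parser import HTMLParser


class FormInputParser(HTMLParser):
    """Collect name/value pairs of <input> tags that sit inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.data = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self.in_form = True
        elif tag == 'input' and self.in_form:
            attrs = dict(attrs)
            # Inputs without a name (e.g. the submit button) are not submitted.
            if attrs.get('name'):
                self.data[attrs['name']] = attrs.get('value')

    def handle_endtag(self, tag):
        if tag == 'form':
            self.in_form = False


def parse_form_stdlib(html):
    """Hypothetical stand-in for parse_form() that avoids lxml."""
    parser = FormInputParser()
    parser.feed(html)
    return parser.data


sample = '''<form action="#" method="post">
<input name="email" type="text" value="" />
<input type="submit" value="Log In" />
<input name="_formkey" type="hidden" value="7b1add4b-fa91-4301-975e-b6fbf7def3ac" />
</form>'''
form_fields = parse_form_stdlib(sample)
print(form_fields)
```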
Below is a new version of the automated login that submits _formkey and the other hidden fields. It still fails!
>>>
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>>
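That is because one important piece is still missing: cookies. When a normal user loads the login form, the _formkey value is stored in a cookie, and it is then compared with the _formkey value in the submitted login form data. Below is the code after adding cookie support with the urllib2.HTTPCookieProcessor class. This time the login succeeds!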
>>>
>>> import cookielib
>>> cj=cookielib.CookieJar()
>>> opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>>
>>> html=opener.open(LOGIN_URL).read() #opener
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=opener.open(request) #opener
>>> response.geturl()
'http://127.0.0.1:8000/places/default/index'
>>>
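The session above uses Python 2 modules. As a rough sketch of how the same pieces map to Python 3, the code below builds a cookie-aware opener and the login POST request with urllib.request and http.cookiejar. The helper names make_cookie_opener and build_login_request, and the sample form values and credentials, are assumptions made for illustration; no network request is sent here.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = 'http://127.0.0.1:8000/places/default/user/login'


def make_cookie_opener():
    """Return an opener that stores and resends cookies, plus its jar."""
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    return opener, cj


def build_login_request(form, email, password, url=LOGIN_URL):
    """Fill the parsed form fields and wrap them in a POST request."""
    data = dict(form)
    data['email'] = email
    data['password'] = password
    # In Python 3 the request body must be bytes, not str.
    encoded = urllib.parse.urlencode(data).encode('utf-8')
    return urllib.request.Request(url, data=encoded)


opener, cj = make_cookie_opener()
# Placeholder form values and credentials, for illustration only.
request = build_login_request({'_formkey': 'abc', '_formname': 'login'},
                              'user@example.com', 'secret')
print(request.get_method())
# With the example site running, the form would be fetched and the request
# sent through the same opener so the cookie is reused:
#     opener.open(LOGIN_URL); opener.open(request)
```

Passing data to Request is what turns it into a POST; sending both requests through one opener is what keeps the cookie and the submitted _formkey in step.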
1.3 Complete source code for the manual POST login:
# -*- coding: utf-8 -*-
import urllib
import urllib2
import cookielib
import lxml.html

LOGIN_EMAIL = '[email protected]'
LOGIN_PASSWORD = 'wu.com'
#LOGIN_URL = 'http://example.webscraping.com/user/login'
LOGIN_URL = 'http://127.0.0.1:8000/places/default/user/login'


def login_basic():
    """fails because not using formkey
    """
    data = {'email': LOGIN_EMAIL, 'password': LOGIN_PASSWORD}
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()


def login_formkey():
    """fails because not using cookies to match formkey
    """
    html = urllib2.urlopen(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()


def login_cookies():
    """working login
    """
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = opener.open(request)
    print response.geturl()
    return opener


def parse_form(html):
    """extract all input properties from the form
    """
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data


def main():
    #login_basic()
    #login_formkey()
    login_cookies()


if __name__ == '__main__':
    main()
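2. Logging in with cookies loaded from Firefox
We first log in by hand in Firefox, then close the browser, and then have a Python script reuse the cookies obtained that way, which logs us in automatically.
2.1 Session file location
Firefox stores cookies in a SQLite database and sessions in a JSON file; both can be read directly from Python. For logging in we only need the session. The location of the session file depends on the operating system: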
- Linux:
~/.mozilla/firefox/*.default/sessionstore.js
- OS X:
~/Library/Application Support/Firefox/Profiles/*.default/sessionstore.js
- Windows Vista and later:
%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default/sessionstore.js
Below is a helper function that returns the path of the session file:
import glob
import os

def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        # expanduser resolves "~"; expandvars resolves %APPDATA% on Windows
        matches = glob.glob(os.path.expandvars(os.path.expanduser(filename)))
        if matches:
            return matches[0]
Note: the glob module returns all files that match the given path pattern.
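To see how the *.default wildcard behaves without touching a real Firefox profile, here is a small sketch that builds a throwaway directory tree and globs it. The profile name abcd1234.default is made up; real profile names are generated by Firefox.

```python
import glob
import os
import tempfile

# Build a fake profile directory inside a temporary folder.
root = tempfile.mkdtemp()
profile = os.path.join(root, 'abcd1234.default')  # made-up profile name
os.makedirs(profile)
session_path = os.path.join(profile, 'sessionstore.js')
with open(session_path, 'w') as f:
    f.write('{}')

# '*' matches the random prefix of the profile directory name.
pattern = os.path.join(root, '*.default', 'sessionstore.js')
matches = glob.glob(pattern)
print(matches)
```

glob.glob returns an empty list when nothing matches, which is why find_ff_sessions can simply test the result with if matches.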
2.2 Firefox cookie contents
Below is the content of the Firefox session file on Linux:
[email protected]:~/.mozilla/firefox/78n340f7.default$ ls
addons.json datareporting key3.db prefs.js storage
blocklist.xml extensions logins.json revocations.txt storage.sqlite
bookmarkbackups extensions.ini mimeTypes.rdf saved-telemetry-pings times.json
cert8.db extensions.json minidumps search.json.mozlz4 webapps
compatibility.ini features permissions.sqlite secmod.db webappsstore.sqlite
containers.json formhistory.sqlite places.sqlite sessionCheckpoints.json xulstore.json
content-prefs.sqlite gmp places.sqlite-shm sessionstore-backups
cookies.sqlite gmp-gmpopenh264 places.sqlite-wal sessionstore.js
crashes healthreport pluginreg.dat SiteSecurityServiceState.txt
[email protected]:~/.mozilla/firefox/78n340f7.default$ more sessionstore.js
{"version":["sessionrestore",1],
"windows":[{
...
"cookies":[
{"host":"127.0.0.1",
"value":"127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5",
"path":"/",
"name":"session_id_welcome",
"httponly":true,
"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
{"host":"127.0.0.1",
"value":"True",
"path":"/",
"name":"session_id_places",
"httponly":true,
"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
{"host":"127.0.0.1",
"value":"\":oJoAPvH-ODMFDXwk3U...su0Dxr7doAgu9yQiSEmgQiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA==\"",
"path":"/",
"name":"session_data_places",
"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}}
],
"title":"Example web scraping website",
"_shouldRestore":true,
"closedAt":1485228738310
}],
"selectedWindow":0,
"_closedWindows":[],
"session":{"lastUpdate":1485228738927,"startTime":1485226675190,"recentCrashes":0},
"global":{}
}
[email protected]:~/.mozilla/firefox/78n340f7.default$
Based on this session storage structure, the code below parses the session into a CookieJar object.
import cookielib
import json
import os
import time

def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):
        try:
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    # debug output: show each cookie found in the session file
                    import pprint; pprint.pprint(cookie)
                    c = cookielib.Cookie(0, cookie.get('name', ''), cookie.get('value', ''),
                        None, False,
                        cookie.get('host', ''), cookie.get('host', '').startswith('.'), cookie.get('host', '').startswith('.'),
                        cookie.get('path', ''), False,
                        False, str(int(time.time()) + 3600 * 24 * 7), False,
                        None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj
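The long positional call to cookielib.Cookie is hard to read. As a sketch only, here is a Python 3 variant that names each argument and parses the session from a JSON string rather than from a file. The function name cookies_from_session and its lifetime parameter are assumptions for illustration, and the sample data mirrors one cookie from the session file shown earlier.

```python
import http.cookiejar
import json
import time


def cookies_from_session(json_text, lifetime=3600 * 24 * 7):
    """Hypothetical helper: turn session JSON text into a CookieJar."""
    cj = http.cookiejar.CookieJar()
    session = json.loads(json_text)
    # The session file carries no expiry, so assume one week from now.
    expires = int(time.time()) + lifetime
    for window in session.get('windows', []):
        for cookie in window.get('cookies', []):
            host = cookie.get('host', '')
            dotted = host.startswith('.')
            c = http.cookiejar.Cookie(
                version=0,
                name=cookie.get('name', ''),
                value=cookie.get('value', ''),
                port=None, port_specified=False,
                domain=host, domain_specified=dotted,
                domain_initial_dot=dotted,
                path=cookie.get('path', ''), path_specified=False,
                secure=False, expires=expires, discard=False,
                comment=None, comment_url=None, rest={})
            cj.set_cookie(c)
    return cj


sample = json.dumps({'windows': [{'cookies': [
    {'host': '127.0.0.1', 'name': 'session_id_places',
     'value': 'True', 'path': '/'}]}]})
jar = cookies_from_session(sample)
names = [c.name for c in jar]
print(names)
```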
2.3 Testing the login with the loaded cookies
session_filename = find_ff_sessions()
cj = load_ff_sessions(session_filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open(COUNTRY_URL).read()
tree = lxml.html.fromstring(html)
print tree.cssselect('ul#navbar li a')[0].text_content()
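If the result is Log In, the session was not loaded correctly. In that case, check that you really are logged in to the example site in Firefox. If the result is Welcome followed by the user's first name, the login succeeded.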
[email protected]:~/GitHub/WebScrapingWithPython/6.表單互動$ python 2login_firefox.py
{u'host': u'127.0.0.1',
u'httponly': True,
u'name': u'session_id_welcome',
u'originAttributes': {u'addonId': u'',
u'appId': 0,
u'inIsolatedMozBrowser': False,
u'privateBrowsingId': 0,
u'signedPkg': u'',
u'userContextId': 0},
u'path': u'/',
u'value': u'127.0.0.1-406df419-ed33-4de5-bc46-cd2d9f3c431b'}
Log In
[email protected]:~/GitHub/WebScrapingWithPython/6.表單互動$
[email protected]:~/GitHub/WebScrapingWithPython/6.表單互動$ python 2login_firefox.py