
[Python Web Scraping 6] Form Interaction


Strictly speaking, the form interaction covered in this post and the CAPTCHA handling covered in the next are not web crawling in the narrow sense, but web bots in a broader sense. Using a web bot removes the form-interaction barrier you would otherwise face when extracting data.

1. Manually sending a POST request to submit the login form

First, register an account manually on the example website. Registration requires a CAPTCHA, which the next post will cover.

1.1 Analyzing the form contents

Loading the login page at http://127.0.0.1:8000/places/default/user/login gives us the form below. The login form contains several important components:

  • The form tag's action attribute: sets the URL the form data is submitted to. Here it is #, meaning the same URL as the login form itself;
  • The form tag's enctype attribute: sets the encoding used for the submitted data. Here it is application/x-www-form-urlencoded, meaning all non-alphanumeric characters are converted to hexadecimal ASCII values. For uploading binary files, the multipart/form-data encoding is the better choice: it does not encode the input (so efficiency is unaffected), instead sending it as multiple parts using the MIME protocol, the same standard used for email transmission. Spec: http://www.w3.org/TR/html5/forms.html#selecting-a-form-submission-encoding (a short demo of the urlencoded behaviour follows the form listing below);
  • The form tag's method attribute: post here means the form data is submitted to the server in the request body;
  • The input tag's name attribute: sets the name under which a field's value is submitted to the server.
<form action="#" enctype="application/x-www-form-urlencoded" method="post">
  <table>
    <tr id="auth_user_email__row">
      <td class="w2p_fl"><label class="" for="auth_user_email" id="auth_user_email__label">E-mail: </label></td>
      <td class="w2p_fw"><input class="string" id="auth_user_email" name="email" type="text" value="" /></td>
      <td class="w2p_fc"></td>
    </tr>
    <tr id="auth_user_password__row">
      <td class="w2p_fl"><label class="" for="auth_user_password" id="auth_user_password__label">Password: </label></td>
      <td class="w2p_fw"><input class="password" id="auth_user_password" name="password" type="password" value="" /></td>
      <td class="w2p_fc"></td>
    </tr>
    <tr id="auth_user_remember_me__row">
      <td class="w2p_fl"><label class="" for="auth_user_remember_me" id="auth_user_remember_me__label">Remember me (for 30 days): </label></td>
      <td class="w2p_fw"><input class="boolean" id="auth_user_remember_me" name="remember_me" type="checkbox" value="on" /></td>
      <td class="w2p_fc"></td>
    </tr>
    <tr id="submit_record__row">
      <td class="w2p_fl"></td>
      <td class="w2p_fw">
        <input type="submit" value="Log In" />
        <button class="btn w2p-form-button" onclick="window.location='/places/default/user/register';return false">Register</button>
      </td>
      <td class="w2p_fc"></td>
    </tr>
  </table>
  <div style="display:none;">
    <input name="_next" type="hidden" value="/places/default/index" />
    <input name="_formkey" type="hidden" value="7b1add4b-fa91-4301-975e-b6fbf7def3ac" />
    <input name="_formname" type="hidden" value="login" />
  </div>
</form>
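
As a quick demo of the application/x-www-form-urlencoded rule mentioned above: in Python 2, urllib.urlencode performs exactly this conversion, turning non-alphanumeric characters into %-escaped hexadecimal ASCII values (the email and password values here are made-up examples):

>>> import urllib
>>> urllib.urlencode([('email', '[email protected]'), ('password', 'p&ss word')])
'email=foo%40example.com&password=p%26ss+word'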

1.2 Manually testing form submission via a POST request

If login succeeds, we are redirected to the home page; otherwise we end up back at the login page. Below is a first attempt at automated login. It clearly fails!

>>> import urllib,urllib2
>>> LOGIN_URL='http://127.0.0.1:8000/places/default/user/login'
>>> LOGIN_EMAIL='[email protected]'
>>> LOGIN_PASSWORD='wu.com'
>>> data={'email':LOGIN_EMAIL,'password':LOGIN_PASSWORD}
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>> 

That is because login also requires the hidden _formkey field. This unique ID is used to prevent a form from being submitted more than once: a new ID is generated each time the page loads, and the server uses it to tell whether a given form has already been submitted. The following code retrieves that value:

>>> 
>>> import lxml.html
>>> def parse_form(html):
...     tree=lxml.html.fromstring(html)
...     data={}
...     for e in tree.cssselect('form input'):
...             if e.get('name'):
...                     data[e.get('name')]=e.get('value')
...     return data
... 
>>> import pprint
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> form=parse_form(html)
>>> pprint.pprint(form)
{'_formkey': '437e4660-0c44-4187-af8d-36487c62ffce',
 '_formname': 'login',
 '_next': '/places/default/index',
 'email': '',
 'password': '',
 'remember_me': 'on'}
>>> 

Below is a new version of the automated login code that submits _formkey and the other hidden fields. It still fails!

>>> 
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>> 

That is because we are missing one crucial component: cookies. When a regular user loads the login form, the _formkey value is stored in a cookie, and the server compares it against the _formkey value in the submitted form data. Below is the code with cookie support added via the urllib2.HTTPCookieProcessor class. This time, login succeeds!

>>> 
>>> import cookielib
>>> cj=cookielib.CookieJar()
>>> opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>> 
>>> html=opener.open(LOGIN_URL).read()		#opener
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=opener.open(request)		#opener
>>> response.geturl()
'http://127.0.0.1:8000/places/default/index'
>>> 

1.3 Complete source code for handling the POST login manually:

# -*- coding: utf-8 -*-
import urllib
import urllib2
import cookielib
import lxml.html

LOGIN_EMAIL = '[email protected]'
LOGIN_PASSWORD = 'wu.com'
#LOGIN_URL = 'http://example.webscraping.com/user/login'
LOGIN_URL = 'http://127.0.0.1:8000/places/default/user/login'


def login_basic():
    """fails because not using formkey
    """
    data = {'email': LOGIN_EMAIL, 'password': LOGIN_PASSWORD}
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()

def login_formkey():
    """fails because not using cookies to match formkey
    """
    html = urllib2.urlopen(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()

def login_cookies():
    """working login
    """
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = opener.open(request)
    print response.geturl()
    return opener

def parse_form(html):
    """extract all input properties from the form
    """
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data

def main():
    #login_basic()
    #login_formkey()
    login_cookies()

if __name__ == '__main__':
    main()
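
Because login_cookies() returns the opener that holds the session cookie, the same opener can be reused for subsequent authenticated requests. Below is a minimal sketch that assumes it runs in the same module as the listing above; the navbar selector is the same one used in section 2.3 below:

import lxml.html

# log in and reuse the cookie-carrying opener for a follow-up request
opener = login_cookies()
html = opener.open('http://127.0.0.1:8000/places/default/index').read()
tree = lxml.html.fromstring(html)
# the first navbar link reads "Log In" when logged out,
# or a welcome message once the session cookie is accepted
print tree.cssselect('ul#navbar li a')[0].text_content()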

2. Logging in to the website with cookies loaded from the Firefox browser

First we log in manually in the Firefox browser, then close the browser, and then a Python script reuses the cookies left behind, achieving an automated login.

2.1 Location of the session file

Firefox stores its cookies in a SQLite database and its sessions in a JSON file, and both can be read directly from Python (a sketch of reading the cookie database directly appears at the end of this subsection). For logging in, we only need the session. The location of the Firefox session file differs by operating system:

  • Linux: ~/.mozilla/firefox/*.default/sessionstore.js
  • OS X: ~/Library/Application Support/Firefox/Profiles/*.default/sessionstore.js
  • Windows Vista and later: %APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default/sessionstore.js

Below is a helper function that returns the path to the session file:

import glob
import os

def find_ff_sessions():
    # candidate Firefox profile directories for Linux, OS X, and Windows
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        # expand ~ and resolve wildcards; return the first session file found
        matches = glob.glob(os.path.expanduser(filename))
        if matches:
            return matches[0]

Note: the glob module returns all files matching the given path pattern.
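
The cookies themselves, as noted above, live in a SQLite database (cookies.sqlite in the same profile directory). They are not needed for the session-based login below, but for completeness here is a minimal sketch of reading them directly with the standard library's sqlite3 module; it assumes the moz_cookies table layout, which can vary between Firefox versions:

import os
import sqlite3

def load_ff_cookies_sqlite(profile_dir):
    """Sketch: read (host, name, value) rows from Firefox's cookie database."""
    db_path = os.path.join(profile_dir, 'cookies.sqlite')
    conn = sqlite3.connect(db_path)  # may be locked while Firefox is running
    try:
        # moz_cookies is the table Firefox uses for persistent cookies
        return conn.execute('SELECT host, name, value FROM moz_cookies').fetchall()
    finally:
        conn.close()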

2.2 Firefox browser cookie contents

Below is the content of the Firefox session file on a Linux system:

[email protected]:~/.mozilla/firefox/78n340f7.default$ ls
addons.json           datareporting       key3.db             prefs.js                      storage
blocklist.xml         extensions          logins.json         revocations.txt               storage.sqlite
bookmarkbackups       extensions.ini      mimeTypes.rdf       saved-telemetry-pings         times.json
cert8.db              extensions.json     minidumps           search.json.mozlz4            webapps
compatibility.ini     features            permissions.sqlite  secmod.db                     webappsstore.sqlite
containers.json       formhistory.sqlite  places.sqlite       sessionCheckpoints.json       xulstore.json
content-prefs.sqlite  gmp                 places.sqlite-shm   sessionstore-backups
cookies.sqlite        gmp-gmpopenh264     places.sqlite-wal   sessionstore.js
crashes               healthreport        pluginreg.dat       SiteSecurityServiceState.txt
[email protected]:~/.mozilla/firefox/78n340f7.default$ more sessionstore.js 
{"version":["sessionrestore",1],
"windows":[{
	...
	"cookies":[
		{"host":"127.0.0.1",
		"value":"127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5",
		"path":"/",
		"name":"session_id_welcome",
		"httponly":true,
		"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
		{"host":"127.0.0.1",
		"value":"True",
		"path":"/",
		"name":"session_id_places",
		"httponly":true,
		"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
		{"host":"127.0.0.1",
		"value":"\":oJoAPvH-ODMFDXwk3U...su0Dxr7doAgu9yQiSEmgQiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA==\"",
		"path":"/",
		"name":"session_data_places",
		"originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}}
	],
	"title":"Example web scraping website",
	"_shouldRestore":true,
	"closedAt":1485228738310
}],
"selectedWindow":0,
"_closedWindows":[],
"session":{"lastUpdate":1485228738927,"startTime":1485226675190,"recentCrashes":0},
"global":{}
}

[email protected]:~/.mozilla/firefox/78n340f7.default$ 

Based on this session storage structure, the following code parses the session into a CookieJar object.

import cookielib
import json
import os
import pprint
import time

def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):
        try:
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    pprint.pprint(cookie)  # debug: show each cookie as it is loaded
                    # positional args: version, name, value, port, port_specified,
                    # domain, domain_specified, domain_initial_dot, path, path_specified,
                    # secure, expires, discard, comment, comment_url, rest
                    c = cookielib.Cookie(0, cookie.get('name', ''), cookie.get('value', ''),
                        None, False,
                        cookie.get('host', ''), cookie.get('host', '').startswith('.'), cookie.get('host', '').startswith('.'),
                        cookie.get('path', ''), False,
                        False, str(int(time.time()) + 3600 * 24 * 7), False,
                        None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj

2.3 Testing login with the loaded cookies

import urllib2
import lxml.html

# COUNTRY_URL is defined elsewhere in the original script; any page of the
# example site works, so the index page is assumed here
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/index'

session_filename = find_ff_sessions()
cj = load_ff_sessions(session_filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open(COUNTRY_URL).read()

tree = lxml.html.fromstring(html)
print tree.cssselect('ul#navbar li a')[0].text_content()

If the printed result is Log In, the session was not loaded correctly; in that case, check that you have actually logged in to the example website in Firefox (as in the run below). If the output instead shows Welcome followed by the user's first name, the login succeeded.

[email protected]:~/GitHub/WebScrapingWithPython/6.表單互動$ python 2login_firefox.py 
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_welcome',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'127.0.0.1-406df419-ed33-4de5-bc46-cd2d9f3c431b'}
Log In
[email protected]:~/GitHub/WebScrapingWithPython/6.表單互動$ 
[email protected]:~/GitHub/WebScrapingWithPython/6.表單互動$ python 2login_firefox.py