Simulating a Zhihu Login with Python

阿新 • Published: 2019-02-04

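Preface

Two days ago I came across a post by someone who had scraped more than 500,000 Zhihu comments. I was envious and wanted to try it myself, to see whether I could dig out anything valuable.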

Prerequisites

Below is a brief look at how I understand some common anti-crawling techniques.

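headers

Many servers now restrict crawlers, and a very common measure is to inspect the request headers sent by the "client". This simple check is often enough to tell a crawler apart from a real user (although it is very easy to get around in Python).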
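As a minimal sketch of how a Python client attaches browser-like headers, the snippet below builds a request object without sending it. It uses the standard library's urllib.request so that it runs offline; the header values and the example.com URL are illustrative assumptions, not values any particular site requires. With requests, the same dictionary would be passed through the headers= parameter.

```python
import urllib.request

# Illustrative browser-like headers (assumed values, not required by any site)
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}


def build_request(url, extra_headers=None):
    """Return a Request carrying the browser-like headers plus any extras."""
    headers = dict(BROWSER_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)


# Nothing is sent here; we only inspect what would be sent.
req = build_request("http://example.com/", {"Referer": "http://example.com/"})
# urllib stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))
```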

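Referer

The Referer field is very practical. For one thing, it can be used for hotlink protection of a site's own resources. A familiar example: you copy an image link from somewhere else, paste it into your blog, and a "blocked" placeholder shows up instead.
That is the Referer at work. When the server receives a request, it first checks whether the Referer is an address on its own site. If so, it returns the real resource; if not, it returns a "warning" resource prepared in advance.
[Image: the Referer field]

So when writing a crawler (especially one that downloads other people's images), adding a Referer field helps a lot.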
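The server-side decision described above can be sketched roughly as follows. This is only an illustration of the idea, not how any real site implements hotlink protection; the host name blog.example.com and the two file names are hypothetical.

```python
from urllib.parse import urlparse


def is_same_site(referer, allowed_host):
    """Return True when the Referer header points at allowed_host."""
    if not referer:
        return False
    return urlparse(referer).hostname == allowed_host


def choose_resource(referer, allowed_host="blog.example.com"):
    """Serve the real image to same-site requests, a warning image otherwise."""
    if is_same_site(referer, allowed_host):
        return "real-image.png"      # hypothetical real resource
    return "warning-image.png"       # hypothetical "blocked" placeholder


print(choose_resource("http://blog.example.com/post/1"))
print(choose_resource("http://other.example.org/"))
```

A crawler that sets Referer to a page on the target site would fall into the first branch, which is why adding the field helps.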

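User-Agent

The User-Agent field needs little introduction. Most sites with any anti-crawling handling check this field to decide whether the client is a crawler or a browser.

If the request comes from a crawler that has not set browser-like headers, the server will not return the real content; only if the field looks right does the request go on to the next anti-crawling check.
If the site stops at this step and your program happens to set a User-Agent, you can usually get through without trouble.
[Image: the User-Agent field]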
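A naive version of such a server-side check might look like the sketch below. The list of prefixes is an assumption chosen for illustration (python-requests, Python-urllib and curl are the default User-Agent prefixes of those tools); real sites use far more elaborate rules.

```python
# Default User-Agent prefixes of a few common non-browser clients (lowercased)
KNOWN_TOOL_PREFIXES = ("python-requests", "python-urllib", "curl")


def looks_like_browser(headers):
    """Very rough check: reject a missing UA or a well-known tool UA."""
    user_agent = headers.get("User-Agent", "")
    if not user_agent:
        return False
    return not user_agent.lower().startswith(KNOWN_TOOL_PREFIXES)


print(looks_like_browser({"User-Agent": "python-requests/2.31.0"}))
print(looks_like_browser({"User-Agent": "Mozilla/5.0 (Windows NT 10.0)"}))
```

This also shows why the check is weak on its own: any client that copies a browser's User-Agent string passes it.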

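Hidden fields

Very often, a simulated login has to submit more than just a username and password; there is hidden-field data as well. Take CSDN as an example. If you view the login page

https://passport.csdn.net/account/login

you will find content like this in the page source:
[Image: hidden fields in the CSDN login page source]

In other words, if your program only POSTs username and password, it can never enter the webflow, because the server also checks the request for the two hidden fields lt and execution.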
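To make this concrete, here is a small sketch that collects every hidden input from a form so it can be sent back with the credentials. The HTML snippet and the values LT-demo-123 and e1s1 are made up for the example; only the field names lt and execution come from the text above. It uses the standard library's html.parser so it runs without extra packages.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


# Made-up login form; the hidden values are placeholders
SAMPLE_FORM = """
<form action="/account/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="lt" value="LT-demo-123">
  <input type="hidden" name="execution" value="e1s1">
</form>
"""


def extract_hidden_fields(html):
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


# Merge the hidden fields into the data that will be POSTed
payload = {"username": "demo", "password": "demo"}
payload.update(extract_hidden_fields(SAMPLE_FORM))
print(payload)
```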

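Other

There are many more anti-crawling measures. My own experience is still limited, so I cannot list them all here. If you have relevant experience, feel free to leave a comment and I will update the post promptly. I am all for exchanging ideas in the spirit of learning.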

Simulated login

Before simulating a real Zhihu login, let me write a small example to make the idea stick.

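Simulating anti-crawling

Simulating anti-crawling needs server-side support, so below is a minimal setup that mimics the whole process.

Server side

login.php

First, login.php: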

<?php
/**
 * @Author: 郭 璞
 * @File: login.php
 * @Time: 2017/4/7
 * @Contact: [email protected]
 * @blog: http://blog.csdn.net/marksinoberg
 * @Description: Simulated anti-crawling check
 **/

// Read the fields defensively so a missing key does not raise a notice
$username = $_POST['username'] ?? '';
$password = $_POST['password'] ?? '';
$token = $_POST['token'] ?? null;

if (!isset($token)) {
    // No token at all: reject immediately
    echo "Login failed!";
    exit(0);
} else {
    // A deliberately simple token rule; real sites use something far more complex
    $target_token = $username.$username;
    if ($token === $target_token) {
        if ($username === '123456' and $password === '123456') {
            echo "Login succeeded!<br>Username: ".$username."<br>Password: ".$password."<br>token: ".$token;
        } else {
            echo "Wrong username or password!";
        }
    } else {
        echo "Token verification failed!";
    }
}

login.html

The matching front-end code is simply: login.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>郭璞's Little Nest</title>
</head>
<body>

<form action="login.php" method="post">
    Username: <input type="text" name="username" id="username" required><br>
    Password: <input type="password" name="password" required><br>
    <input type="hidden" name="token" id="token" value="">
    <hr>
    <input type="submit" value="Log in">
</form>
<script>
    // Fill the hidden token when the username field loses focus
    document.getElementById('username').onblur = function() {
        var username = document.getElementById("username").value;
        var token = document.getElementById('token');
        token.value = username+username;
    }
</script>

</body>
</html>

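Browser test

Submitting the correct username and password gives:

[Image: submitting correct credentials]

It is easy to see that the server and the client use the same rule to compute the token, which lets the server do a simple screening of incoming login requests. Normal browser requests pass without any problem.

A wrong username or password gives:

[Image: wrong username or password]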

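Crawler without the hidden field

If a crawler does not send the hidden field, it cannot log in. Let's first see how such a naive crawler fails.
It takes very little Python to write one, so Python it is. Other API testing tools such as Postman and Selenium are also convenient, but I will leave them aside here.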

# coding: utf8

# @Author: 郭 璞
# @File: sillyway.py
# @Time: 2017/4/7
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Naive crawler that does not send the hidden field

import requests

url = "http://localhost/phpstorm/pachong/login.php"

# Only the visible form fields; the hidden token is missing on purpose
payload = {
    'username': '123456',
    'password': '123456'
}

response = requests.post(url=url, data=payload)
print(response.text)

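The result:
[Image: naive crawler without the hidden field]

Compare this with how the PHP file handles the request and the logic becomes clear: the request carries no token, so login.php takes the !isset($token) branch and reports a failed login.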
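For readers who would rather follow the decision in Python, the sketch below mirrors the branches of login.php as a pure function over the submitted form data. It is a paraphrase for illustration only; the function name check_login is made up and the returned messages are English stand-ins for what the PHP script echoes.

```python
def check_login(form):
    """Mirror of the login.php branches; `form` is a dict of POSTed fields."""
    if "token" not in form:
        # Same as the !isset($token) branch
        return "Login failed!"
    username = form.get("username", "")
    password = form.get("password", "")
    # Demo rule from login.php: the token is the username repeated twice
    if form["token"] != username + username:
        return "Token verification failed!"
    if username == "123456" and password == "123456":
        return "Login succeeded!"
    return "Wrong username or password!"


# The naive crawler's payload has no token, so it is rejected up front
print(check_login({"username": "123456", "password": "123456"}))
```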

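Crawler with the hidden field

The failed attempt above shows why the hidden field must be sent. Let's improve the crawler.

Since we "don't know" exactly how the server handles the token, we have to start from the client-side page.
See the image below.
[Image: reading the hidden field on the client side]

Note: purely for the convenience of the demo, the token here is computed when the username field loses focus. In practice the server usually computes the token in advance and writes it into the token field before the page reaches the browser. So don't read too much into this implementation.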

Python code

# coding: utf8

# @Author: 郭 璞
# @File: addhiddenvalue.py
# @Time: 2017/4/7
# @Contact: [email protected]
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Crawler that also sends the hidden field

import requests

url = 'http://localhost/phpstorm/pachong/login.php'

# The token follows the rule seen in the page's script: username + username
payload = {
    'username': '123456',
    'password': '123456',
    'token': '123456123456'
}

response = requests.post(url, data=payload)
print(response.text)

The result:
[Image: crawler with the token field added]

Do hidden fields make more sense now?
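Rather than hard-coding the token, a crawler can derive it with the same rule the demo page's script applies on blur. The helper names below are hypothetical, and the rule (username repeated twice) applies only to this demo server.

```python
def make_token(username):
    """Demo rule from login.html: token = username + username."""
    return username + username


def build_payload(username, password):
    """Assemble the form data, including the hidden token field."""
    return {
        "username": username,
        "password": password,
        "token": make_token(username),
    }


print(build_payload("123456", "123456"))
```

The resulting dictionary can be passed as the data argument of requests.post, exactly like the hand-written payload above.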

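Simulating the Zhihu login

Following the same logic, here is what we need to do:

First open the pre-login page to obtain the hidden field values that must be submitted.
Then POST to the same path again, with all the required data prepared.
Fetch the page content and parse it, or process it some other way.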
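The three steps above can be sketched as one function that works with any session object offering get and post, such as requests.Session. To keep the sketch runnable offline it is exercised against a small fake session; FakeSession, FakeResponse, the URL and the token value abc123 are all invented for the example, and the regular expression assumes the name attribute appears before value in the input tag.

```python
import re

# Assumes name="_xsrf" comes before value="..." inside the <input> tag
XSRF_PATTERN = re.compile(r'<input[^>]*name="_xsrf"[^>]*value="([^"]*)"')


def extract_xsrf(html):
    match = XSRF_PATTERN.search(html)
    if match is None:
        raise ValueError("no _xsrf hidden field found")
    return match.group(1)


def login(session, login_url, email, password):
    # Step 1: open the pre-login page and read the hidden field
    page = session.get(login_url)
    token = extract_xsrf(page.text)
    # Step 2: POST to the same path with everything the server expects
    payload = {"_xsrf": token, "email": email, "password": password}
    response = session.post(login_url, data=payload)
    # Step 3: hand the content back for parsing or other processing
    return response.text


class FakeResponse:
    def __init__(self, text):
        self.text = text


class FakeSession:
    """Offline stand-in for requests.Session, used only to exercise login()."""

    def __init__(self):
        self.posted = None

    def get(self, url):
        return FakeResponse('<input type="hidden" name="_xsrf" value="abc123">')

    def post(self, url, data=None):
        self.posted = data
        return FakeResponse("ok" if data.get("_xsrf") == "abc123" else "denied")


fake = FakeSession()
print(login(fake, "http://example.invalid/login", "user@example.com", "secret"))
```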

With the approach clear, let's go straight to the code.

# coding: utf8

# @Author: 郭 璞
# @File: ZhiHuLogin.py
# @Time: 2017/4/7
# @Contact: [email protected]
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Simulated Zhihu login
import getpass
import os

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.zhihu.com",
    "Upgrade-Insecure-Requests": "1",
}

############################# Log in via the email endpoint
loginurl = 'http://www.zhihu.com/login/email'
session = requests.session()
html = session.get(url=loginurl, headers=headers).text
soup = BeautifulSoup(html, 'html.parser')
# The hidden _xsrf field has to be sent back with the login form
xsrf_token = soup.find('input', {'name': '_xsrf'})['value']
print("Login xsrf_token: " + xsrf_token)

############################ Download the captcha for manual entry
checkcodeurl = 'http://www.zhihu.com/captcha.gif'
checkcode = session.get(url=checkcodeurl, headers=headers).content
with open('./checkcode.png', 'wb') as f:
    f.write(checkcode)
print('The captcha image has been opened; please type it in')
os.startfile(r'checkcode.png')  # os.startfile is Windows-only
checkcode = input('Captcha: ')
os.remove(r'checkcode.png')
############################ Log in
payload = {
    '_xsrf': xsrf_token,
    'email': input('Email: '),
    'password': getpass.getpass(prompt="Password: "),  # not echoed, unlike input()
    'remember_me': 'true',
    'captcha': checkcode
}
# Send the same headers as the earlier requests
response = session.post(loginurl, headers=headers, data=payload)
print("*"*100)

result = response.text
print("Login response: " + result)
# Fetch a page through the logged-in session
tempurl = 'https://www.zhihu.com/question/57964452/answer/155231804'
tempresponse = session.get(tempurl, headers=headers)
soup = BeautifulSoup(tempresponse.text, 'html.parser')
print(soup.title)

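The result:

[Image: simulated Zhihu login result]

As the output shows, we correctly obtained the title. (You might say a normal, unauthenticated request would get the same content, but here it comes from a logged-in session. Keep that in mind.)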

Updated Zhihu login simulation

Code

# coding: utf8

# @Author: 郭 璞
# @File: MyZhiHuLogin.py
# @Time: 2017/4/8
# @Contact: [email protected]
# @blog: http://blog.csdn.net/marksinoberg
# @Description: My simulated Zhihu login

import requests
from bs4 import BeautifulSoup
import os, time
# import http.cookiejar as cookielib

# Build the request headers
agent = 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36'
headers = {
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    'User-Agent': agent
}

######### Build the session used for all network requests
session = requests.Session()
# session.cookies = cookielib.LWPCookieJar(filename='zhihucookie')
# try:
#     session.cookies.load(ignore_discard=True)
# except:
#     print('Could not load the cookie file')

############ Get the xsrf_token
homeurl = 'https://www.zhihu.com'
homeresponse = session.get(url=homeurl, headers=headers)
homesoup = BeautifulSoup(homeresponse.text, 'html.parser')
xsrfinput = homesoup.find('input', {'name': '_xsrf'})
xsrf_token = xsrfinput['value']
print("xsrf_token obtained: ", xsrf_token)

########## Fetch the captcha image
randomtime = str(int(time.time() * 1000))
captchaurl = 'https://www.zhihu.com/captcha.gif?r='+\
             randomtime+"&type=login"
captcharesponse = session.get(url=captchaurl, headers=headers)
with open('checkcode.gif', 'wb') as f:
    f.write(captcharesponse.content)
# os.startfile('checkcode.gif')
# Note: this first captcha is not added to postdata; see the retry below
captcha = input('Captcha: ')
print(captcha)

########### Log in
headers['X-Xsrftoken'] = xsrf_token
headers['X-Requested-With'] = 'XMLHttpRequest'
loginurl = 'https://www.zhihu.com/login/email'
postdata = {
    '_xsrf': xsrf_token,
    'email': 'your_email@qq.com',
    'password': 'your_password'
}
loginresponse = session.post(url=loginurl, headers=headers, data=postdata)
print('Server response status code: ', loginresponse.status_code)
print(loginresponse.json())
# The login fails on the captcha; my guess is that the captcha request in the session has expired
if loginresponse.json()['r']==1:
    # Fetch and enter the captcha again; the second attempt then succeeds. In other words,
    # you can skip the captcha or enter a wrong one the first time; only the second attempt counts
    randomtime = str(int(time.time() * 1000))
    captchaurl = 'https://www.zhihu.com/captcha.gif?r=' + \
                 randomtime + "&type=login"
    captcharesponse = session.get(url=captchaurl, headers=headers)
    with open('checkcode.gif', 'wb') as f:
        f.write(captcharesponse.content)
    os.startfile('checkcode.gif')  # os.startfile is Windows-only
    captcha = input('Captcha: ')
    print(captcha)

    postdata['captcha'] = captcha
    loginresponse = session.post(url=loginurl, headers=headers, data=postdata)
    print('Server response status code: ', loginresponse.status_code)
    print(loginresponse.json())

########################## Save the cookies after logging in
# session.cookies.save()
############################ Check whether the login succeeded
profileurl = 'https://www.zhihu.com/settings/profile'
profileresponse = session.get(url=profileurl, headers=headers)
print('Profile page status code: ', profileresponse.status_code)
profilesoup = BeautifulSoup(profileresponse.text, 'html.parser')
div = profilesoup.find('div', {'id': 'rename-section'})
print(div)
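The script builds the captcha URL twice and checks the r field of the JSON reply inline. As a hedged sketch, those two pieces could be factored into helpers like the ones below. The function names are made up, and the meaning of r (1 signals a failed attempt) is taken only from the check in the script, not from any Zhihu documentation.

```python
import time


def build_captcha_url(now=None):
    """Captcha URL with a millisecond timestamp, as built in the script."""
    seconds = time.time() if now is None else now
    return "https://www.zhihu.com/captcha.gif?r={}&type=login".format(
        int(seconds * 1000))


def login_failed(result):
    """The script retries when the JSON reply has r == 1."""
    return result.get("r") == 1


print(build_captcha_url())
```

Passing an explicit now value makes the URL deterministic, which is convenient when testing the helper.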

Verification

[Image: updated Zhihu login simulation]

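Summary

After today's experiments I realized that my understanding of how web pages are handled was not as solid as I had thought.

For "static" pages, the usual urllib and requests should be enough.
For dynamic pages, you can use a headless browser such as PhantomJS, or a browser-automation tool such as Selenium.

My own handling has never been lean enough, though, which led to many unexpected problems when crawling pages that redirect.

This crawler still has plenty of room for improvement.

There is one more powerful tool for simulated logins: cookies. When I have time I will look into logging in with cookies. That's all for today.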
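As a small preview of the cookie approach, the sketch below saves a cookie jar to disk and loads it back with the standard library's http.cookiejar.LWPCookieJar, the same class that appears in the commented-out lines of the script above. The cookie name, value, domain and file name are invented for the example; nothing here talks to Zhihu.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name, value, domain):
    """Build a session cookie by hand (normally the server sets these)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def save_and_reload(path):
    """Write a jar to `path`, then read it back into a fresh jar."""
    jar = LWPCookieJar(filename=path)
    jar.set_cookie(make_cookie("demo_session", "abc123", "www.example.com"))
    # ignore_discard=True keeps session cookies that would otherwise be skipped
    jar.save(ignore_discard=True)

    restored = LWPCookieJar(filename=path)
    restored.load(ignore_discard=True)
    return {cookie.name: cookie.value for cookie in restored}


cookie_path = os.path.join(tempfile.mkdtemp(), "democookie")
print(save_and_reload(cookie_path))
```

With requests, the jar would be assigned to session.cookies before logging in, so that a later run can load the saved cookies instead of logging in again.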
