Scraping with requests: log in, then follow links

阿新 • Published: 2019-01-14

First, what the client needs scraped is the detail data of each tender listed under http://www.huobiao.cn/search?word=&block=1.

Without logging in, some key fields of a tender's details are hidden. After logging in, all of them are displayed.

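Analysis shows that the crawler has to request data from three places: the first is the login endpoint, the second returns the data for each page of the list, and the third is the detail URL of each announcement, found in the list data, from which the detailed information is fetched. That is a bit abstract, so let's look at what each step does.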

Step 1: Log in

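Open the browser's developer tools, select the Network tab, and go through the following sequence.

First click Log in. A login box pops up, but the developer tools show no change of URL, so presumably this is just a JavaScript action on the current page rather than a navigation. That doesn't get in our way: without entering any credentials, click "Log in now" (立即登入), and a request shows up in the list: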

The request URL is http://www.huobiao.cn/do_login, and the request carries three fields for the server to validate. From this we can build the requests call:

import requests

login_session = requests.Session()
# Build the request headers to match what the browser sends
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36',
    'Referer': 'http://www.huobiao.cn/search?word=&block=1',
}


def login():
    data = {
        'phone': 'xxxxxxxx',
        'password': 'xxxxxxx',
        'check': 'on',
    }
    # Build the request the server expects; the resulting session cookie is kept in login_session
    response = login_session.post('http://www.huobiao.cn/do_login', data=data, verify=False, headers=headers)
    print(response.content)

Print the response to see what came back. It is JSON:

b'{"code":"1","msg":"\\u767b\\u5f55\\u6210\\u529f","ret_data":[],"timestamp":1537496282}'

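The msg value looks like an escaped Unicode string. Converting it with the webmaster-tools site (http://tool.chinaz.com/Tools/Unicode.aspx) shows what it says:

b'{"code":"1","msg":"登录成功","ret_data":[],"timestamp":1537496282}'

OK, the message reads 登录成功, "login successful". A sigh of relief.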
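The conversion website is not strictly needed, since Python's json module decodes the \uXXXX escapes on its own. Here is a minimal sketch. The helper name login_succeeded is hypothetical, and treating code "1" as success is an assumption based only on the single response shown above; the source does not say what other code values mean.

```python
import json

# The raw body printed above, as returned by /do_login
raw = b'{"code":"1","msg":"\\u767b\\u5f55\\u6210\\u529f","ret_data":[],"timestamp":1537496282}'


def login_succeeded(body: bytes) -> bool:
    """Hypothetical helper: treat code == "1" as success (an assumption)."""
    reply = json.loads(body)  # json.loads accepts bytes and decodes \uXXXX escapes
    return reply.get("code") == "1"


reply = json.loads(raw)
print(reply["msg"])           # the decoded message text
print(login_succeeded(raw))
```

Calling a check like this right after the login POST would let the script stop early instead of scraping pages with the key fields still hidden.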

Step 2: Request the list page data

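When we click Search on the list page, a very suspicious-looking request appears.

Analysis confirms that this is the URL that serves the list data: http://www.huobiao.cn/do_search. Let's look at what it sends and what it returns.

The JSON data sent with the request is: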

{"type":"0","time":"1537496995","province":"全國","start":"","end":"","category":"","item":"0","page":1,"word":"","project_type":"","tap":1}

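These are the search filters from the web page. Two fields deserve attention: page is the current page number of the list, and time is a 10-digit number, a Unix timestamp in seconds. We build the request payload in the same format.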

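import time


def search_list():
    # Build the payloads for the first five pages of results
    s_l = []
    for i in range(5):
        time.sleep(3)  # pause so consecutive payloads carry different timestamps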
        search = {
            "type": "0",
            "time": str(int(time.time())),
            "province": "全國",
            "start": "",
            "end": "",
            "category": "",
            "item": "0",
            "page": i+1,
            "word": "",
            "project_type": "",
            "tap": 1,
        }
        s_l.append(search)
    return s_l
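To double-check the reading of the time field, the captured value can be converted back into a date. This is a small sketch using only the standard library. Whether the server actually validates the timestamp is not known from the capture, so sending the current time is simply the safe imitation of the browser.

```python
import time
from datetime import datetime, timezone

captured = "1537496995"  # the "time" value from the captured request

# A 10-digit value is a count of seconds since the Unix epoch
moment = datetime.fromtimestamp(int(captured), tz=timezone.utc)
print(moment.isoformat())

# What the payload builder sends: the current time as a string of whole seconds
now_field = str(int(time.time()))
print(now_field, len(now_field))
```

The converted date falls on the day the request was captured, which supports the timestamp interpretation.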

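With this payload in hand, we POST to http://www.huobiao.cn/do_search. The response is JSON delivered as a byte string, which we then parse.

Now we can see plenty of useful fields, such as title and address. The most important one for us is href, which has to be prefixed with http://www.huobiao.cn before it can be visited. That href is the URL of the tender detail page we are ultimately after. One step remains: scraping the tender detail pages!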
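The parse-and-join step can be sketched on its own. The sample body below is hypothetical: only the key names (ret_data, list, title, address, href) come from the article, while the values and the shape of the href path are made up for illustration. The helper name detail_urls is also hypothetical. urljoin from the standard library is used here in place of plain string concatenation because it copes with a missing or doubled slash.

```python
import json
from urllib.parse import urljoin

BASE_URL = "http://www.huobiao.cn"

# Hypothetical response body: only the key names are taken from the article
sample = (
    b'{"code":"1","ret_data":{"list":['
    b'{"title":"demo","address":"demo","href":"/demo/detail/1"}]}}'
)


def detail_urls(body: bytes) -> list:
    """Return the absolute detail-page URL of every item in a /do_search reply."""
    reply = json.loads(body)
    return [urljoin(BASE_URL, item["href"]) for item in reply["ret_data"]["list"]]


print(detail_urls(sample))
```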

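Step 3: Scrape the tender detail pages

Each URL obtained in step 2 is fetched directly with a GET request in this step. BeautifulSoup then parses out the fields we need, and those fields are written to an Excel spreadsheet.

The complete code is as follows: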
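The field extraction can be tried offline first. The HTML fragment below is hypothetical: only the tag and class names are taken from the scraping code in this article, and the text values are invented. As a variation, this sketch uses BeautifulSoup's select_one with CSS selectors rather than chained find calls, and wraps it in a hypothetical helper, text_of, that returns an empty string when a node is missing instead of raising.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only the tag and class names come from the article's code
html = """
<div class="dm-main">
  <h6 class="title"> Demo tender </h6>
  <div class="cw-top">
    <span class="address">Demo company</span>
    <span class="pubtime">2018-09-21</span>
  </div>
</div>
"""


def text_of(soup, selector):
    """Return the stripped text of the first match, or '' if nothing matches."""
    node = soup.select_one(selector)
    return node.get_text().strip() if node else ""


soup = BeautifulSoup(html, "html.parser")
row = {
    "project_name": text_of(soup, "div.dm-main h6.title"),
    "project_company": text_of(soup, "div.dm-main div.cw-top span.address"),
    "pubtime": text_of(soup, "div.dm-main div.cw-top span.pubtime"),
    "board_link_text": text_of(soup, "div.cw-reftext a"),  # absent in this fragment
}
print(row)
```

With a helper like this, a page that lacks one element yields a blank cell rather than an exception that discards the whole row.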

from bs4 import BeautifulSoup
import requests
import json
import time
from xlutils.copy import copy
from xlrd import open_workbook


login_session = requests.Session()
# Build the request headers to match what the browser sends
headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36',
        'Referer': 'http://www.huobiao.cn/search?word=&block=1',
    }


def login():
    data = {
        'phone': 'xxxxxxxx',
        'password': 'xxxxxxx',
        'check': 'on',
    }
    # Build the request the server expects; the resulting session cookie is kept in login_session
    login_session.post('http://www.huobiao.cn/do_login', data=data, verify=False, headers=headers)


def search_list(page):
    search = {
        "type": "0",
        "time": str(int(time.time())),
        "province": "全國",
        "start": "",
        "end": "",
        "category": "",
        "item": "0",
        "page": page+1,
        "word": "",
        "project_type": "",
        "tap": 1,
    }
    return search


def home():
    for page in range(5):
        # Once logged in, the browser requests the list data from this URL
        response = login_session.post('http://www.huobiao.cn/do_search', data=search_list(page), headers=headers, allow_redirects=False)
        # The body is JSON delivered as bytes; decode it as UTF-8 first
        answer = str(response.content, encoding="utf-8")
        # Then parse the JSON into a dict for easier handling
        diction = json.loads(s=answer)
        for company in diction['ret_data']['list']:
            try:
                # Get each announcement's page URL; it is relative, so prefix it with the site root
                company_url = company['href']
                company_url = 'http://www.huobiao.cn'+company_url
                # Reuse the session obtained at login so the request stays authenticated
                company_detail = login_session.get(company_url)
                soup = BeautifulSoup(company_detail.content, 'html.parser')
                row = {}
                row['project_name'] = soup.find('div', class_='dm-main').find('h6', class_='title').get_text().strip()
                row['project_company'] = soup.find('div', class_='dm-main').find('div', class_='cw-top').find('span', class_='address').get_text().strip()
                row['pubtime'] = soup.find('div', class_='dm-main').find('div', class_='cw-top').find('span', class_='pubtime').get_text().strip()
                row['board_content'] = soup.find('div', class_='dm-content').find('div', class_='cw-reftext').find('a')['href']

                table = soup.find('table')
                t_r0 = table.findAll('tr')[0]
                # Announcements come in two layouts, which need separate handling
                row['G'] = t_r0.findAll('td')[0].get_text().strip('\t')
                # Pages with a proper table have a date starting with '2' in the first cell
                if row['G'][0] == '2':
                    row['main_content'] = ''
                    for p in soup.find('div', class_='dm-content').find('div', class_='cw-maincontent').findAll('p'):
                        row['main_content'] += p.get_text()

                    row['H'] = t_r0.findAll('td')[1].get_text().strip()
                    t_r1 = table.findAll('tr')[1]
                    row['I'] = t_r1.findAll('td')[0].get_text().strip()
                    row['J'] = t_r1.findAll('td')[1].get_text().strip()
                    t_r2 = table.findAll('tr')[2]
                    row['K'] = t_r2.findAll('td')[0].get_text()
                    row['L'] = t_r2.findAll('td')[1].get_text()
                else:
                    # Without a proper table, grab the entire text of the main content
                    row['main_content'] = soup.find('div', class_='dm-content').find('div', class_='cw-maincontent').get_text()
                    row['G'] = 'none'
                    row['H'] = 'none'
                    row['I'] = 'none'
                    row['J'] = 'none'
                    row['K'] = 'none'
                    row['L'] = 'none'
            except Exception:
                continue
            else:
                rexcel = open_workbook("deal.xls")  # read the existing Excel file with xlrd
                global_row = rexcel.sheets()[0].nrows  # number of rows already present, via xlrd
                excel = copy(rexcel)  # xlutils' copy turns the xlrd object into an xlwt object

                table = excel.get_sheet(0)  # get the sheet to write to from the xlwt object

                table.write(global_row, 0, row['project_name'])
                table.write(global_row, 1, row['project_company'])
                table.write(global_row, 2, row['pubtime'])
                table.write(global_row, 3, row['board_content'])
                table.write(global_row, 4, row['main_content'])
                table.write(global_row, 5, row['G'])
                table.write(global_row, 6, row['H'])
                table.write(global_row, 7, row['I'])
                table.write(global_row, 8, row['J'])
                table.write(global_row, 9, row['K'])
                table.write(global_row, 10, row['L'])

                global_row += 1
                excel.save('deal.xls')
                print(company_url)
                print(row)


if __name__ == '__main__':
    login()  # handle the login
    home()   # fetch each announcement's page and process them one by one

All done!
