
A beginner's Python 3 crawler program

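For friends who know a little Python 3 and a little about web pages, and no more. Experts and passers-by, feel free to skip this.

(This post has a lot of rambling. It's my first long write-up online and I got too excited to stop. If any experts do pass by, please don't just skip it; I'd appreciate a pointer or two.)

Thanks to 部落格園 for giving me the chance: the id I like had not been registered yet, which is honestly scary.

Note: this paragraph is rambling, so skip it to get to the main text. In my second year of university I taught myself some Python as a hobby (mainly because the syntax is clean and tidy and there are no curly braces. Code alignment? Java code has to be aligned anyway~). Luckily py3 had already caught on by the time I started, so I never learned py2 and was spared a good deal of trouble.

Background for this code: back in my second year of high school I was browsing scam websites and, out of curiosity, left my deskmate's phone number on one. About half a year later my deskmate was still getting one scam call after another. The original site can no longer be found, so I searched Baidu for the keyword "牛股" ("bull stocks") and went through the results one by one. Only one site let you enter a phone number; the rest asked you to add them on WeChat. I had thought about pranking our teacher, but decided against it. I might as well stuff some junk into their database instead, though I can't actually confirm whether it worked.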

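  • Environment
  • Windows 10 1803, 64-bit
  • Chrome 68.0.3440.106 (official build) (64-bit)
  • pycharm-UI (PyCharm Professional) 2018.2
  • python-365
  • Libraries (install the non-built-in ones directly with pip):
    • pymysql: import pymysql
    • requests: import requests
    • json (built in): import json
    • Faker: from faker import Faker

First, pick a target.

1. First of all, capture the POST/GET address.

After opening the home page, click "點選領取9月牛股" ("Click to claim September's bull stocks"). When the dialog pops up, press F12 to open the developer tools and select "Network". Then click the claim button on the page, and a new entry appears in the Network panel. Extract the data we need, put it into PyCharm, and tidy it into a JSON-like format.

2. That gives us the following data: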

url = "https://download.zslxt.com/tinterface.php"
headers = {
    "Host": "download.zslxt.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Accept": "*/*",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "http://gpyd.gp241.com/nyqpc/bd2.html?id=20110052",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Content-Length": "107",
    "Origin": "http://gpyd.gp241.com",
    "Connection": "keep-alive"
}
data = {
    "bm": "gbk",
    "gpdm": "",
    "id": "20110052",
    "phone": "15666668888",
    "qudao": 98,
    "remarks": "牛有圈百度2)"
}

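Looking at the parameters in data, it should be clear that phone is the number we just entered. The other fields can be left alone. I switched browsers and they did not change; I don't know where they come from, possibly something to do with Baidu's ad promotion.

With that in hand, we can send one record to the site.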

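First we use the requests library. I won't introduce it here; it can issue GET/POST and other requests, can keep a login alive with a session, and is widely useful.

One thing to watch: the captured Content-Type is application/x-www-form-urlencoded, so the data dict should be passed straight to data= and requests will form-encode it. Wrapping it in json.dumps() would send a JSON string instead, which does not match the declared content type.

requests.post() returns a Response object. This snippet does not use it, but don't be misled into thinking nothing comes back.

import requests

url = "https://download.zslxt.com/tinterface.php"
# Headers copied from the captured request. Content-Length is left out:
# requests computes it from the body that is actually sent.
headers = {
    "Host": "download.zslxt.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Accept": "*/*",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "http://gpyd.gp241.com/nyqpc/bd2.html?id=20110052",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Origin": "http://gpyd.gp241.com",
    "Connection": "keep-alive"
}
data = {
    "bm": "gbk",
    "gpdm": "",
    "id": "20110052",
    "phone": "15666668888",
    "qudao": 98,
    "remarks": "牛有圈百度2)"
}

# Pass the dict directly; requests form-encodes it to match the Content-Type.
requests.post(url=url, headers=headers, data=data)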
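
To check what requests puts on the wire for each way of passing the body, a request can be prepared without being sent. The following is a minimal sketch: the `https://example.com/submit` URL and the trimmed-down `payload` are placeholders for illustration, not part of the captured request.

```python
import json

import requests

# Placeholder URL and a trimmed payload, used only for illustration.
url = "https://example.com/submit"
payload = {"phone": "15666668888", "qudao": 98}

# A dict passed to data= is form-encoded, and the matching header is set.
form_req = requests.Request("POST", url, data=payload).prepare()

# A string passed to data= is sent verbatim, with no Content-Type added.
raw_req = requests.Request("POST", url, data=json.dumps(payload)).prepare()

# json= serializes the dict and declares application/json.
json_req = requests.Request("POST", url, json=payload).prepare()

# Nothing is sent; we only inspect the prepared headers and bodies.
for name, req in [("data=dict", form_req), ("data=str", raw_req), ("json=dict", json_req)]:
    print(name, "|", req.headers.get("Content-Type"), "|", req.body)
```

Only the first form produces a body that agrees with the form-urlencoded Content-Type seen in the captured request. The loop also shows that requests fills in Content-Length on its own if you print `req.headers`.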

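But surely we can't send just once.

Let me introduce the "fakest" library in Python: Faker.

Its ability to fabricate data is surprisingly powerful; take a look if you're interested.

This example needs only two features: generating a random user-agent and a random phone number. (This site doesn't even need a random user-agent: I submitted about two thousand records without any proxy IP and was never blocked.)

With this, our code has learned a little about disguise. (A digression: the other day on some forum I saw someone ask why their account got banned even though they kept switching proxies. I was on my phone and had no account on that forum, so I couldn't reply, but one account constantly changing IPs is not normal behavior.) The code now looks like this: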

import requests
from faker import Faker

f = Faker(locale="zh-CN")


# Generate a random user-agent and phone number for this request.
user_agent = f.user_agent()
phone = f.phone_number()
url = "https://download.zslxt.com/tinterface.php"
# Content-Length is left out: requests computes it from the body.
headers = {
    "Host": "download.zslxt.com",
    "User-Agent": user_agent,
    "Accept": "*/*",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "http://gpyd.gp241.com/nyqpc/bd2.html?id=20110052",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Origin": "http://gpyd.gp241.com",
    "Connection": "keep-alive"
}
data = {
    "bm": "gbk",
    "gpdm": "",
    "id": "20110052",
    "phone": phone,
    "qudao": 98,
    "remarks": "牛有圈百度2)"
}

# Pass the dict directly; requests form-encodes it to match the Content-Type.
req = requests.post(url=url, headers=headers, data=data)

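We're almost done. To make it easy to send data in a loop, let's wrap it in a function:

When I first started learning Python I really disliked writing functions. If two lines of code get the job done, wrapping them in a function felt like padding the line count for no benefit. But today I read a story:

To detect empty milk cartons, a postdoc and a farmer solved the problem in two different ways: one invented a machine, the other used a fan. Often, when we are learning something new, the problems we meet can be solved with methods we already know. As the problems get deeper, though, sometimes only the newly learned knowledge will do. Writing functions (or wrapping things in classes) is never a mistake, provided the code is for practice and not for an emergency.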

import requests
from faker import Faker

f = Faker(locale="zh-CN")


def duang():
    # Generate a random user-agent and phone number for this request.
    user_agent = f.user_agent()
    phone = f.phone_number()
    url = "https://download.zslxt.com/tinterface.php"
    # Content-Length is left out: requests computes it from the body.
    headers = {
        "Host": "download.zslxt.com",
        "User-Agent": user_agent,
        "Accept": "*/*",
        "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "http://gpyd.gp241.com/nyqpc/bd2.html?id=20110052",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Origin": "http://gpyd.gp241.com",
        "Connection": "keep-alive"
    }
    data = {
        "bm": "gbk",
        "gpdm": "",
        "id": "20110052",
        "phone": phone,
        "qudao": 98,
        "remarks": "牛有圈百度2)"
    }

    # Pass the dict directly; requests form-encodes it to match the Content-Type.
    req = requests.post(url=url, headers=headers, data=data)
    return user_agent, phone, req

Now it is easy to call. Let's write a main block to call it:

import requests
from faker import Faker

f = Faker(locale="zh-CN")


def duang():
    # Generate a random user-agent and phone number for this request.
    user_agent = f.user_agent()
    phone = f.phone_number()
    url = "https://download.zslxt.com/tinterface.php"
    # Content-Length is left out: requests computes it from the body.
    headers = {
        "Host": "download.zslxt.com",
        "User-Agent": user_agent,
        "Accept": "*/*",
        "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "http://gpyd.gp241.com/nyqpc/bd2.html?id=20110052",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Origin": "http://gpyd.gp241.com",
        "Connection": "keep-alive"
    }
    data = {
        "bm": "gbk",
        "gpdm": "",
        "id": "20110052",
        "phone": phone,
        "qudao": 98,
        "remarks": "牛有圈百度2)"
    }

    # Pass the dict directly; requests form-encodes it to match the Content-Type.
    req = requests.post(url=url, headers=headers, data=data)
    return user_agent, phone, req


if __name__ == '__main__':
    for i in range(100000):
        user_agent, phone, req = duang()
        # Print the iteration number, phone, HTTP status, and user-agent.
        print(i, '\t', phone, '\t', req.status_code, '\n', user_agent)

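This just prints some information. The network dropped while I was out for a meal, so it only ran a little over 3000 iterations, and I won't post a screenshot. (If anyone is using this for practice, try improving the code yourself so that after a network drop it waits and then continues.)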
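
The wait-and-continue exercise could be approached with a small retry wrapper. The sketch below is one possible shape, not the code I ran: `call_with_retry`, its parameters, and the `flaky` stand-in function are all hypothetical names. For a real requests call, the exception type to pass would be `requests.exceptions.ConnectionError`.

```python
import time


def call_with_retry(func, exceptions, attempts=5, delay=1.0):
    """Call func(); on one of `exceptions`, sleep and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay * attempt)  # wait a little longer each time


# Stand-in for a network call: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network drop")
    return "ok"


result = call_with_retry(flaky, (ConnectionError,), attempts=5, delay=0.01)
print(result, "after", calls["count"], "calls")
```

Keeping the exception types as a parameter means the wrapper itself does not depend on any particular HTTP library.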

Here is the full code (it also writes to MySQL):

import pymysql
import requests
from faker import Faker

f = Faker(locale="zh-CN")


def duang():
    # Generate a random user-agent and phone number for this request.
    user_agent = f.user_agent()
    phone = f.phone_number()
    url = "https://download.zslxt.com/tinterface.php"
    # Content-Length is left out: requests computes it from the body.
    headers = {
        "Host": "download.zslxt.com",
        "User-Agent": user_agent,
        "Accept": "*/*",
        "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "http://gpyd.gp241.com/nyqpc/bd2.html?id=20110052",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Origin": "http://gpyd.gp241.com",
        "Connection": "keep-alive"
    }
    data = {
        "bm": "gbk",
        "gpdm": "",
        "id": "20110052",
        "phone": phone,
        "qudao": 98,
        "remarks": "牛有圈百度2)"
    }

    # Pass the dict directly; requests form-encodes it to match the Content-Type.
    req = requests.post(url=url, headers=headers, data=data)
    return user_agent, phone, req.status_code


if __name__ == '__main__':
    # Open one connection for the whole run instead of one per iteration.
    db = pymysql.connect(host="localhost", user="root", password="xiaoyan", database="python")
    try:
        with db.cursor() as cur:
            for i in range(100000):
                user_agent, phone, status_code = duang()
                # Let the driver quote the values; a user-agent may contain quotes.
                cur.execute(
                    "INSERT INTO python1duang VALUES (default, %s, %s, %s)",
                    (user_agent, phone, status_code),
                )
                db.commit()
                print(i, '\t', phone, '\t', status_code, '\n', user_agent)
    finally:
        # Close the connection even if the loop stops with an error.
        db.close()
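
Whichever driver is used, passing values as query parameters instead of formatting them into the SQL string keeps a stray quote in a user-agent from breaking the statement. The sketch below shows the idea with the standard-library `sqlite3` module so that it runs without a MySQL server; the `log` table and the sample row are made up for illustration. Note that `sqlite3` uses `?` as its placeholder, whereas pymysql uses `%s`.

```python
import sqlite3

# In-memory database; the table layout here is made up for illustration.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE log (id INTEGER PRIMARY KEY, user_agent TEXT, phone TEXT, status_code INTEGER)"
)

# A value containing a single quote, which would break a hand-formatted statement.
row = ("Mozilla/5.0 (it's a test)", "15666668888", 200)

# Values travel separately from the SQL text, so no manual quoting is needed.
db.execute("INSERT INTO log (user_agent, phone, status_code) VALUES (?, ?, ?)", row)
db.commit()

stored = db.execute("SELECT user_agent, phone, status_code FROM log").fetchall()
print(stored)
db.close()
```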