Building Your Own IP Proxy Pool in Python: Study Notes

阿新 • Published: 2019-01-13

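I recently worked through a beginner crawler that scrapes novels, and felt that wasn't enough. Often, just disguising yourself as a browser doesn't solve the problem; you also have to deal with the site banning your IP. So I went on to learn how to switch IPs while crawling. To have enough IPs to rotate through, you first need an IP proxy pool of your own, so let's build one.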
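To make "switching IPs" concrete, here is a minimal sketch of picking a random proxy from a list and handing it to requests. The helper names `to_proxies`, `pick_proxy` and `fetch_via_proxy`, the sample list, and the 203.0.113.x addresses (a range reserved for documentation) are all assumptions for illustration, not working proxies. The `proxies` argument itself is the standard requests parameter.

```python
import random

# Hypothetical pool of "ip:port" strings. 203.0.113.0/24 is reserved for
# documentation, so these are placeholders, not working proxies.
SAMPLE_POOL = ['203.0.113.5:8080', '203.0.113.6:3128', '203.0.113.7:80']


def to_proxies(address):
    """Build the proxies mapping that requests expects for one "ip:port"."""
    proxy_url = 'http://' + address
    return {'http': proxy_url, 'https': proxy_url}


def pick_proxy(pool):
    """Choose one proxy address at random from the pool."""
    return random.choice(pool)


def fetch_via_proxy(url, pool, timeout=10):
    """Fetch url through a randomly chosen proxy (illustrative; needs network)."""
    import requests  # imported lazily so the helpers above work without it
    return requests.get(url, proxies=to_proxies(pick_proxy(pool)), timeout=timeout)


if __name__ == '__main__':
    print(to_proxies(pick_proxy(SAMPLE_POOL)))
```

Each call to `fetch_via_proxy` may go out through a different address, which is the whole point of keeping many proxies around.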

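What is an IP proxy pool? It is simply a large collection of proxy IP addresses kept in one place, enough for you to rotate through. So where do we get that many IPs? Plenty of people before us have already solved this. I'm only a beginner, so I used the site https://www.xicidaili.com/nn/1 and scraped the proxy IPs from it the same way I scraped novels. Since I don't know how to use a database yet, we'll store each proxy as a dictionary in a text file for now. It looks easy, but it took this newbie a long time, and I ran into quite a few problems: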

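When scraping IPs from Xici, every entry on every page carries a lot of information: IP address, port, location, speed, and so on. Using regular expressions the way you would for novels is a problem here. There is so much information that a single pattern needs ''' ''' to hold it, and one small slip means nothing gets extracted at all. Only later did I learn that the extraction is done with XPath, which comes from the lxml library (bs4 on its own does not support XPath, and the code below uses only lxml). When I imported it, Python complained that the package wasn't there, so it had to be installed first and then imported. These steps are all covered online, so I won't go into detail. Anyway, it took me a long time to sort out.
While studying, many people will tell you that a lot of these IPs don't work and have to be checked yourself, so the next thing you need is a check for whether an IP is actually usable.
While scraping IP addresses I also ran into garbled HTML. The fix is to set the response encoding to 'gbk' (response.encoding = 'gbk'), although Xici itself doesn't have this problem.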
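The garbled-text problem comes from decoding the page bytes with the wrong charset. Here is a small sketch, assuming a page that is served as GBK. The helper `decode_html` and its fallback order are my own illustration, not something requests provides. With requests, the equivalent is setting `response.encoding = 'gbk'` before reading `response.text`.

```python
def decode_html(raw, encodings=('utf-8', 'gbk')):
    """Try each encoding in turn and return the first clean decode."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters instead of failing.
    return raw.decode(encodings[0], errors='replace')


if __name__ == '__main__':
    page = '代码'.encode('gbk')  # bytes as a GBK-encoded site would send them
    print(page.decode('utf-8', errors='replace'))  # garbled
    print(decode_html(page))  # readable again
```

UTF-8 is tried first on purpose: GBK bytes usually fail strict UTF-8 decoding, whereas UTF-8 bytes can often be decoded as GBK without an error and silently turn into nonsense.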

References: "How to easily build a free IP proxy pool"; "Fixing garbled page content when scraping with requests in Python"

Code:

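import re
import random
import socket

import requests
from lxml import etree


def check(ip, port):
    """Return True if a TCP connection to ip:port opens within 20 seconds."""
    # telnetlib was removed from the standard library in Python 3.13,
    # so a plain socket connection is used for the reachability check.
    try:
        with socket.create_connection((ip, int(port)), timeout=20):
            return True
    except (OSError, ValueError):
        return False

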
send_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8",
}  # base headers that make the request look like a browser


def get_agent():
    """Return browser-like headers with a randomly chosen User-Agent."""
    agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0',
    ]  # pick one of several simulated browsers at random
    fakeheader = dict(send_headers)
    fakeheader['User-Agent'] = random.choice(agents)
    return fakeheader


ROW = '//tr[@class="odd" or @class=""]'  # data rows of the proxy table


def get_iplist(url, max_page, out_path='porxy_ip.csv'):
    """Scrape max_page pages, keep proxies that pass check(), and save them."""
    ip_list = {}
    with open(out_path, 'w', encoding='utf-8') as fp:
        for page in range(1, max_page + 1):  # crawl several pages of proxies
            response = requests.get(url + str(page), headers=get_agent(), timeout=10)
            response.encoding = 'utf-8'
            info = etree.HTML(response.text)
            address = info.xpath(ROW + '/td[2]/text()')  # IP address
            ports = info.xpath(ROW + '/td[3]/text()')  # port
            anonymous = info.xpath(ROW + '/td[5]/text()')  # anonymity level
            http_https = info.xpath(ROW + '/td[6]/text()')  # http or https
            speed = info.xpath(ROW + '/td[7]/div[1]/@title')  # speed
            speed_width = info.xpath(ROW + '/td[7]/div[1]/div/@style')  # speed bar width
            conn_time = info.xpath(ROW + '/td[8]/div[1]/@title')  # connection time
            conn_time_width = info.xpath(ROW + '/td[8]/div[1]/div/@style')  # connection-time bar width
            life = info.xpath(ROW + '/td[9]/text()')  # time alive
            test = info.xpath(ROW + '/td[10]/text()')  # time last verified
            for i in range(len(address)):
                if not check(address[i], ports[i]):
                    continue
                # The site's text uses 秒 ("seconds"), so the patterns keep it.
                record = {
                    'IP address': address[i] + ':' + ports[i],
                    'anonymity': anonymous[i],
                    'type': http_https[i],
                    'speed': float(re.search(r'(.*?)秒', speed[i]).group(1)),
                    'speed ratio': float(re.search(r'width:(.*?)%', speed_width[i]).group(1)),
                    'connect time': float(re.search(r'(.*?)秒', conn_time[i]).group(1)),
                    'connect time ratio': float(re.search(r'width:(.*?)%', conn_time_width[i]).group(1)),
                    # numeric part only; whatever unit follows it is dropped
                    'time alive': int(re.search(r'(\d+)', life[i]).group(1)),
                    'verified at': test[i],
                }
                try:
                    fp.write(str(record) + '\n')
                except OSError:
                    print('Write error!')
                    continue
                ip_list[str(len(ip_list) + 1)] = record
                print('Wrote ip and port: %s' % record['IP address'])
    return ip_list


ip_url = 'https://www.xicidaili.com/nn/'
max_page = 1
ip = get_iplist(ip_url, max_page)
print('Wrote %d usable ip addresses' % len(ip))
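Since the script saves one `str(dict)` per line, the file can be read back later without a database. This is a minimal sketch, assuming the file was written in exactly that one-dict-per-line form. The helper name `load_pool` is hypothetical, and `ast.literal_eval` is used so that only plain Python literals are accepted.

```python
import ast


def load_pool(path='porxy_ip.csv'):
    """Read one dict literal per line, skipping blank or malformed lines."""
    records = []
    with open(path, encoding='utf-8') as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue
            try:
                record = ast.literal_eval(line)
            except (ValueError, SyntaxError):
                continue  # not a valid Python literal, ignore it
            if isinstance(record, dict):
                records.append(record)
    return records
```

Calling `load_pool()` after the script has run should give back a list of the saved proxy dictionaries, ready to be picked from at random.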
