
Python Web Crawler Basics (Part 1)

I've been learning Python recently and took the opportunity to explore web crawling. Here are the basics I've organized (based on Python 2.7):

Three ways to fetch page data:

# encoding=utf-8

import urllib2

def download1(url):
    # read() returns the whole response by default;
    # read(100) would return only the first 100 bytes
    return urllib2.urlopen(url).read()

def download2(url):
    # readlines() returns the response as a list of lines
    return urllib2.urlopen(url).readlines()

def download3(url):
    # read and print the response line by line
    response = urllib2.urlopen(url)
    while True:
        line = response.readline()
        if not line:
            break
        print line

url = "http://www.baidu.com"
download3(url)

These are based on urllib2 and are fairly straightforward.

Impersonating a browser

Many sites now use anti-crawling measures to keep their data from being scraped. From what I've learned so far, there are two ways to keep a crawler working in that situation: add a randomized header, or simulate a browser with a framework; the two amount to much the same idea.
Adding a random header:

import urllib2

def download(url):
    # header = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)"}
    header = {"User-Agent": "UCWEB7.0.2.37/28/999"}
    request = urllib2.Request(url=url, headers=header)
    # add another custom header
    request.add_header("name", "zhangsan")
    # open the request
    response = urllib2.urlopen(request)
    print "result:" + str(response.code)
    print response.read()

download("http://www.baidu.com")

We can usually select the header at random; the snippets above simulate IE and the mobile UC browser respectively. There are plenty of User-Agent strings available online to sample from. Below is a set excerpted from the web that you can use; a random-selection sketch follows the two dictionaries:

pcUserAgent = {
"safari 5.1 – MAC":"User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"safari 5.1 – Windows":"User-Agent:Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"IE 9.0":"User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
"IE 8.0":"User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"IE 7.0":"User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"IE 6.0":"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Firefox 4.0.1 – MAC":"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Firefox 4.0.1 – Windows":"User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera 11.11 – MAC":"User-Agent:Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera 11.11 – Windows":"User-Agent:Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Chrome 17.0 – MAC":"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Maxthon":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Tencent TT":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"The World 2.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"The World 3.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"sogou 1.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"360":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Avant":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Green Browser":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
}

mobileUserAgent = {
"iOS 4.33 – iPhone":"User-Agent:Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"iOS 4.33 – iPod Touch":"User-Agent:Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"iOS 4.33 – iPad":"User-Agent:Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Android N1":"User-Agent: Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Android QQ":"User-Agent: MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Android Opera ":"User-Agent: Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
"Android Pad Moto Xoom":"User-Agent: Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"BlackBerry":"User-Agent: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
"WebOS HP Touchpad":"User-Agent: Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
"Nokia N97":"User-Agent: Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
"Windows Phone Mango":"User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
"UC":"User-Agent: UCWEB7.0.2.37/28/999",
"UC standard":"User-Agent: NOKIA5700/ UCWEB7.0.2.37/28/999",
"UCOpenwave":"User-Agent: Openwave/ UCWEB7.0.2.37/28/999",
"UC Opera":"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
}
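As mentioned above, here is a minimal random-selection sketch, assuming the two dictionaries above are in scope. Note that these excerpted values embed a leading "User-Agent:" label, which should be stripped before use:

import random
import urllib2

def download_random_ua(url):
    # pool desktop and mobile agents and pick one at random
    agents = pcUserAgent.values() + mobileUserAgent.values()
    ua = random.choice(agents)
    # the excerpted strings embed a "User-Agent:" label; strip it off
    if ua.startswith("User-Agent"):
        ua = ua.split(":", 1)[1].strip()
    request = urllib2.Request(url, headers={"User-Agent": ua})
    return urllib2.urlopen(request).read()

print download_random_ua("http://www.baidu.com")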

For the second approach, we can use the selenium testing framework to drive a browser remotely. Minimal code looks like this:

import selenium            # testing framework
import selenium.webdriver  # drives a real browser

def getJobNumberByName(name):
    target_url = "http://www.baidu.com"
    driver = selenium.webdriver.Chrome()  # launch a simulated browser session
    driver.get(target_url)                # visit the link
    page_source = driver.page_source      # grab the page source
    print page_source

selenium (see its project documentation for an introduction) invokes the browser driver installed on the OS; without the corresponding driver configured, it fails with an error saying the driver executable cannot be found. If you use webdriver.Chrome(), you need a chromedriver: download and unpack it, note its path, and change the code to:

driver = selenium.webdriver.Chrome(chromedriver_path)  # launch Chrome via an explicit driver path
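Putting it together, a fuller sketch (the path below is a placeholder for wherever you unpacked chromedriver):

import selenium.webdriver

chromedriver_path = "/path/to/chromedriver"  # placeholder: your unpacked chromedriver
driver = selenium.webdriver.Chrome(chromedriver_path)
driver.get("http://www.baidu.com")
print driver.page_source
driver.quit()  # close the browser when done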

Unifying encoding

This mainly matters for transmitting Chinese text: if Chinese is sent without encoding, the server receives mojibake. Encode it like this:

# encoding=utf-8
import urllib

words = {"name": "zhangsan", "address": "上海"}

print urllib.urlencode(words)                  # URL-encode the dict
print urllib.unquote(urllib.urlencode(words))  # URL-decode it back
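For a single Chinese value, urllib.quote/unquote does the same round trip; a quick sketch:

# encoding=utf-8
import urllib

city = "上海"  # a UTF-8 byte string under the encoding declared above
encoded = urllib.quote(city)
print encoded                  # %E4%B8%8A%E6%B5%B7
print urllib.unquote(encoded)  # prints the original characters back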

GET/POST requests

GET and POST differ mainly in how parameters are passed: GET appends them to the URL, while POST carries them in the request body.
Build a simple server with Python's flask framework:

from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello World!'

@app.route("/login", methods=["POST"])
def login():
    name = request.form.to_dict().get("name", "")
    age = request.form.to_dict().get("age", "")
    return name + "-------" + age

@app.route("/query", methods=["GET"])
def query():
    age = request.args.get("age", "")
    return "this age is " + age


if __name__ == '__main__':
    app.run(
        "127.0.0.1",
        port=8090
    )

The GET request is then:

import urllib
import urllib2

words = {"age" : "23"}
request = urllib2.Request(url="http://127.0.0.1:8090/query?" + urllib.urlencode(words))
response = urllib2.urlopen(request)

print response.read()

And the POST request:

# encoding=utf-8
import urllib
import urllib2

info = {"name": "Tom張", "age": "20"}
info = urllib.urlencode(info)  # the POST body also needs URL encoding

request = urllib2.Request("http://127.0.0.1:8090/login")
request.add_data(info)
response = urllib2.urlopen(request)

print response.read()

Downloading images

import urllib

# urlretrieve(remote_image_url, local_save_path)
urllib.urlretrieve(image_url, save_path)
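urlretrieve also takes an optional reporthook callback that fires as blocks arrive, which is handy for progress display. A sketch (the URL and filename below are placeholders):

import urllib

def progress(block_count, block_size, total_size):
    # called periodically while the file downloads
    if total_size > 0:
        percent = min(100, block_count * block_size * 100 / total_size)
        print "downloaded %d%%" % percent

urllib.urlretrieve("http://example.com/pic.jpg", "pic.jpg", progress)  # placeholder URL/path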

Proxies

When many crawlers share a single IP and that IP gets banned, the crawler stops working altogether. This has spawned a whole ecosystem: search keywords like "vps" on Taobao and you will find all kinds of commercial proxy services.

Of course, we can also use free proxies [list checked 2018-04-21 16:14]:

https://www.kuaidaili.com/free/  ## Kuaidaili
http://www.xicidaili.com/        ## Xici proxies


Using a proxy from Python:

import urllib2

http_proxy = urllib2.ProxyHandler({"http":"117.90.3.126:9000"})  # proxy IP and port
opener     = urllib2.build_opener(http_proxy)
request    = urllib2.Request("http://www.baidu.com")
response   = opener.open(request)

print response.read()
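To rotate proxies rather than hard-code one, build the opener per request from a randomly chosen address. A sketch; the pool below is a placeholder you would fill from the free lists above:

import random
import urllib2

proxy_pool = ["117.90.3.126:9000", "118.24.61.22:8118"]  # placeholder addresses

def open_with_random_proxy(url):
    proxy = random.choice(proxy_pool)  # pick a proxy at random
    opener = urllib2.build_opener(urllib2.ProxyHandler({"http": proxy}))
    return opener.open(url).read()

print open_with_random_proxy("http://www.baidu.com")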

Redirects

1. Check whether a URL gets redirected:

import urllib2

# check whether the URL was redirected
def url_is_redirect(url):
    response = urllib2.urlopen(url)
    return response.geturl() != url

print url_is_redirect("http://www.baidu.cn")

2. If it is redirected, we need to grab the new address:

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        res = urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
        res.status = code             # the returned status code
        res.newurl = res.geturl()     # the current (post-redirect) URL
        print res.newurl, res.status  # inspect the redirect target
        return res

opener = urllib2.build_opener(RedirectHandler)
opener.open("http://www.baidu.cn")

Cookies

Fetching pages that share session state requires cookies.
1. Getting cookies:

# encoding=utf-8
import urllib2
import cookielib

# create a cookie jar to hold the cookies
cookie = cookielib.CookieJar()
# handler that captures cookies from responses
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener around the handler
opener = urllib2.build_opener(handler)

response = opener.open("http://www.baidu.com")

for data in cookie:
    print data.name + "--" + data.value

The result looks like:

BAIDUID--2643F48FC95482FF4ECAD2EBC7DBE11E:FG=1
BIDUPSID--2643F48FC95482FF4ECAD2EBC7DBE11E
H_PS_PSSID--1466_21088_18560_22158
PSTM--1524360190
BDSVRTM--0
BD_HOME--0

2. Persisting cookies to a file:

# encoding=utf-8

import urllib2
import cookielib

file_path = "cookie.txt"
cookie = cookielib.LWPCookieJar(file_path)     # jar bound to a file path
handler = urllib2.HTTPCookieProcessor(cookie)  # handler that captures cookies
opener = urllib2.build_opener(handler)
response = opener.open("http://www.baidu.com")

cookie.save(ignore_expires=True, ignore_discard=True)  # write the cookies to disk

After this runs, our cookies are saved to cookie.txt.
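To reuse those cookies on a later run, load the file back into an LWPCookieJar before building the opener; a minimal sketch:

# encoding=utf-8
import urllib2
import cookielib

cookie = cookielib.LWPCookieJar()
cookie.load("cookie.txt", ignore_discard=True, ignore_expires=True)  # read saved cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open("http://www.baidu.com")  # this request now carries the saved cookies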

That covers most of the basics; I'll keep updating with the rest over time.