Django Crawler Basics and an Analysis of Request and Response

阿新 • Published: 2018-07-01


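I. Crawlers

The Internet is made up of network devices (cables, routers, switches, firewalls, and so on) and individual computers, all connected together like a web.

The core value of the Internet lies in sharing and transferring data. Data is stored on individual computers, and the point of connecting those computers is to make it convenient to share and transfer data between them; otherwise the only option would be to copy data around on USB drives.

Data is the most valuable thing on the Internet. A crawler imitates a browser: it sends a request -> downloads the page source -> extracts only the useful data -> stores it in a database or a file, which gives us the data we need.

A crawler is a program that sends requests to a website, fetches resources, and then parses them to extract useful data.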

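II. Crawler Workflow

1. The Basic Workflow

Send a request -----> Get the response content -----> Parse the content -----> Save the data

#1. Send the request
Use an HTTP library to send a request to the target site, i.e. send a Request
A Request contains: request headers, a request body, etc.

#2. Get the response content
If the server responds normally, you get back a Response
A Response may contain: HTML, JSON, images, video, etc.

#3. Parse the content
Parsing HTML: regular expressions, or third-party parsing libraries such as BeautifulSoup and pyquery
Parsing JSON: the json module
Handling binary data: write it to a file opened in binary ('b') mode

#4. Save the data
Database
File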
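
The parse and save steps above can be sketched without touching the network. This is only an illustration: the sample HTML, JSON, and bytes below are made-up stand-ins for real response bodies, and the helper names (`parse_html`, `parse_json`, `save_binary`) are hypothetical.

```python
import json
import os
import re
import tempfile

# Hypothetical sample data standing in for three kinds of response bodies.
SAMPLE_HTML = '<a class="items" href="/p/1.html">one</a><a class="items" href="/p/2.html">two</a>'
SAMPLE_JSON = '{"name": "crawler", "pages": 2}'
SAMPLE_BYTES = b"\x89PNG fake image bytes"


def parse_html(html):
    """Extract every href value with a regular expression."""
    return re.findall(r'href="(.*?)"', html, re.S)


def parse_json(text):
    """Decode a JSON body into Python objects."""
    return json.loads(text)


def save_binary(data, path):
    """Write raw bytes to disk; the 'b' in the mode prevents any text decoding."""
    with open(path, "wb") as f:
        f.write(data)


links = parse_html(SAMPLE_HTML)
info = parse_json(SAMPLE_JSON)
out_path = os.path.join(tempfile.gettempdir(), "sample.bin")
save_binary(SAMPLE_BYTES, out_path)
print(links, info["pages"])
```

Regular expressions are enough for a fixed, simple page layout; for messier HTML a real parser such as BeautifulSoup is usually the safer choice.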

2. Request

Common request methods: GET, POST

Other request methods: HEAD, PUT, DELETE, OPTIONS

>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')

Fetching a Baidu search results page:

import requests
# Simulate the first page of a Baidu search; "wd" carries the search term.
# A custom User-Agent header helps get past the site's basic anti-crawler checks.
response=requests.get("https://www.baidu.com/s",
                      params={"wd":"美女","a":1},
                      headers={
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
                      })
print(response.status_code)
print(response.text)
# Save the downloaded page to bd.html; opened in a browser, it looks like the Baidu results page.
with open("bd.html","w",encoding="utf8") as f:
    f.write(response.text)
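
To see what the `params` argument amounts to, the sketch below builds the same query string with the standard library. It is an approximation of what an HTTP library does when it appends parameters to a URL, not a copy of its internals; with requests, the final URL can be inspected through `response.url`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://www.baidu.com/s"
params = {"wd": "美女", "a": 1}

# Non-ASCII values are percent-encoded as UTF-8 bytes.
query = urlencode(params)
full_url = base_url + "?" + query
print(full_url)

# Decoding the query string recovers the original values (as strings).
decoded = parse_qs(urlsplit(full_url).query)
```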

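3. Response

# 1. Response status

200: success

301: redirect (moved permanently)

404: resource not found

403: forbidden (no permission)

502: bad gateway (a server-side error)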
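
As a quick reference, the status codes listed above can be grouped by their first digit using the standard library. The `describe` helper is a hypothetical name used only for this sketch.

```python
from http import HTTPStatus


def describe(code):
    """Return (category, reason phrase) for an HTTP status code."""
    categories = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    category = categories.get(code // 100, "other")
    return category, HTTPStatus(code).phrase


summary = {code: describe(code) for code in (200, 301, 403, 404, 502)}
for code, (category, phrase) in summary.items():
    print(code, category, phrase)
```

A crawler typically branches on the category rather than on individual codes: retry on 5xx, give up or re-authenticate on 4xx.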

# 2. Response headers

Location: the redirect target

Set-Cookie: may appear more than once; it tells the browser to store the cookie
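
The following sketch shows, under simplified assumptions, how several Set-Cookie headers turn into the single Cookie header a client sends back. The header values are invented, and a real client also honours attributes such as Path, Domain, and expiry, which this example ignores.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header values, one per header line.
set_cookie_headers = [
    "sessionid=abc123; Path=/; HttpOnly",
    "theme=dark; Path=/; Max-Age=3600",
]

jar = {}
for header in set_cookie_headers:
    cookie = SimpleCookie()
    cookie.load(header)
    # Attributes such as Path are stored on the morsel, not as separate cookies.
    for name, morsel in cookie.items():
        jar[name] = morsel.value

# What the client would send on the next request.
cookie_header = "; ".join("%s=%s" % (k, v) for k, v in jar.items())
print(cookie_header)
```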

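# 3. Preview is the page source

This is the most important part; it contains the content of the requested resource,

such as the page HTML, images, or other binary data.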

# 4. Response attributes

import requests
response=requests.get('http://www.jianshu.com')
# Attributes of the response
print(response.text)                     # response body decoded to text
print(response.content)                  # response body as raw bytes (use this for images and video)
print(response.status_code)              # response status code
print(response.headers)                  # response headers

print(response.cookies)                  # cookies set by the response
print(response.cookies.get_dict())
print(response.cookies.items())

print(response.url)                      # final URL after any redirects
print(response.history)                  # redirect history (the page was not returned in a single hop)
print(response.encoding)                 # character encoding used to decode the body

# Closing the connection: response.close()
from contextlib import closing
with closing(requests.get('xxx',stream=True)) as response:
    for line in response.iter_content():
        pass
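
The relationship between `content`, `encoding`, and `text` can be illustrated with plain bytes. This is a rough model, assuming the body is UTF-8; the sample markup is invented.

```python
# Hypothetical response body: the bytes a server might send for a UTF-8 page.
content = "<title>爬蟲</title>".encode("utf-8")

# Decoding with the right encoding recovers the text.
text_ok = content.decode("utf-8")

# Decoding with the wrong encoding does not fail here, but yields mojibake:
# every byte becomes its own character.
text_bad = content.decode("latin-1")

print(len(content), len(text_ok), len(text_bad))
```

When a page shows garbled characters, setting the response's encoding to the correct value before reading the text is usually the fix.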

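# 5. Downloading large files

# The stream parameter fetches the body piece by piece; for very large resources, loading everything and writing it to a file in one go is not reasonable
import requests
response=requests.get('https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo-transcode/1767502_56ec685f9c7ec542eeaf6eac93a65dc7_6fe25cd1347c_3.mp4',
                      stream=True)
with open('b.mp4','wb') as f:
    for chunk in response.iter_content(chunk_size=8192):   # read the binary stream in 8 KB chunks (the default chunk_size is 1 byte)
        f.write(chunk)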
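
The chunked-copy idea behind streaming can be tried offline. In this sketch an in-memory buffer stands in for the network stream, and `iter_chunks` is a hypothetical helper, not part of any library.

```python
import io


def iter_chunks(stream, chunk_size=8192):
    """Yield successive chunks from a file-like object until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Hypothetical stand-in for a large response body: 20,000 bytes of data.
source = io.BytesIO(b"x" * 20000)
target = io.BytesIO()

chunk_sizes = []
for chunk in iter_chunks(source):
    chunk_sizes.append(len(chunk))
    target.write(chunk)

print(chunk_sizes)
```

Only one chunk is held in memory at a time, so peak memory use stays near the chunk size regardless of how large the resource is.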

III. Scraping Videos from xiaohuar.com (with Concurrency)

import requests         # install with: pip3 install requests
import re
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

pool=ThreadPoolExecutor(50)
movie_path=r'C:\mp4'

def get_page(url):
    try:
        response=requests.get(url)
        if response.status_code == 200:
            return response.text
    except Exception:
        pass
        
def parse_index(index_page):
    index_page=index_page.result()
    if index_page is None:                  # get_page returns None when the request fails
        return
    # Collect the link of every tag whose class is "items"; re.S lets .*? match across newlines
    urls=re.findall('class="items".*?href="(.*?)"',index_page,re.S)
    for detail_url in urls:
        if not detail_url.startswith('http'):
            detail_url='http://www.xiaohuar.com'+detail_url
        pool.submit(get_page,detail_url).add_done_callback(parse_detail)

def parse_detail(detail_page):
    detail_page=detail_page.result()
    if detail_page is None:                 # skip detail pages that failed to download
        return
    # Find the src of the element whose id is "media" (the video address)
    l=re.findall('id="media".*?src="(.*?)"',detail_page,re.S)
    if l:
        movie_url=l[0]
        if movie_url.endswith('mp4'):
            pool.submit(get_movie,movie_url)
            
def get_movie(url):
    try:
        response=requests.get(url)
        if response.status_code == 200:
            # Name the file after an MD5 of the current time plus the URL to avoid collisions
            m=hashlib.md5()
            m.update(str(time.time()).encode('utf-8'))
            m.update(url.encode('utf-8'))
            filepath='%s\\%s.mp4' %(movie_path,m.hexdigest())
            with open(filepath,'wb') as f:                        # video is binary data, so open with 'wb'
                f.write(response.content)
                print('%s downloaded successfully' %url)
    except Exception:
        pass
        
def main():
    base_url='http://www.xiaohuar.com/list-3-{page_num}.html'
    for i in range(5):
        url=base_url.format(page_num=i)
        pool.submit(get_page,url).add_done_callback(parse_index)
        
if __name__ == '__main__':
    main()
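
The submit-then-callback pattern used above can be tried in isolation. This is a minimal sketch: `fake_fetch` is a hypothetical stand-in for a real download, and the URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-in for a network fetch: returns fake page text for a URL.
def fake_fetch(url):
    return "<html>%s</html>" % url


results = []


def on_done(future):
    """Callback: receives the finished Future, not the return value itself."""
    page = future.result()
    if page is not None:
        results.append(len(page))


urls = ["http://example.com/%d" % i for i in range(5)]

# Leaving the with-block waits for every submitted task to finish.
with ThreadPoolExecutor(max_workers=3) as pool:
    for url in urls:
        pool.submit(fake_fetch, url).add_done_callback(on_done)

print(sorted(results))
```

Because the callback is handed a Future, it has to call `result()` first, which is why `parse_index` and `parse_detail` both start that way.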

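IV. Simulating a Login to GitHub with a Crawler

import requests
import re
# Notes on the third request (after a successful login):
#   - before writing it, log in manually in a browser and check whether a Referer header is sent
#   - request a new URL to carry out further operations
#   - check whether the username appears in the response

# First request: GET
response1 = requests.get(
    "https://github.com/login",
    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
    },
)
# findall returns a list, so take the first match as the token
authenticity_token = re.findall('name="authenticity_token".*?value="(.*?)"',response1.text,re.S)[0]
r1_cookies =  response1.cookies.get_dict()                   # cookies returned by the first request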

# Second request: POST
response2 = requests.post(
    "https://github.com/session",
    headers = {
        "Referer": "https://github.com/",
        "User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
    },
    data={
            "commit":"Sign in",
            "utf8":"✓",
            "authenticity_token":authenticity_token,
            "login":"zzzzzzzz",
            "password":"xxxx",
    },
    cookies = r1_cookies
)
print(response2.status_code)
print(response2.history)  # status codes of the redirects that were followed

# Third request: after a successful login, visit another page
r2_cookies = response2.cookies.get_dict()           # carry these cookies, which represent the logged-in state, when visiting pages
response3 = requests.get(
    "https://github.com/settings/emails",
    headers = {
        "Referer": "https://github.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
    },
    cookies = r2_cookies,
)
print(response3.text)
print("zzzzzzzz" in response3.text)             # True means the login succeeded
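
The token-extraction step of the login flow can be checked offline. The HTML fragment below is invented for illustration and the real login page will differ; the point is that `re.findall` returns a list, so the token has to be picked out explicitly.

```python
import re

# Hypothetical fragment of a login form; real pages will differ.
login_html = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="tok123==" />
  <input type="text" name="login" />
</form>
'''

matches = re.findall(r'name="authenticity_token".*?value="(.*?)"', login_html, re.S)

# findall always returns a list, so pick the first match explicitly
# and fail loudly if the page layout changed and nothing matched.
if not matches:
    raise ValueError("authenticity_token not found")
token = matches[0]

form_data = {"authenticity_token": token, "login": "zzzzzzzz", "password": "xxxx"}
print(form_data["authenticity_token"])
```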

V. Advanced Usage

1. SSL Cert Verification

import requests
# cert supplies a client-side certificate and private key for sites that require one
response=requests.get('https://www.12306.cn',
                     cert=('/path/server.crt','/path/key'))
print(response.status_code)

2. Using Proxies

# Official docs: http://docs.python-requests.org/en/master/user/advanced/#proxies
# With a proxy configured, the request goes to the proxy first and the proxy forwards it (IP bans are common)
import requests
proxies={
    # A proxy that needs credentials: the username and password go before the @ sign
    # 'http':'http://egon:123@localhost:9743',
    # A dict keeps only one value per key, so use a single 'http' entry
    'http':'http://localhost:9743',
    'https':'https://localhost:9743',
}
response=requests.get('https://www.12306.cn',proxies=proxies)
print(response.status_code)
# SOCKS proxies are also supported; install with: pip install requests[socks]
import requests
proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
response=requests.get('https://www.12306.cn',proxies=proxies)
print(response.status_code)

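3. Timeouts

# timeout accepts two forms: a float or a tuple
# timeout=0.1         # 0.1 s applies to both the connect timeout and the read timeout
# timeout=(0.1,0.2)   # 0.1 s is the connect timeout, 0.2 s is the read (receive data) timeout
import requests
response=requests.get('https://www.baidu.com', timeout=0.0001)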
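
The two timeout forms can be thought of as shorthand for one (connect, read) pair. The helper below is purely illustrative: `normalize_timeout` is a hypothetical name and models the documented behaviour rather than reproducing library code.

```python
def normalize_timeout(timeout):
    """Expand a timeout setting into an explicit (connect, read) pair."""
    if timeout is None:
        return (None, None)              # no limit on either phase
    if isinstance(timeout, tuple):
        connect, read = timeout          # must contain exactly two values
        return (connect, read)
    return (timeout, timeout)            # a single number covers both phases


pairs = [normalize_timeout(t) for t in (0.1, (0.1, 0.2), None)]
print(pairs)
```

Using a tuple is handy when connecting should fail fast but a slow server should still be given time to send its response.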

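4. Authentication

# Official docs: http://docs.python-requests.org/en/master/user/authentication/
# Some sites pop up a dialog asking for a username and password when you log in (much like an alert box); until you authenticate, the HTML cannot be fetched
# Under the hood, the credentials are simply assembled into a request header:
#         r.headers['Authorization'] = _basic_auth_str(self.username, self.password)
# Have a look at how this default scheme encodes credentials (Basic auth is Base64 encoding, not encryption); sites usually do not use the default scheme

import requests
from requests.auth import HTTPBasicAuth
r=requests.get('xxx',auth=HTTPBasicAuth('user','password'))
print(r.status_code)

# HTTPBasicAuth can be abbreviated as follows
import requests
r=requests.get('xxx',auth=('user','password'))
print(r.status_code)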
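
To make the header construction concrete, here is a sketch of how a Basic Authorization value is built and why it offers no secrecy. `basic_auth_header` is a hypothetical helper, and encoding the credentials as UTF-8 is an assumption of this example.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    raw = ("%s:%s" % (username, password)).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


header = basic_auth_header("user", "password")
print(header)

# Base64 is reversible, so anyone who sees the header can recover the password.
recovered = base64.b64decode(header.split(" ", 1)[1]).decode("utf-8")
print(recovered)
```

This is why Basic authentication should only be sent over HTTPS.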

5. Exception Handling

import requests
from requests.exceptions import * # see requests.exceptions for the available exception types
try:
    r=requests.get('http://www.baidu.com',timeout=0.00001)
except ReadTimeout:
    print('===:')
# except ConnectionError:
#     print('----- network unreachable')
# except Timeout:
#     print('aaaaa')
except RequestException:
    print('Error')

6. Uploading Files

import requests
with open('a.jpg','rb') as f:               # open in binary mode; the file is closed when the block ends
    files={'file':f}
    response=requests.post('http://httpbin.org/post',files=files)
print(response.status_code)
