03: Scraping web page data with requests and BeautifulSoup

Axin • Published: 2018-03-11


1.1 Review of the crawler-related modules and commands

1. The requests module

    1. pip install requests

    2. response = requests.get('http://www.baidu.com/')  # fetch the page at the given URL

    3. response.text  # response body decoded to str

    4. response.content  # response body as raw bytes

    5. response.encoding = 'utf-8'  # decode the body as UTF-8

       response.encoding = response.apparent_encoding  # decode with whatever encoding is detected from the page content

    6. response.cookies  # cookies returned by the server

       response.cookies.get_dict()  # the same cookies as a plain dict
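To make the difference between response.text and response.content concrete, here is a minimal offline sketch that uses only bytes and str. The sample string and the choice of GBK are assumptions for illustration; reading response.text performs an equivalent decode using response.encoding.

```python
# Offline sketch: response.content is bytes, response.text is those bytes
# decoded with response.encoding. The sample body below is made up.
raw_body = '汽车之家 news'.encode('gbk')      # stands in for response.content

# Decoding with the wrong codec yields garbled text (replacement characters).
wrong = raw_body.decode('utf-8', errors='replace')

# Decoding with the page's real encoding recovers the text, which is
# what setting response.encoding to the right value achieves.
right = raw_body.decode('gbk')

print(type(raw_body).__name__, type(right).__name__)
print(right)
print(wrong != right)
```

This is why the examples below set response.encoding = response.apparent_encoding before reading response.text.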

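2. The BeautifulSoup module

    1. pip install beautifulsoup4

    2. Parse the text into an object (from bs4 import BeautifulSoup)

        1) html.parser is built into Python, so nothing extra needs installing

            soup = BeautifulSoup(response.text, features='html.parser')

        2) lxml is a third-party library but performs better (use this one in production)

            soup = BeautifulSoup(response.text, features='lxml')

    3. .find(): returns a single object (the first match, or None if nothing matches)

        1) Find the element with id="auto-channel-lazyload-article" (a div) in the scraped content

            target = soup.find(id="auto-channel-lazyload-article")

        2) Find a div in the scraped content whose id attribute is 'i1'

            target = soup.find('div', id='i1')

    4. .find_all(): returns a list of objects

        1) Find every li tag inside the target object obtained above

            li_list = target.find_all('li')

    5. Reading attributes and text from an object returned by .find()

        a.attrs.get('href')  # the href attribute of this a tag (its link URL)

        a.find('h3').text  # the text of the first h3 tag inside the a tag

        img_url = a.find('img').attrs.get('src')  # the src attribute of the first img tag inside the a tag (the image URL)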
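The calls listed above can be tried without touching the network. The sketch below parses a small made-up HTML fragment (the markup, titles and example.com image hosts are assumptions, shaped like the page used in section 1.2) and shows what .find(), .find_all(), .attrs.get() and .text return.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure used in this section.
html = """
<div id="auto-channel-lazyload-article">
  <ul>
    <li><a href="/news/1.html"><h3>First title</h3><img src="//img.example.com/1.jpg"></a></li>
    <li><a href="/news/2.html"><h3>Second title</h3><img src="//img.example.com/2.jpg"></a></li>
    <li>no link here</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, features='html.parser')

target = soup.find(id='auto-channel-lazyload-article')   # a single Tag (or None)
li_list = target.find_all('li')                          # a list of Tags

rows = []
for li in li_list:
    a = li.find('a')
    if a is None:            # .find() returns None when nothing matches
        continue
    rows.append((a.attrs.get('href'), a.find('h3').text, a.find('img').attrs.get('src')))

print(len(li_list))
print(rows)
```

The third li has no a tag, which is why the loop checks for None before reading attributes.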

1.2 Scraping pages that do and do not require login

import uuid

import requests
from bs4 import BeautifulSoup

response = requests.get(
    url='http://www.autohome.com.cn/news/'
)

response.encoding = response.apparent_encoding          # decode with the encoding detected from the page

# 1 Parse the text into an object
# soup = BeautifulSoup(response.text, features='lxml')       # lxml is third-party but faster (use it in production)
soup = BeautifulSoup(response.text, features='html.parser')  # html.parser is built into Python, nothing to install

# 2 Find the element with id="auto-channel-lazyload-article" (a div) in the scraped content
target = soup.find(id="auto-channel-lazyload-article")

# 3.1 Find every li tag; .find() returns only the first match
# 3.2 Combined lookups also work, e.g. .find('div', id='i1')
# 3.3 .find() returns a single object, .find_all() returns a list
li_list = target.find_all('li')

for i in li_list:
    a = i.find('a')
    if a:
        print(a.attrs.get('href'))                  # link URL of this a tag
        # a.find('h3') returns a tag object; adding .text gives its text
        txt = a.find('h3').text                     # text of the first h3 inside the a tag
        print(txt, type(txt))
        img_url = a.find('img').attrs.get('src')    # src of the first img inside the a tag (the image URL)
        file_name = str(uuid.uuid4()) + '.jpg'

        if img_url.startswith('//www2'):            # the image URLs on the page have been altered, so fix them up here
            img_url2 = img_url.replace('//www2', 'http://www3')
            img_response = requests.get(url=img_url2)
            with open(file_name, 'wb') as f:
                f.write(img_response.content)       # save the image locally

Example 1: Scraping the Autohome news page (a page that needs no login)
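Example 1 turns a src value that starts with "//" into a full URL by string replacement. A more general option, sketched here under the assumption that only the missing scheme is the problem, is urljoin from the standard library. The helper name absolute_url and the example.com hosts are hypothetical. Note that urljoin leaves the host unchanged, so it does not reproduce the www2 to www3 swap made in the example.

```python
from urllib.parse import urljoin

PAGE_URL = 'http://www.autohome.com.cn/news/'   # page the src values were read from


def absolute_url(src, page_url=PAGE_URL):
    """Resolve a src value (protocol-relative, relative or absolute) against the page URL."""
    return urljoin(page_url, src)


print(absolute_url('//www2.example.com/pic/a.jpg'))    # scheme is taken from the page URL
print(absolute_url('/static/b.jpg'))                   # host is taken from the page URL
print(absolute_url('https://cdn.example.com/c.jpg'))   # already absolute, returned as-is
```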

import requests

# 1 Put the Chouti login phone number and password in a dict
post_dict = {
    'phone': '86185387525',
    'password': '74810',
    'oneMonth': 1
}

# 2 POST the credentials dict to the Chouti login URL
response = requests.post(
    url='http://dig.chouti.com/login',
    data=post_dict
)

# 3 The response body returned after a successful login
print(response.text)
# {"result":{"code":"9999", "message":"", "data":{"complateReg":"0","destJid":"cdu_49844923242"}}}

# 4 Print the cookie dict returned after a successful login
cookie_dict = response.cookies.get_dict()
print(cookie_dict)
# {'JSESSIONID': 'aaaVizwwcod_L5QcwwR9v', 'puid': 'd332ef55361217e544b91f081090ad5e',
#  'route': '37316285ff8286c7a96cd0b03d38e13b', 'gpsd': 'f8b07e259141ae5a11d930334fbfb609'}

# 5 To fetch pages that are only visible after login, send the cookie dict returned by the login with each request
response = requests.get(
    url='http://dig.chouti.com/profile',
    cookies=cookie_dict
)

Example 2: Logging in to Chouti automatically and fetching the user profile page (cookie approach)
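To see what data= and cookies= actually put on the wire, a request can be built and prepared without being sent. This is a minimal sketch using requests.Request and its prepare() method; the phone number, password and cookie value are placeholders, not working credentials.

```python
import requests

# Build, but do not send, the two kinds of request used in the example.
# The phone number, password and cookie value are placeholders.
login = requests.Request(
    'POST',
    'http://dig.chouti.com/login',
    data={'phone': '8600000000000', 'password': 'secret', 'oneMonth': 1},
).prepare()

profile = requests.Request(
    'GET',
    'http://dig.chouti.com/profile',
    cookies={'gpsd': 'abc123'},
).prepare()

print(login.headers['Content-Type'])   # form encoding chosen for data=<dict>
print(login.body)                      # the URL-encoded form body
print(profile.headers['Cookie'])       # the cookies dict rendered as a header
```

A dict passed as data= becomes a URL-encoded form body, and a dict passed as cookies= becomes a Cookie request header; nothing is added to the URL itself.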

1.3 Summary of crawler login examples

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests

# ## 1. Visit any page first to obtain a cookie
i1 = requests.get(url="http://dig.chouti.com/")
i1_cookies = i1.cookies.get_dict()

# ## 2. Log in, sending the cookie from step 1; the backend authorizes the gpsd value in that cookie
i2 = requests.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "8618538752511",
        'password': "7481079xl",
        'oneMonth': ""
    },
    cookies=i1_cookies
)

# ## 3. Upvote (only the now-authorized gpsd cookie needs to be sent)
gpsd = i1_cookies['gpsd']
i3 = requests.post(
    url="http://dig.chouti.com/link/vote?linksId=15074576",
    cookies={'gpsd': gpsd}
)
print(i3.text)

Example 1: Approach 1: upvoting on Chouti using cookies

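import requests

# A Session stores the cookies it receives and sends them on later requests
session = requests.Session()

# 1. Visit a page first so the session receives the initial cookies
i1 = session.get(url="http://dig.chouti.com/help/service")

# 2. Log in; the session sends the stored cookies automatically
i2 = session.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "8618538752511",
        'password': "7481079xl",
        'oneMonth': ""
    },
)

# 3. Upvote; no cookies argument is needed
i3 = session.post(
    url="http://dig.chouti.com/link/vote?linksId=15074576"
)
print(i3.text)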

Example 2: Approach 2: upvoting on Chouti using a session
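The reason the session version needs no cookies= argument can be shown offline. In this sketch a cookie is placed in the session by hand (the value is a placeholder standing in for one set by an earlier response), and Session.prepare_request() is used to build the next request without sending it.

```python
import requests

session = requests.Session()

# Pretend an earlier response set this cookie; the value is a placeholder.
session.cookies.set('gpsd', 'abc123')

# prepare_request() merges the session's state into the request without sending it.
vote = session.prepare_request(
    requests.Request('POST', 'http://dig.chouti.com/link/vote?linksId=15074576')
)

print(vote.headers.get('Cookie'))   # the stored cookie is attached automatically
```

With plain requests.get() and requests.post() calls, as in Approach 1, the same cookie has to be passed explicitly on every request.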

import requests
from bs4 import BeautifulSoup

# Step 1: obtain the CSRF token
# 1.1 Fetch the login page
r1 = requests.get(url='https://github.com/login')
# 1.2 Parse the text into an object
b1 = BeautifulSoup(r1.text, 'html.parser')
# 1.3 Find the tag that holds the CSRF token
tag = b1.find(name='input', attrs={'name': 'authenticity_token'})
# 1.4 Read the token value
# tag.get('value') is equivalent to tag.attrs.get('value')
token = tag.get('value')
# 1.5 Get the cookie dict returned by the first GET request
r1_cookie = r1.cookies.get_dict()
print('first', r1_cookie)

# Step 2: send a POST request with the username, password and the cookie from the first GET; the backend authorizes it
# 2.1 Log in by POSTing the CSRF token, cookies, username and password
# requests.post() is equivalent to requests.request('post', ...)
r2 = requests.post(
    url='https://github.com/session',
    data={                        # the keys must match the fields of the real login form
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': token,
        'login': '[email protected]',
        'password': '7481079xl',
    },
    cookies=r1_cookie,
)
# 2.2 Get the cookie dict returned by the second request
r2_cookie = r2.cookies.get_dict()
print('second', r2_cookie)
# 2.3 Merge the two cookie dicts: keys only in r1_cookie are kept, keys in both take the value from r2_cookie
r1_cookie.update(r2_cookie)

# Step 3: visit the profile page, sending the cookies
r3 = requests.get(
    url='https://github.com/settings/profile',
    cookies=r1_cookie,            # send the cookies from the successful login
)
print(r3.text)

Example 3: Logging in to GitHub with a crawler and fetching the user profile settings
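Step 2.3 of Example 3 relies on how dict.update() resolves overlapping keys. The sketch below shows that behaviour with two placeholder dicts standing in for r1.cookies.get_dict() and r2.cookies.get_dict(); the cookie names and values are made up for illustration.

```python
# Placeholder dicts standing in for r1.cookies.get_dict() and r2.cookies.get_dict().
r1_cookie = {'_gh_sess': 'before-login', 'logged_in': 'no'}
r2_cookie = {'_gh_sess': 'after-login', 'user_session': 'xyz'}

merged = dict(r1_cookie)     # copy, so the first dict is left untouched
merged.update(r2_cookie)     # keys present in both take the value from r2_cookie

print(merged)
```

Example 3 calls update() on r1_cookie directly, which gives the same merged result but overwrites the first dict in place.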

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import time

import requests
from bs4 import BeautifulSoup

# The same browser User-Agent is sent with every request
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
}

session = requests.Session()

# 1. Fetch the sign-in page and read the _xsrf token from its hidden input
i1 = session.get(url='https://www.zhihu.com/#signin', headers=HEADERS)

soup1 = BeautifulSoup(i1.text, 'lxml')
xsrf_tag = soup1.find(name='input', attrs={'name': '_xsrf'})
xsrf = xsrf_tag.get('value')

# 2. Download the captcha image and save it locally
current_time = time.time()
i2 = session.get(
    url='https://www.zhihu.com/captcha.gif',
    params={'r': current_time, 'type': 'login'},
    headers=HEADERS,
)

with open('zhihu.gif', 'wb') as f:
    f.write(i2.content)

# 3. Ask the user to read the captcha, then submit the login form
captcha = input('Open zhihu.gif, read the captcha and type it here: ')
form_data = {
    '_xsrf': xsrf,
    'password': 'xxooxxoo',
    'captcha': captcha,           # the value typed by the user, not the literal string 'captcha'
    'email': '[email protected]'
}

i3 = session.post(
    url='https://www.zhihu.com/login/email',
    data=form_data,
    headers=HEADERS,
)

# 4. Fetch the profile settings page and print the nickname
i4 = session.get(url='https://www.zhihu.com/settings/profile', headers=HEADERS)

soup4 = BeautifulSoup(i4.text, 'lxml')
tag = soup4.find(id='rename-section')
nick_name = tag.find('span', class_='name').string
print(nick_name)

Example 4: Logging in to Zhihu

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
import json
import base64

import rsa
import requests


def js_encrypt(text):
    # Encrypt text with the site's RSA public key and return it base64-encoded
    b64der = 'MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCp0wHYbg/NOPO3nzMD3dndwS0MccuMeXCHgVlGOoYyFwLdS24Im2e7YyhB0wrUsyYf0/nhzCzBK8ZC9eCWqd0aHbdgOQT6CuFQBMjbyGYvlVYU2ZP7kG9Ft6YV6oc9ambuO7nPZh+bvXH0zDKfi02prknrScAKC0XhadTHT3Al0QIDAQAB'
    der = base64.standard_b64decode(b64der)

    pk = rsa.PublicKey.load_pkcs1_openssl_der(der)
    v1 = rsa.encrypt(bytes(text, 'utf8'), pk)
    value = base64.encodebytes(v1).replace(b'\n', b'')
    value = value.decode('utf8')

    return value


session = requests.Session()

# 1. Fetch the sign-in page and extract the VerificationToken from its inline script
i1 = session.get('https://passport.cnblogs.com/user/signin')
rep = re.compile("'VerificationToken': '(.*)'")
v = re.search(rep, i1.text)
verification_token = v.group(1)

# 2. POST the encrypted username and password as JSON, with the token in a header
form_data = {
    'input1': js_encrypt('wptawy'),
    'input2': js_encrypt('asdfasdf'),
    'remember': False
}
i2 = session.post(url='https://passport.cnblogs.com/user/signin',
                  data=json.dumps(form_data),
                  headers={
                      'Content-Type': 'application/json; charset=UTF-8',
                      'X-Requested-With': 'XMLHttpRequest',
                      'VerificationToken': verification_token}
                  )

# 3. Fetch a page that is only available after login
i3 = session.get(url='https://i.cnblogs.com/EditDiary.aspx')
print(i3.text)

Example 5: Logging in to cnblogs
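Example 5 pulls the VerificationToken out of the page with a regular expression. The sketch below runs the same idea against a made-up script fragment (the fragment and token value are assumptions, not the real page) and compares the greedy (.*) pattern from the example with a stricter [^']* variant that stops at the closing quote.

```python
import re

# Made-up fragment imitating the inline script the example searches.
page = """
<script>
    var settings = {'VerificationToken': 'tok-12345', 'other': 'x'};
</script>
"""

# Greedy pattern as in the example: runs on to the last quote on the line.
greedy = re.search(r"'VerificationToken': '(.*)'", page).group(1)

# [^']* cannot cross a quote, so it stops at the end of the token.
pattern = re.compile(r"'VerificationToken':\s*'([^']*)'")
match = pattern.search(page)
token = match.group(1) if match else None   # None if the page layout changed

print(greedy)
print(token)
```

The greedy pattern only gives the right answer when the token is the last quoted value on its line, so the stricter pattern is the safer choice if the page layout is not known in advance.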
