Crawler [1]: Page Analysis and Data Scraping

阿新 • Published: 2018-11-26

Page analysis and data scraping

Installing anaconda + scrapy: https://blog.csdn.net/dream_dt/article/details/80187916
Initializing a crawler with scrapy: https://blog.csdn.net/dream_dt/article/details/80188592

The page to crawl is the Taobao item page whose URL appears in the command below.

Copy the URL, then in Anaconda Prompt cd to the project directory and enter:

scrapy genspider skirt "https://item.taobao.com/item.htm?id=537194970660&ali_refid=a3_430673_1006:1105679232:N:%E5%A5%B3%E8%A3%85:30942d37a432dd6b95fad6c34caf5bd5&ali_trackid=1_30942d37a432dd6b95fad6c34caf5bd5&spm=a2e15.8261149.07626516002.3"

This generates a new file:

skirt.py

# -*- coding: utf-8 -*-
import scrapy


class SkirtSpider(scrapy.Spider):
    name = 'skirt'
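    allowed_domains = ['https://item.taobao.com']  # as generated; keep only the domain, i.e. 'item.taobao.com'
    start_urls = ['http://https://item.taobao.com/item.htm?id=537194970660/']  # as generated; the start page, whose doubled scheme must be fixed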

    def parse(self, response):
        pass
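
The generated allowed_domains entry above still carries the https:// scheme, while Scrapy expects bare host names there. As a minimal sketch using only the standard library, the host can be pulled out of the item URL like this; the names item_url and allowed_domain are illustrative and not part of the original project.

```python
from urllib.parse import urlparse

# The item URL from the genspider command, shortened to its id parameter.
item_url = "https://item.taobao.com/item.htm?id=537194970660"

# netloc is the host part only, which is the form allowed_domains expects.
allowed_domain = urlparse(item_url).netloc

print(allowed_domain)  # item.taobao.com
```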

Create a main script:

main.py

# -*- coding: utf-8 -*-

from scrapy.cmdline import execute

import sys
import os

sys.path.append(os.path.dirname(os.path.abspath(__file__)))  # add this file's directory to the import path

execute(["scrapy", "crawl", "skirt"])  # same as running "scrapy crawl skirt" in a shell

Edit settings.py; otherwise Scrapy obeys robots.txt and filters out every page that the site's rules disallow.

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

Getting the unit price

In Anaconda Prompt, enter

scrapy shell "https://item.taobao.com/item.htm?id=537194970660&ali_refid=a3_430673_1006:1105679232:N:%E5%A5%B3%E8%A3%85:30942d37a432dd6b95fad6c34caf5bd5&ali_trackid=1_30942d37a432dd6b95fad6c34caf5bd5&spm=a2e15.8261149.07626516002.3"

Only one element on the page has the class tb-rmb-num, so a class selector is enough to locate it. In the Scrapy shell, enter

response.css('.tb-rmb-num')

To get the value 128 itself, select the text node:

response.css('.tb-rmb-num::text')

That still returns selector objects; extract() turns them into a list of strings:

response.css('.tb-rmb-num::text').extract()

Indexing with [0] takes the first string from that list:

response.css('.tb-rmb-num::text').extract()[0]

Update skirt.py:

    def parse(self, response):
        # get the unit price
        price = response.css('.tb-rmb-num::text').extract()[0]

Getting the review count

In the Scrapy shell, enter

response.css('#J_RateCounter::text').extract()

No data comes back, which suggests the value is rendered dynamically by JavaScript, so the page source needs to be analysed.

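Pasting the URL stored under rateCounterApi into a browser returns a short response of the form null({"count":627}).

So the review count can be reached through the rateCounterApi field in the first script block.
In the Scrapy shell, enter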

response.css('script::text')[0].extract()

Update skirt.py:

    def parse(self, response):
        # get the unit price
        price = response.css('.tb-rmb-num::text').extract()[0]

        # text of the first <script> block on the page
        first_js_script = response.css('script::text')[0].extract()

Filtering the data to get the review count

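The first field in the first script begins with "var g_config =" and ends just before "g_config.tadInfo".

Use a regular expression to match that first field.
\s: a whitespace character
\S: a non-whitespace character
*: repeat any number of times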
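
The reason for pairing \s with \S is that a plain dot does not cross line breaks by default, while the g_config object spans many lines. The sketch below illustrates the difference on a made-up script; sample_script and the example.invalid URL are placeholders, not content from the real page.

```python
import re

# Hypothetical stand-in for the first <script> block; the real one is much longer.
sample_script = """
var g_config = {
    rateCounterApi   : '//example.invalid/counter?id=1',
    other            : 1
};
g_config.tadInfo = {};
"""

# '.' stops at newlines by default, so '.*' cannot span the multi-line object.
dot_match = re.findall(r'var g_config = (.*)g_config\.tadInfo', sample_script)

# '[\s\S]' matches whitespace or non-whitespace, that is, any character at all.
any_match = re.findall(r'var g_config = ([\s\S]*)g_config\.tadInfo', sample_script)

print(dot_match)                         # []
print(len(any_match))                    # 1
print('rateCounterApi' in any_match[0])  # True
```

Passing re.DOTALL with a plain dot would be an equivalent alternative.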

re.findall returns a list; add to skirt.py:

import re
g_config = re.findall(r'var g_config = ([\s\S]*)g_config\.tadInfo', first_js_script)[0]

Next, match the URL stored under rateCounterApi; add to skirt.py:

rate_counter_api = re.findall("rateCounterApi   : '//(.*)',", g_config)[0]

Request the review-count URL; add to skirt.py:

import requests
rate_count_response = requests.get("http://" + rate_counter_api)  # null({"count":627})

Extract the review count; add to skirt.py:

rate_count = re.findall('"count":(.*)}', rate_count_response.text)[0]
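
The response body is JSON wrapped in a function-call style (JSONP) envelope. As an alternative sketch, assuming the body always has the shape shown in the comment above, the wrapper can be stripped and the rest handed to the json module, which yields the count as an integer rather than a string. The variable names body and payload are illustrative.

```python
import json
import re

# Sample body taken from the comment in the snippet above.
body = 'null({"count":627})'

# Capture the JSON object between the parentheses of the JSONP wrapper.
payload = re.search(r'\((\{.*\})\)', body).group(1)

rate_count = json.loads(payload)["count"]
print(rate_count)  # 627
```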

skirt.py now looks like this:

# -*- coding: utf-8 -*-
import scrapy
import re
import requests

class SkirtSpider(scrapy.Spider):
    name = 'skirt'
    allowed_domains = ['item.taobao.com']  # domain only, without the scheme
    start_urls = ['https://item.taobao.com/item.htm?id=537194970660&ali_refid=a3_430673_1006:1105679232:N:%E5%A5%B3%E8%A3%85:30942d37a432dd6b95fad6c34caf5bd5&ali_trackid=1_30942d37a432dd6b95fad6c34caf5bd5&spm=a2e15.8261149.07626516002.3']
    
    def parse(self, response):
        
        # get the unit price
        price = response.css('.tb-rmb-num::text').extract()[0]

        # --- get the review count ---

        # text of the first <script> block on the page
        first_js_script = response.css('script::text')[0].extract()

        g_config = re.findall(r'var g_config = ([\s\S]*)g_config\.tadInfo', first_js_script)[0]

        rate_counter_api = re.findall("rateCounterApi   : '//(.*)',", g_config)[0]

        # request the review-count URL
        rate_count_response = requests.get("http://" + rate_counter_api)  # null({"count":627})

        # extract the review count
        rate_count = re.findall('"count":(.*)}', rate_count_response.text)[0]
    
        print(price)
        print(rate_count)

For reasons I could not determine, running main.py raises an error:

 raise error.ReactorNotRestartable()
	ReactorNotRestartable

So instead, in Anaconda Prompt, enter:

scrapy crawl skirt

This prints the price and the review count.

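Getting the actual review content

Again, the reviews cannot be read directly from the page; a JavaScript script loads them from elsewhere, much like the review count.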

In the browser developer tools, open the "Network" tab and clear it.

Click page 2 of the reviews; several new requests appear.

Click that request to verify that it carries the review data.
How can this review data be fetched? Switch to the "Headers" tab, which shows the Request URL.

Copying that URL into a new browser tab returns the review data.

The next step is to analyse the URL and see which parts are actually needed. The full URL is:

https://rate.taobao.com/feedRateList.htm?auctionNumId=537194970660&userNumId=794196473&currentPageNum=2&pageSize=20&rateType=&orderType=sort_weight&attribute=&sku=&hasSku=false&folded=0&ua=098%23E1hvKpvWvPvvUvCkvvvvvjiPR2MhAjnCPLd9zjrCPmP9gjDnRFsy6jEmn2MUsjrm9phvVZ2MWlAQ7rMNz1CKz8otUSiswIFU1qYbJzWPvpvhvv2MMQyCvhQhhQyvCsxleExrV8t%2Bm7zhaf9gKFnfIExrs8TZfvDrAjc6%2Bul1bbmxfwp4d56Ofw3l%2Bb8rwkM6D7zhVut%2Bm7zh6j7J%2B3%2BiafmxfBeKKphv8vvvvvCvpvvvvvv2vhCvmnGvvvWvphvW9pvvvQCvpvs9vvv2vhCv2RmivpvUvvmv%2BQeoltAEvpvVmvvC9jXmvphvC9vhvvCvp2yCvvpvvvvviQhvCvvv9U8jvpvhvvpvv2yCvvpvvvvvdphvmQ9ZW9UYPpLY5gyA&_ksTS=1543128829263_1021&callback=jsonp_tbcrate_reviews_list

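The parameters are joined with the & symbol. The leading part of the URL is easy to read, while the rest is not, so ignore the rest for now and open a new page with this shorter URL: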

https://rate.taobao.com/feedRateList.htm?auctionNumId=537194970660&userNumId=794196473&currentPageNum=2&pageSize=20

The request succeeds, so the trailing part of the URL is not required.

Likewise, the userNumId parameter is not needed and can be dropped. Setting currentPageNum=1 gives the following URL:

https://rate.taobao.com/feedRateList.htm?auctionNumId=537194970660&currentPageNum=1&pageSize=20

The pageSize parameter is the number of reviews returned per request; the default is 20.

Next, URLs like the following are built by concatenation to fetch the actual reviews.

https://rate.taobao.com/feedRateList.htm?auctionNumId=537194970660&currentPageNum=2&pageSize=20
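
The trimming done by hand above can also be sketched with the standard library: split the captured URL into its parameters, keep the three that matter, and rebuild it. The captured string below is a shortened copy of the request URL with the long ua value left out, and build_review_url is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# Shortened copy of the captured request URL (the long ua value is omitted).
captured = ("https://rate.taobao.com/feedRateList.htm?auctionNumId=537194970660"
            "&userNumId=794196473&currentPageNum=2&pageSize=20"
            "&orderType=sort_weight&callback=jsonp_tbcrate_reviews_list")

parts = urlsplit(captured)
params = parse_qs(parts.query)  # each value is a list of strings


def build_review_url(page, page_size=20):
    """Rebuild the URL with only the parameters found to be necessary."""
    query = urlencode({
        "auctionNumId": params["auctionNumId"][0],
        "currentPageNum": page,
        "pageSize": page_size,
    })
    return f"{parts.scheme}://{parts.netloc}{parts.path}?{query}"


print(build_review_url(2))
```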

In the Scrapy shell, enter

response.css('#reviews')

Then read the URL from its data-listapi attribute:

data_listapi_url = response.css('#reviews::attr(data-listapi)').extract()[0]

# host and path of the review API
feed_rate_list_url = re.findall(r"//(.*?)\?", data_listapi_url)[0]

\: escape character
?: placed after a quantifier, makes the match stop at the first possible match (non-greedy)

# item id
auction_num_id = re.findall("auctionNumId=(.*?)&", data_listapi_url)[0]
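
To see what the non-greedy ? buys in these two patterns, here is a small sketch that compares it with the greedy form. The attribute value is hypothetical; the real data-listapi string may carry different parameters.

```python
import re

# Hypothetical attribute value, used only for illustration.
data_listapi_url = ("//rate.taobao.com/feedRateList.htm"
                    "?auctionNumId=537194970660&userNumId=794196473&currentPageNum=1")

# Non-greedy: stop at the first '?' and at the first '&'.
host_and_path = re.findall(r"//(.*?)\?", data_listapi_url)[0]
auction_num_id = re.findall(r"auctionNumId=(.*?)&", data_listapi_url)[0]

# Greedy, for comparison: runs on to the last '&' in the string.
greedy_id = re.findall(r"auctionNumId=(.*)&", data_listapi_url)[0]

print(host_and_path)   # rate.taobao.com/feedRateList.htm
print(auction_num_id)  # 537194970660
print(greedy_id)       # 537194970660&userNumId=794196473
```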

import math

# number of reviews requested per page
page_size = 20

# total number of review pages
pages = math.ceil(int(rate_count) / page_size)

Concatenate the parameters to build the URL for each page:

for current_page_number in range(1, pages + 1):
    yield scrapy.Request(url="http://" + feed_rate_list_url
                         + "?auctionNumId=" + auction_num_id
                         + "&currentPageNum=" + str(current_page_number)
                         + "&pageSize=" + str(page_size),
                         callback=self.parse_rate_list)
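
Page numbers start at 1 here, so the loop bounds deserve a quick check. The sketch below works through the arithmetic with the sample count of 627 from earlier; the values are illustrative, and a real run would use whatever the counter API returns.

```python
import math

rate_count = "627"  # sample count, a string as returned by the regex
page_size = 20

# 627 / 20 = 31.35, so 32 pages are needed to cover every review.
pages = math.ceil(int(rate_count) / page_size)

# range(1, pages) stops one short; range(1, pages + 1) includes the last page.
without_last = list(range(1, pages))
with_last = list(range(1, pages + 1))

print(pages)             # 32
print(without_last[-1])  # 31
print(with_last[-1])     # 32
```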

The callback:

def parse_rate_list(self, response):
    print(response.text)

skirt.py now looks like this:

# -*- coding: utf-8 -*-
import scrapy
import re
import requests
import math

class SkirtSpider(scrapy.Spider):
    name = 'skirt'
    
    # allowed domains; requests to any domain not listed here are filtered out
    allowed_domains = ['item.taobao.com',
                       'rate.taobao.com']  # domain names only
    # page to start crawling from
    start_urls = ['https://item.taobao.com/item.htm?id=537194970660&ali_refid=a3_430673_1006:1105679232:N:%E5%A5%B3%E8%A3%85:30942d37a432dd6b95fad6c34caf5bd5&ali_trackid=1_30942d37a432dd6b95fad6c34caf5bd5&spm=a2e15.8261149.07626516002.3']
    
    def parse(self, response):
        
        # unit price of the dress
        price = response.css('.tb-rmb-num::text').extract()[0]

        # --- get the review count ---

        # text of the first <script> block on the page
        first_js_script = response.css('script::text')[0].extract()

        # regex-match the g_config field
        g_config = re.findall(r'var g_config = ([\s\S]*)g_config\.tadInfo', first_js_script)[0]

        # regex-match the review-count URL on the page
        rate_counter_api = re.findall("rateCounterApi   : '//(.*)',", g_config)[0]

        # request the review-count URL
        rate_count_response = requests.get("http://" + rate_counter_api)  # null({"count":627})

        # extract the review count
        rate_count = re.findall('"count":(.*)}', rate_count_response.text)[0]

        # this request returns the actual reviews:
        # https://rate.taobao.com/feedRateList.htm?auctionNumId=537194970660&currentPageNum=2&pageSize=20

        # data_listapi_url contains the host and path of the review API
        data_listapi_url = response.css('#reviews::attr(data-listapi)').extract()[0]

        # host and path of the review API
        feed_rate_list_url = re.findall(r"//(.*?)\?", data_listapi_url)[0]

        # item id
        auction_num_id = re.findall("auctionNumId=(.*?)&", data_listapi_url)[0]

        # number of reviews requested per page
        page_size = 20

        # total number of review pages
        pages = math.ceil(int(rate_count) / page_size)

        # request every review page in turn
        for current_page_number in range(1, pages + 1):
            yield scrapy.Request(url = "http://"+ feed_rate_list_url 
                                 + "?auctionNumId=" + auction_num_id
                                 + "&currentPageNum=" + str(current_page_number)
                                 + "&pageSize=" + str(page_size),
                                 callback = self.parse_rate_list)

    # parse the actual reviews
    def parse_rate_list(self, response):
        print(response.text)

Run:

scrapy crawl skirt

The raw response text of each review page is printed.
