[Python] Hands-On: Analyzing Tmall Bra Sales Data to Find Out What You Want to Know

阿新 • Published: 2018-09-14


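Hi everyone. I'm an honest guy, and I've decided to use Python to scrape Tmall's bra sales data, then analyze it to find the most common cup sizes among Chinese women and the most popular bra colors.

Hopefully, after reading this, you'll be able to buy your girlfriend a bra she likes.

Let's first look at the results of the analysis. (The walkthrough is detailed; I recommend typing along.)

[image]

If the image is hard to read, open it on its own in a separate window.

These conclusions come from about ten thousand records, so expect some error. Still, I hope those of you who are single manage to find a girl in that 0.06% group.

Below I'll explain in detail how to scrape Tmall's bra sales data, then store, analyze, and display it.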

Studying the Tmall site
Scraping Tmall review data
Storing and analyzing the data
Data visualization

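Studying the Tmall site

Open any product's purchase page (the one where you can see reviews), press F12 for developer tools, go to the Network tab, refresh the page, and search for list_ in the position shown in the figure. You'll see a request to list_detail_rate.htm?itemId= ....

As shown below: [single-click] this URL and you'll see that it returns JSON. Inspect it and you'll find that this JSON is the product's review data, under ['rateDetail']['rateList'].

[image]

[Double-click] this URL and you'll get a new page, as shown:

[image]

Take a look at this information:

[image]

The path here is the URL for fetching review data. It has many parameters; you can work out what each one does.

itemId is the product id, sellerId is the shop id, and currentPage is the current page. sellerId can be set to any value without affecting the data returned.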
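To make the parameters concrete, here is a minimal sketch that builds the review URL from its parts with the standard library. The helper name build_rate_url and the sample product id 123456 are my own inventions for illustration; the parameter names and the sellerId value are the ones used in this article.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://rate.tmall.com/list_detail_rate.htm"


def build_rate_url(item_id, current_page, seller_id="2451699564"):
    """Build the review-list URL for one page of one product (hypothetical helper)."""
    params = {
        "itemId": item_id,            # product id
        "sellerId": seller_id,        # shop id; the article says any value works
        "order": 3,
        "currentPage": current_page,  # page of reviews to fetch
        "append": 0,
        "callback": "jsonp336",       # JSONP callback name; yours may differ
    }
    return BASE_URL + "?" + urlencode(params)


url = build_rate_url(123456, 2)  # 123456 is a made-up product id
print(url)
print(parse_qs(urlsplit(url).query)["currentPage"])
```

Letting urlencode join the parameters avoids hand-typing the & separators between them.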

Scraping Tmall review data

Write a function to scrape Tmall review data, getCommentDetail:

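# Fetch one page of review data for a product
def getCommentDetail(itemId, currentPage):
    # itemId is the product id; sellerId (shop id) must be present, but any value works
    url = ('https://rate.tmall.com/list_detail_rate.htm?itemId=' + str(itemId)
           + '&sellerId=2451699564&order=3&currentPage=' + str(currentPage)
           + '&append=0&callback=jsonp336')
    html = common.getUrlContent(url)  # fetch the response body
    # Strip the JSONP wrapper, e.g. jsonp128({...}); the callback name may differ for you.
    # Slicing between the first '(' and the last ')' keeps parentheses inside review text intact.
    start = html.find('(')
    end = html.rfind(')')
    if start != -1 and end > start:
        html = html[start + 1:end]

    # Convert the JSON string into a dict (true/false are valid JSON and need no patching)
    tmalljson = json.loads(html)
    return tmalljson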

Note that the jsonp128 callback name is something you need to check for yourself; the value you see will probably differ from mine.
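Since the callback name varies, one option is to strip whatever wrapper is present rather than hard-coding it. The following is only a sketch under that assumption: strip_jsonp is a name I made up, and the sample response is fabricated to mimic the ['rateDetail'] structure described above.

```python
import json
import re


def strip_jsonp(text):
    """Return the parsed JSON payload of a JSONP response such as jsonp128({...})."""
    # Callback name, opening parenthesis, payload, closing parenthesis, optional semicolon.
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    payload = match.group(1) if match else text  # fall back to plain JSON
    return json.loads(payload)


# Made-up sample response; a real one has many more fields.
sample = ('jsonp128({"rateDetail": {"paginator": {"lastPage": 3}, '
          '"rateList": [{"rateContent": "nice :)", "useful": true}]}})')
data = strip_jsonp(sample)
print(data["rateDetail"]["paginator"]["lastPage"])
print(data["rateDetail"]["rateList"][0]["rateContent"])
```

Because only the outermost parentheses are removed, a review containing ")" survives intact, and true/false are parsed as real booleans.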

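Also, common is a utility class I wrote myself. It mainly wraps features from my previous blog post, such as the requests and pymysql functionality. I'll paste it later in this article.

The function above takes two variables, itemId and currentPage, which we control dynamically. So we need a batch of product ids and the last review page number for each, to iterate over.

Write a function that gets a product's last review page, getLastPage: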

# Get the last review page for a product
def getLastPage(itemId):
    tmalljson = getCommentDetail(itemId, 1)
    return tmalljson['rateDetail']['paginator']['lastPage']  # last page number

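So how do we get the list of product ids? Search for a product keyword on Tmall and watch the page in developer tools.

[image]

Look at how the elements on this page are laid out and the product id is easy to spot. You can of course find a way to double-check it.

[image]

Now write a function that gets the product ids, getProductIdList: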

# Get the product ids
def getProductIdList():
    url = 'https://list.tmall.com/search_product.htm?q=內衣'  # q is the search keyword (here: underwear)
    html = common.getUrlContent(url)  # fetch the page
    soup = BeautifulSoup(html, 'html.parser')
    idList = []
    # Use Beautiful Soup to extract every product id on the results page
    productList = soup.find_all('div', {'class': 'product'})
    for product in productList:
        if product.has_attr('data-id'):  # skip product divs that carry no id
            idList.append(product['data-id'])
    return idList
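To try the extraction logic without hitting the site, here is a small offline sketch. Note that it swaps Beautiful Soup for the standard library's html.parser so it runs with no third-party install; the sample HTML is made up and only imitates the div.product / data-id layout described above.

```python
from html.parser import HTMLParser


class ProductIdParser(HTMLParser):
    """Collect data-id from every <div> whose class list contains 'product'."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "product" in classes and attr_map.get("data-id"):
            self.ids.append(attr_map["data-id"])


# Made-up markup: two real products, one product div without an id, one unrelated div.
sample_html = """
<div class="product" data-id="1001"><p>item one</p></div>
<div class="product item" data-id="1002"></div>
<div class="product"></div>
<div class="banner" data-id="9999"></div>
"""

parser = ProductIdParser()
parser.feed(sample_html)
product_ids = parser.ids
print(product_ids)
```

The div without data-id is skipped rather than raising an error, which is the failure you would otherwise hit when the results page contains placeholder product blocks.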

Now we have all the basic pieces; it's time to put them together.

Write the remaining assembly in the main block:

if __name__ == '__main__':
    productIdList = getProductIdList()  # get the product ids
    initial = 0
    while initial < len(productIdList) - 30:  # 60 products in total; I only take the first 30
        try:
            itemId = productIdList[initial]
            print('----------', itemId, '------------')
            maxPage = getLastPage(itemId)  # last review page for this product
            num = 1
            # At most 20 pages per product at 20 reviews per page, i.e. at most 400 reviews per product
            while num <= maxPage and num <= 20:
                try:
                    # Fetch one page of reviews for one product
                    tmalljson = getCommentDetail(itemId, num)
                    rateList = tmalljson['rateDetail']['rateList']
                    commentList = []
                    n = 0
                    while (n < len(rateList)):
                        comment = []
                        # SKU description: color and size, separated by ':' and ';'
                        colorSize = rateList[n]['auctionSku']
                        m = re.split('[:;]', colorSize)
                        rateContent = rateList[n]['rateContent']
                        dtime = rateList[n]['rateDate']
                        comment.append(m[1])  # color
                        comment.append(m[3])  # size
                        comment.append('天貓')  # data source: Tmall
                        comment.append(rateContent)
                        comment.append(dtime)
                        commentList.append(comment)
                        n += 1
                    print(num)
                    sql = "insert into bra(bra_id, bra_color, bra_size, resource, comment, comment_time) values(null, %s, %s, %s, %s, %s)"
                    common.patchInsertData(sql, commentList)  # batch insert into MySQL
                except Exception as e:
                    print(e)
                finally:
                    num += 1  # always advance to the next page
        except Exception as e:
            print(e)
        finally:
            initial += 1  # always advance, so one failing product cannot loop forever
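The m[1] and m[3] indexing in the loop depends on the shape of the auctionSku string. Here is a minimal sketch of that split in isolation; the sample SKU strings are invented (real ones are in Chinese and may carry extra text), and parse_sku is a hypothetical helper, not part of the author's code.

```python
import re


def parse_sku(auction_sku):
    """Split an SKU string like 'key1:value1;key2:value2' into (color, size)."""
    parts = re.split('[:;]', auction_sku)
    if len(parts) < 4:
        return None  # unexpected format; the caller can skip this review
    return parts[1], parts[3]


print(parse_sku('color:nude;size:75B'))  # made-up, well-formed sample
print(parse_sku('size:75B'))             # made-up sample with a missing field
```

Returning None for a malformed SKU lets you drop a single review instead of losing the whole page to an IndexError.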

That completes all the code. Here are common.py and tmallbra.py in full.

# -*- coding:utf-8 -*-
# Author: zww
import requests
import time
import random
import socket
import http.client
import pymysql
import csv

# Wrapper around requests, csv and pymysql
class Common(object):
    def getUrlContent(self, url, data=None):
        header = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            # 'br' is left out: requests only decodes Brotli when a brotli package is installed
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
            'cache-control': 'max-age=0'
        }  # request headers
        timeout = random.choice(range(80, 180))
        while True:
            try:
                rep = requests.get(url, headers=header, timeout=timeout)  # request the URL and get the response
                # rep.encoding = 'utf-8'
                break
            except socket.timeout as e:  # everything below is error handling: wait, then retry
                print('3:', e)
                time.sleep(random.choice(range(8, 15)))
            except socket.error as e:
                print('4:', e)
                time.sleep(random.choice(range(20, 60)))
            except http.client.BadStatusLine as e:
                print('5:', e)
                time.sleep(random.choice(range(30, 80)))
            except http.client.IncompleteRead as e:
                print('6:', e)
                time.sleep(random.choice(range(5, 15)))
        print('request success')
        return rep.text  # full text of the response

    def writeData(self, data, url):
        with open(url, 'a', errors='ignore', newline='') as f:
            f_csv = csv.writer(f)
            f_csv.writerows(data)
        print('write_csv success')

    def queryData(self, sql):
        # Keyword arguments are used because newer PyMySQL releases reject positional ones;
        # charset='utf8mb4' lets the connection carry emoji.
        db = pymysql.connect(host="localhost", user="zww", password="960128", database="test", charset="utf8mb4")
        cursor = db.cursor()
        results = []
        try:
            cursor.execute(sql)    # run the query
            results = cursor.fetchall()
            print('query data success')
        except Exception as e:
            print('Error while querying:', e)
        finally:
            # close the database connection
            db.close()
        return results

    def insertData(self, sql):
        # open the database connection
        db = pymysql.connect(host="localhost", user="zww", password="000000", database="zwwdb", charset="utf8mb4")
        # create a cursor object with cursor()
        cursor = db.cursor()

        try:
            # sql = "INSERT INTO WEATHER(w_id, w_date, w_detail, w_temperature) VALUES (null, '%s','%s','%s')" % (data[0], data[1], data[2])
            cursor.execute(sql)    # insert a single row
            # commit so the database applies the change
            db.commit()
            print('insert data success')
        except Exception as e:
            print('Error while inserting:', e)
            # roll back on error
            db.rollback()
        finally:
            # close the database connection
            db.close()

    def patchInsertData(self, sql, datas):
        # open the database connection
        db = pymysql.connect(host="localhost", user="zww", password="960128", database="test", charset="utf8mb4")
        # create a cursor object with cursor()
        cursor = db.cursor()

        try:
            # batch insert
            # cursor.executemany('insert into WEATHER(w_id, w_date, w_detail, w_temperature_low, w_temperature_high) value(null, %s,%s,%s,%s)',datas)
            cursor.executemany(sql, datas)

            # commit so the database applies the change
            db.commit()
            print('insert data success')
        except Exception as e:
            print('Error while inserting:', e)
            # roll back on error
            db.rollback()
        finally:
            # close the database connection
            db.close()

Note the database configuration above; change it to match your own setup.

# -*- coding:utf-8 -*-
# Author: zww

from Include.commons.common import Common
from bs4 import BeautifulSoup
import json
import re
import pymysql

common = Common()

# Get the product ids
def getProductIdList():
    # q is the search keyword (here: underwear); change it to scrape any other product you're curious about
    url = 'https://list.tmall.com/search_product.htm?q=內衣'
    html = common.getUrlContent(url)  # fetch the page
    soup = BeautifulSoup(html, 'html.parser')
    idList = []
    # Use Beautiful Soup to extract every product id on the results page
    productList = soup.find_all('div', {'class': 'product'})
    for product in productList:
        if product.has_attr('data-id'):  # skip product divs that carry no id
            idList.append(product['data-id'])
    return idList

# Fetch one page of review data for a product
def getCommentDetail(itemId, currentPage):
    # itemId is the product id; sellerId (shop id) must be present, but any value works
    url = ('https://rate.tmall.com/list_detail_rate.htm?itemId=' + str(itemId)
           + '&sellerId=2451699564&order=3&currentPage=' + str(currentPage)
           + '&append=0&callback=jsonp336')
    html = common.getUrlContent(url)  # fetch the response body
    # Strip the JSONP wrapper, e.g. jsonp128({...}); the callback name may differ for you.
    # Slicing between the first '(' and the last ')' keeps parentheses inside review text intact.
    start = html.find('(')
    end = html.rfind(')')
    if start != -1 and end > start:
        html = html[start + 1:end]

    # Convert the JSON string into a dict (true/false are valid JSON and need no patching)
    tmalljson = json.loads(html)
    return tmalljson

# Get the last review page for a product
def getLastPage(itemId):
    tmalljson = getCommentDetail(itemId, 1)
    return tmalljson['rateDetail']['paginator']['lastPage']  # last page number

if __name__ == '__main__':
    productIdList = getProductIdList()  # get the product ids
    initial = 0
    while initial < len(productIdList) - 30:  # 60 products in total; I only take the first 30
        try:
            itemId = productIdList[initial]
            print('----------', itemId, '------------')
            maxPage = getLastPage(itemId)  # last review page for this product
            num = 1
            # At most 20 pages per product at 20 reviews per page, i.e. at most 400 reviews per product
            while num <= maxPage and num <= 20:
                try:
                    # Fetch one page of reviews for one product
                    tmalljson = getCommentDetail(itemId, num)
                    rateList = tmalljson['rateDetail']['rateList']
                    commentList = []
                    n = 0
                    while (n < len(rateList)):
                        comment = []
                        # SKU description: color and size, separated by ':' and ';'
                        colorSize = rateList[n]['auctionSku']
                        m = re.split('[:;]', colorSize)
                        rateContent = rateList[n]['rateContent']
                        dtime = rateList[n]['rateDate']
                        comment.append(m[1])  # color
                        comment.append(m[3])  # size
                        comment.append('天貓')  # data source: Tmall
                        comment.append(rateContent)
                        comment.append(dtime)
                        commentList.append(comment)
                        n += 1
                    print(num)
                    sql = "insert into bra(bra_id, bra_color, bra_size, resource, comment, comment_time) values(null, %s, %s, %s, %s, %s)"
                    common.patchInsertData(sql, commentList)  # batch insert into MySQL
                except Exception as e:
                    print(e)
                finally:
                    num += 1  # always advance to the next page
        except Exception as e:
            print(e)
        finally:
            initial += 1  # always advance, so one failing product cannot loop forever

Storing and analyzing the data

All the code is in place; only the database is missing. I'm using MySQL here.

CREATE TABLE `bra` (
`bra_id`  int(11) NOT NULL AUTO_INCREMENT COMMENT 'id' ,
`bra_color`  varchar(25) NULL COMMENT 'color' ,
`bra_size`  varchar(25) NULL COMMENT 'cup size' ,
`resource`  varchar(25) NULL COMMENT 'data source' ,
`comment`  varchar(500) CHARACTER SET utf8mb4 DEFAULT NULL COMMENT 'review' ,
`comment_time`  datetime NULL COMMENT 'review time' ,
PRIMARY KEY (`bra_id`)
) character set utf8
;

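Two things to note here. The comment column must use the utf8mb4 character set, because reviews may contain emoji. And the table must be set to utf8, otherwise it cannot store Chinese text.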
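The reason for utf8mb4 is byte width: MySQL's utf8 stores at most three bytes per character, while emoji need four. A quick check in Python shows the difference (the three sample characters are just my picks):

```python
# UTF-8 byte length of an ASCII letter, a Chinese character, and an emoji.
samples = {"ascii": "A", "cjk": "衣", "emoji": "😀"}
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(byte_lengths)
```

Chinese text fits in three bytes, so plain utf8 is enough for the color and size columns; only the free-text review column needs the four-byte variant.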

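Once the table is created, you can run the whole program. (This may take a while; you could make it multithreaded.) When it finishes, check whether the database has data.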
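As a rough idea of what a multithreaded version could look like, here is a sketch using concurrent.futures from the standard library. It is not the author's code: fetch_page is a stand-in that returns fake rows so the example runs offline, and in practice it would call getCommentDetail and build the comment rows.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(item_id, page):
    """Stand-in for the real page fetch; returns three fake rows per page."""
    return [(item_id, page, n) for n in range(3)]


def crawl_item(item_id, max_page, workers=4):
    """Fetch pages 1..max_page of one product concurrently and flatten the rows."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in page order, even if pages finish out of order.
        pages = pool.map(lambda p: fetch_page(item_id, p), range(1, max_page + 1))
        return [row for rows in pages for row in rows]


rows = crawl_item("1001", 5)  # "1001" is a made-up product id
print(len(rows))
```

Keep the worker count small; hammering the site with many parallel requests is a good way to get blocked.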

[image]

The data is there, but it contains some extra descriptive text we don't need, so we can tidy it up a bit.

update bra set bra_color = REPLACE(bra_color,'2B6521-無鋼圈4-','');
update bra set bra_color = REPLACE(bra_color,'-1','');
update bra set bra_color = REPLACE(bra_color,'5','');
update bra set bra_size = substr(bra_size,1,3);

Adjust these statements to fit your own data. Once the data is reasonably clean, we can analyze what's in the database.
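If you'd rather clean each row before it is inserted, the same steps can be sketched in Python. clean_row is a hypothetical helper and the sample row is invented; the junk strings are the ones from the update statements above.

```python
def clean_row(color, size):
    """Apply the same cleanup as the SQL updates: drop junk text, keep 3 chars of size."""
    for junk in ('2B6521-無鋼圈4-', '-1', '5'):
        color = color.replace(junk, '')
    return color, size[:3]  # same as substr(bra_size, 1, 3)


cleaned = clean_row('2B6521-無鋼圈4-膚色-1', '75B=34B')  # made-up raw values
print(cleaned)
```

Note that removing every '5' is as blunt here as it is in SQL; it only works if no real color name contains that digit.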

select 'A cup' as cup_size, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2) , "%") as pct, COUNT(*) as sales  from bra where bra_size like '%A'
union all select 'B cup' as cup_size, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2) , "%") as pct, COUNT(*) as sales  from bra where bra_size like '%B'
union all select 'C cup' as cup_size, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2) , "%") as pct, COUNT(*) as sales  from bra where bra_size like '%C'
union all select 'D cup' as cup_size, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2) , "%") as pct, COUNT(*) as sales  from bra where bra_size like '%D'
union all select 'E cup' as cup_size, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2) , "%") as pct, COUNT(*) as sales  from bra where bra_size like '%E'
union all select 'F cup' as cup_size, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2) , "%") as pct, COUNT(*) as sales  from bra where bra_size like '%F'
union all select 'G cup' as cup_size, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2) , "%") as pct, COUNT(*) as sales  from bra where bra_size like '%G'
union all select 'H cup' as cup_size, CONCAT(ROUND(COUNT(*)/(select count(*) from bra) * 100, 2) , "%") as pct, COUNT(*) as sales  from bra where bra_size like '%H'
order by sales desc;
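The same tally can be sketched in Python with collections.Counter, which is handy for checking the query on a few rows. The sample sizes are invented and cup_distribution is a hypothetical helper; like the SQL, it divides by the total row count, including rows with no recognizable cup letter.

```python
from collections import Counter


def cup_distribution(sizes):
    """Return [(cup, count, percent_of_all_rows)] sorted by count, descending."""
    total = len(sizes)
    counts = Counter(s[-1] for s in sizes if s and s[-1] in "ABCDEFGH")
    return [(cup, n, round(n / total * 100, 2)) for cup, n in counts.most_common()]


sizes = ["75B", "75B", "80B", "70A", "75C", "85"]  # made-up cleaned bra_size values
dist = cup_distribution(sizes)
print(dist)
```

Because the denominator includes unmatched rows such as "85", the percentages need not add up to 100, which is also true of the query.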

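[image]

(I'd like to know which 6 ladies bought the G (～￣▽￣)～)

Data visualization

To display the data I used the pyecharts module. If you're not familiar with it, you can read up on it at http://pyecharts.org/#/zh-cn/prepare

I won't go into detail here; here's the code.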

# encoding: utf-8
# author zww
# Note: this uses the pyecharts 0.x API (from pyecharts import Pie); later major versions changed it.

from pyecharts import Pie
from Include.commons.common import Common


if __name__ == '__main__':
    common = Common()
    results = common.queryData("""select count(*) from bra where bra_size like '%A'
            union all select count(*) from bra where bra_size like '%B'
            union all select count(*) from bra where bra_size like '%C'
            union all select count(*) from bra where bra_size like '%D'
            union all select count(*) from bra where bra_size like '%E'
            union all select count(*) from bra where bra_size like '%F'
            union all select count(*) from bra where bra_size like '%G'""")  # count per cup size
    attr = ["A cup", 'G cup', "B cup", "C cup", "D cup", "E cup", "F cup"]
    v1 = [results[0][0], results[6][0], results[1][0], results[2][0], results[3][0], results[4][0], results[5][0]]
    pie = Pie("Bra cup sizes", width=1300, height=620)
    pie.add("", attr, v1, is_label_show=True)
    pie.render('size.html')
    print('success')

    # The LIKE patterns stay in Chinese because the scraped color names are Chinese.
    results = common.queryData("""select count(*) from bra where bra_color like '%膚%'
        union all select count(*) from bra where bra_color like '%灰%'
        union all select count(*) from bra where bra_color like '%黑%'
        union all select count(*) from bra where bra_color like '%藍%'
        union all select count(*) from bra where bra_color like '%粉%'
        union all select count(*) from bra where bra_color like '%紅%'
        union all select count(*) from bra where bra_color like '%紫%'
        union all select count(*) from bra where bra_color like '%綠%'
        union all select count(*) from bra where bra_color like '%白%'
        union all select count(*) from bra where bra_color like '%褐%'
        union all select count(*) from bra where bra_color like '%黃%' """)  # count per color
    attr = ["nude", 'gray', "black", "blue", "pink", "red", "purple", 'green', "white", "brown", "yellow"]
    v1 = [results[0][0], results[1][0], results[2][0], results[3][0], results[4][0], results[5][0], results[6][0], results[7][0], results[8][0], results[9][0], results[10][0]]
    pieColor = Pie("Bra colors", width=1300, height=620)
    pieColor.add("", attr, v1, is_label_show=True)
    pieColor.render('color.html')
    print('success')

That's it for this chapter. You now know what you should know, and what you shouldn't know, too.

All the code is on GitHub: https://github.com/zwwjava/python_capture
