一個完整的大作業:淘寶口紅銷量top10的銷量和評價
阿新 • • 發佈:2017-11-02
gen 匹配 我們 es2017 對象 啟用 網站 rgs cep
網站:淘寶口紅搜索頁
https://s.taobao.com/search?q=%E5%8F%A3%E7%BA%A2&sort=sale-desc
先爬取該頁面前十的口紅的商品名、銷售量、價格、評分以及評論數,發現該網頁使用了json的方式,使用正則表達式匹配字段,抓取我們
所需要的信息。啟用用戶代理爬取數據,預防該網站的反爬手段,並把結果存入到csv文件中,效果如下。
成功爬取到淘寶口紅top10的基本信息後,發現評論並不在同一頁面上,並且該頁面存在著進入評論頁的關鍵字,爬取下來後放入一個列表中,然後循環遍歷整個列表和頁數,使用
正則表達式,匹配評論的關鍵字,成功爬取淘寶top10口紅的評論近十萬條,如下圖所示。
完整的源代碼如下:
"""Scrape Taobao's top-10 best-selling lipsticks.

Writes basic product info (name, link, sales, price, location, rating,
review count) to 商品.csv and, optionally, the per-item reviews to 評價.csv.
"""
from urllib import request, error
import re
import csv
import time

# Shared state: populated by get_product(), consumed by get_product_comment().
itemId = []    # product ids of the top-10 items
sellerId = []  # seller ids (the review endpoint requires both ids)
links = []     # kept for interface compatibility with the original script
titles = []    # product titles, parallel to itemId

# Browser User-Agent so the site serves the normal (scrapable) page.
USER_AGENT = ('Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) '
              'AppleWebKit/535.19 (KHTML, like Gecko) '
              'Chrome/18.0.1025.166 Safari/535.19')


def get_product_info():
    """Create both output CSV files and write their header rows."""
    with open('商品.csv', 'w', newline='') as f:
        csv.writer(f).writerow(
            ['商品名', '連接', '銷售量', '價格', '地址', '商品評分', '評論總數'])
    with open('評價.csv', 'w', newline='') as f:
        csv.writer(f).writerow(['商品id', '商品名', '時間', '顏色分類', '評價'])


def _fetch(url, encoding='utf-8'):
    """GET *url* with the browser User-Agent and return the decoded body.

    Raises urllib.error.URLError / HTTPError (and OSError subclasses such
    as TimeoutError) on network failure; callers handle those.
    """
    req = request.Request(url, headers={'User-Agent': USER_AGENT})
    with request.urlopen(req, timeout=30) as resp:
        return resp.read().decode(encoding)


def get_product():
    """Scrape the search page for the top-10 items' basic info.

    Appends each item's id, seller id and title to the module-level lists
    and writes one row per item to 商品.csv.
    """
    url = 'https://s.taobao.com/search?q=%E5%8F%A3%E7%BA%A2&sort=sale-desc'
    try:
        html = _fetch(url)
    except error.HTTPError as e:      # must precede URLError (subclass)
        print(e.code)
        return
    except error.URLError as e:
        print(e.reason)
        return
    # The search results are embedded as JSON in the page; pull each field
    # out with a non-greedy capture.
    all_id = re.findall(r'"nid":"(.*?)"', html)
    all_title = re.findall(r'"raw_title":"(.*?)"', html)
    all_price = re.findall(r'"view_price":"(.*?)"', html)
    all_sales = re.findall(r'"view_sales":"(.*?)"', html)
    all_loc = re.findall(r'"item_loc":"(.*?)"', html)
    all_userid = re.findall(r'"user_id":"(.*?)"', html)
    print("開始收集信息")
    try:
        for i in range(10):
            pid = str(all_id[i])
            link = 'https://item.taobao.com/item.htm?id=' + pid
            shoplink = ('https://dsr-rate.tmall.com/list_dsr_info.htm?itemId='
                        + pid)
            # Ratings live on a separate JSON endpoint, e.g.
            # {"gradeAvg":4.9,"rateTotal":12345,...} — capture up to the comma.
            html2 = _fetch(shoplink)
            grades = re.findall(r'"gradeAvg":(.*?),', html2)
            rates = re.findall(r'"rateTotal":(.*?),', html2)
            grade_avg = grades[0] if grades else ''
            rate_total = rates[0] if rates else ''
            itemId.append(pid)
            sellerId.append(str(all_userid[i]))
            titles.append(str(all_title[i]))
            links.append(link)
            with open('商品.csv', 'a', newline='') as f:
                csv.writer(f).writerow(
                    [all_title[i], link, all_sales[i], all_price[i],
                     all_loc[i], grade_avg, rate_total])
    except error.HTTPError as e:      # subclass of URLError: catch first
        print(e.code)
    except error.URLError as e:
        print(e.reason)
    except (IndexError, UnicodeEncodeError, TimeoutError) as e:
        print(e.args)
    except IOError as e:
        print(e)
    print("商品基本信息收集完畢")


def get_product_comment():
    """Scrape up to 550 pages of reviews for each of the top-10 items.

    Requires get_product() to have populated itemId/sellerId/titles.
    Appends one row per review to 評價.csv.
    """
    for i in range(10):
        print("正在收集第{}件商品評論".format(i + 1))
        for page in range(1, 551):
            # NOTE: the original URL contained '¤tPage', an HTML-entity
            # mangling of '&currentPage' — fixed here.
            detaillink = ('https://rate.tmall.com/list_detail_rate.htm?itemId='
                          + itemId[i] + '&sellerId=' + sellerId[i]
                          + '&currentPage=' + str(page))
            try:
                html = _fetch(detaillink, encoding='gbk')
                dates = re.findall(r'"rateDate":"(.*?)"', html)
                contents = re.findall(r'"rateContent":"(.*?)"', html)
                skus = re.findall(r'"auctionSku":"(.*?)"', html)
                with open('評價.csv', 'a', newline='') as f:
                    writer = csv.writer(f)
                    # zip stops at the shortest list, avoiding the original's
                    # potential IndexError when counts differ.
                    for date, content, sku in zip(dates, contents, skus):
                        writer.writerow(
                            [itemId[i] + "\t", titles[i], date, sku, content])
            except error.HTTPError as e:
                print(e.code)
            except error.URLError as e:
                print(e.reason)
            except (IndexError, UnicodeEncodeError, TimeoutError) as e:
                print(e.args)
            except IOError as e:
                print(e)
        print("第{}件商品評論收集完成".format(i + 1))


if __name__ == "__main__":
    start = time.time()
    get_product_info()
    get_product()
    # get_product_comment()
    total = time.time() - start
    print('本次爬行用時:{:.2f}s!'.format(total))
一個完整的大作業:淘寶口紅銷量top10的銷量和評價