
Python Crawler Series: A JD.com Product Crawler

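Goal: scrape product information from the mobile phone channel of JD.com: name, price, number of comments, shop name, and so on.
Two problems need to be solved first.
1. Downloading and saving the phone images
2. Fetching and saving the phone prices (prices are loaded asynchronously, so they cannot be read directly from the page source)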

Downloading and saving an image

import requests
url="https://img13.360buyimg.com/n7/jfs/t3391/79/1963324994/297093/187de6d4/583ced0fN27e50577.jpg"
res=requests.get(url)

# res.content holds the raw bytes, so the file is opened in binary mode
with open("E:\\jupyter-notebook\\PyCrawler\\jd1.jpg","wb") as fd:
    fd.write(res.content)
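
The snippet above fails if the target folder does not exist yet. As a minimal sketch (the helper name `save_image` is hypothetical, and the bytes in the demo are a placeholder rather than a real JPEG), the write step could be wrapped so that missing folders are created first:

```python
from pathlib import Path
import tempfile


def save_image(content, path):
    """Write raw image bytes to path, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return target


# Demo with placeholder bytes in a temporary folder; in the crawler the
# bytes would come from res.content instead.
demo_dir = Path(tempfile.mkdtemp())
saved = save_image(b"not-a-real-jpeg", demo_dir / "jdpic" / "1.jpg")
print(saved.exists())
```

In the crawler this would be called as `save_image(res.content, "E:\\jupyter-notebook\\PyCrawler\\jd1.jpg")`.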

Asynchronously loaded data: extracting price information from JD.com

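import re
# the price is served by a separate endpoint; J_5089253 is the product's SKU id
url="https://p.3.cn/prices/mgets?callback=jQuery6775278&skuids=J_5089253"
res=requests.get(url)
# capture the value of the "p" field in the response
pat='"p":"(.*?)"}'
price=re.compile(pat).findall(res.text)
print(price)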
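
The regular expression above only matches when `"p"` is the last field of the object. Since the URL passes a `callback` parameter, the response is presumably JSONP (a JSON payload wrapped in a function call), so an alternative is to strip the wrapper and parse the JSON. The sketch below assumes that shape; `parse_jsonp` is a hypothetical helper and the sample string is made up for illustration, not a recorded response.

```python
import json
import re


def parse_jsonp(text):
    """Strip a callback(...) wrapper and parse the JSON payload inside it."""
    match = re.search(r'^[^(]*\((.*)\)\s*;?\s*$', text.strip(), re.S)
    if match is None:
        raise ValueError("text does not look like JSONP")
    return json.loads(match.group(1))


# Made-up sample in the assumed shape of the price response.
sample = 'jQuery6775278([{"id":"J_5089253","p":"3299.00"}]);'
items = parse_jsonp(sample)
print(items[0]["p"])
```

With the real response, `parse_jsonp(res.text)` would take the place of the `findall` call, and the price could be read by key regardless of field order.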

Collecting phone images from JD.com

url="https://list.jd.com/list.html?cat=9987,653,655"
res=requests.get(url)
# the real image address is stored in data-lazy-img as a protocol-relative URL
imagepat='<img width="220" height="220" data-img="1" data-lazy-img="//(.*?)">'
imagelist=re.compile(imagepat).findall(res.text)
print(imagelist)
x=1
for imageurl in imagelist:
    imagename="E:\\jupyter-notebook\\PyCrawler\\jdpic\\"+str(x)+".jpg"
    x+=1
    imageurl="http://"+imageurl
    res=requests.get(imageurl)
    with open(imagename,'wb') as fd:
        fd.write(res.content)
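
The image addresses in the page start with `//`, so the code above drops those two characters in the pattern and prepends `http://` by hand. A hedged alternative is to capture the attribute value whole and let `urllib.parse.urljoin` resolve it against the page URL. In this sketch, `extract_image_urls` is a hypothetical helper and the HTML fragment is a made-up sample in the shape the pattern above expects.

```python
import re
from urllib.parse import urljoin


def extract_image_urls(html, base_url):
    """Return absolute URLs for every data-lazy-img attribute in html."""
    raw_urls = re.findall(r'data-lazy-img="(.*?)"', html)
    # urljoin fills in the scheme of base_url for values that start with //
    return [urljoin(base_url, raw) for raw in raw_urls]


# Made-up fragment for illustration only.
sample_html = (
    '<img width="220" height="220" data-img="1" '
    'data-lazy-img="//img13.360buyimg.com/n7/jfs/a.jpg">'
)
page_url = "https://list.jd.com/list.html?cat=9987,653,655"
image_urls = extract_image_urls(sample_html, page_url)
print(image_urls)
```

The resulting list can be looped over exactly like `imagelist`, without the manual `"http://"+imageurl` step.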

The complete code is as follows

# JD.com phone data collection: name, price, number of comments, shop name, etc.
import re
import requests
from lxml import etree
from pandas import DataFrame
import pandas as pd

jdInfoAll=DataFrame()
for i in range(1,4):
    url="https://list.jd.com/list.html?cat=9987,653,655&page="+str(i)
    res=requests.get(url)
    res.encoding='utf-8'
    root=etree.HTML(res.text)
    name=root.xpath('//li[@class="gl-item"]//div[@class="p-name"]/a/em/text()')
    # strip all whitespace from each product name
    for j in range(0,len(name)):
        name[j]=re.sub(r'\s','',name[j])

    #sku
    sku=root.xpath('//li[@class="gl-item"]/div/@data-sku')

    # price and comment count (one request each per SKU)
    price=[]
    comment=[]
    for thissku in sku:
        priceurl="https://p.3.cn/prices/mgets?callback=jQuery6775278&skuids=J_"+str(thissku)
        pricedata=requests.get(priceurl)
        pricepat='"p":"(.*?)"}'
        thisprice=re.compile(pricepat).findall(pricedata.text)   
        price=price+thisprice

        commenturl="https://club.jd.com/comment/productCommentSummaries.action?my=pinglun&referenceIds="+str(thissku)
        commentdata=requests.get(commenturl)
        commentpat='"CommentCount":(.*?),"'
        thiscomment=re.compile(commentpat).findall(commentdata.text)
        comment=comment+thiscomment

    # shop name
    shopname=root.xpath('//li[@class="gl-item"]//div[@class="p-shop"]/@data-shop_name')
    print(shopname)

    jdInfo=DataFrame([name,price,shopname,comment]).T
    jdInfo.columns=['Product name','Price','Shop name','Comment count']
    jdInfoAll=pd.concat([jdInfoAll,jdInfo])
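# recent pandas releases no longer write the legacy .xls format, so save as .xlsx (needs openpyxl)
jdInfoAll.to_excel('jdInfoAll.xlsx')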
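
One caveat with the script above: `price=price+thisprice` appends whatever `findall` returned, so if a single lookup returns an empty list, every later price moves up one row and no longer lines up with its product name. A minimal sketch of one way around this is to build one record per product and store `None` for a missing value. The helper names `first_or_none` and `build_row` are hypothetical, and the lists below are made-up stand-ins for the scraped values.

```python
def first_or_none(matches):
    """Return the first regex match, or None when nothing was found."""
    return matches[0] if matches else None


def build_row(name, shop, price_matches, comment_matches):
    """Collect the fields of one product into a single record."""
    return {
        "name": name,
        "price": first_or_none(price_matches),
        "shop": shop,
        "comments": first_or_none(comment_matches),
    }


# Made-up findall() results for three products; the second price lookup is empty.
names = ["phone A", "phone B", "phone C"]
shops = ["shop 1", "shop 2", "shop 3"]
price_results = [["3299.00"], [], ["1999.00"]]
comment_results = [["120"], ["45"], ["8"]]

rows = [
    build_row(n, s, p, c)
    for n, s, p, c in zip(names, shops, price_results, comment_results)
]
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `DataFrame(rows)`, which keeps each price next to the product it belongs to.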