Crawling JD.com product reviews with the Scrapy framework

阿新 • Published: 2019-02-19

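I. Introduction to Scrapy
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.
A web crawler is a program that fetches data from the web, either broadly or from targeted sites; more precisely, it fetches the HTML of pages on specific websites. The usual approach is to define an entry page. A page typically contains URLs of other pages, so the crawler extracts those URLs from the current page and adds them to its crawl queue, then repeats the same steps on each new page. In effect this is the same as a depth-first or breadth-first traversal.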
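
The queue-based traversal described above can be sketched without any network access. The following is a minimal illustration, assuming a hypothetical in-memory LINKS dictionary that stands in for real pages and the URLs found on them; it is not how Scrapy schedules requests internally.

```python
from collections import deque

# Hypothetical link graph: each page maps to the URLs found on it.
LINKS = {
    "/index": ["/a", "/b"],
    "/a": ["/c", "/index"],
    "/b": ["/c"],
    "/c": [],
}


def crawl_order(start, links):
    """Return pages in breadth-first visiting order, skipping duplicates."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()      # oldest queued URL first (FIFO = breadth-first)
        order.append(page)
        for url in links.get(page, []):
            if url not in seen:     # never queue the same URL twice
                seen.add(url)
                queue.append(url)
    return order


print(crawl_order("/index", LINKS))  # ['/index', '/a', '/b', '/c']
```

Replacing popleft() with pop() turns the queue into a stack, which gives a depth-first order instead.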

The overall layout of a Scrapy project is as follows:

scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

II. The crawling process

1. Create the project tutorial
Run scrapy startproject tutorial in a terminal (tutorial is the project name) to create a Scrapy crawler project:
C:\Users\yanyan> scrapy startproject tutorial
2015-06-10 15:45:03 [scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
2015-06-10 15:45:03 [scrapy] INFO: Optional features available: ssl, http11
2015-06-10 15:45:03 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'tutorial' created in:
/mnt/hgfs/share/tutorial
You can start your first spider with:
cd tutorial
scrapy genspider example example.com

2. Look at the project structure
[root@bogon share]# tree tutorial/
tutorial/
├── tutorial
│   ├── __init__.py
│   ├── items.py        # defines the fields to extract from pages
│   ├── pipelines.py    # processes the extracted data
│   ├── settings.py     # crawler settings
│   └── spiders
│       └── __init__.py
└── scrapy.cfg          # project configuration file

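3. Define the data to extract by editing items.py (every field you want to extract is declared in items.py)
Here we extract the following:
user_name = Field() # name of the reviewing user
user_ID = Field() # ID of the reviewing user
userProvince = Field() # region the reviewing user is from
content = Field() # review text
good_ID = Field() # ID of the reviewed product
good_name = Field() # name of the reviewed product
date = Field() # review time
replyCount = Field() # number of replies
score = Field() # rating
status = Field() # status
title = Field()
userLevelId = Field()
userRegisterTime = Field() # user registration time
productColor = Field() # product color
productSize = Field() # product size
userLevelName = Field() # membership level, e.g. silver member, diamond member
userClientShow = Field() # where the review came from, e.g. the JD client app
isMobile = Field() # whether the review came from a mobile device
days = Field() # number of days
commentTags = Field() # tags
For details, see the commentItem(Item) class in items.py at https://github.com/xiaoquantou/jd_spider/tree/master/jd_spider .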
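
To illustrate what declaring fields does, here is a plain-Python stand-in (deliberately not Scrapy's Item class, so that it runs without Scrapy installed). The class name DeclaredFields and the sample values are hypothetical. Scrapy's Item behaves similarly in one respect: assigning to a field that was not declared raises KeyError.

```python
class DeclaredFields(dict):
    """Plain-Python stand-in for an item that only accepts declared field names."""

    # A subset of the field names listed above.
    fields = {"user_name", "user_ID", "content", "good_ID", "date", "score"}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("undeclared field: %s" % key)
        super().__setitem__(key, value)


item = DeclaredFields()
item["content"] = "good"    # hypothetical sample values
item["score"] = 5

try:
    item["price"] = 10      # "price" was never declared
    rejected = False
except KeyError:
    rejected = True

print(dict(item), rejected)
```

Catching typos in field names early is the main practical benefit of declaring the fields up front.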

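4. Create the spider
The spider file goes in the ..\tutorial\tutorial\spiders directory.
For the JD product review spider, see jd_comment.py at https://github.com/xiaoquantou/jd_spider/tree/master/jd_spider/spiders .

A spider is a class you write to crawl data from a single website (or a few websites).
It contains the initial URLs to download, the logic for following links in pages, and the methods that parse page content and produce items.
To create a spider, you must subclass scrapy.Spider and define the following three attributes:
name: identifies the spider. The name must be unique; different spiders cannot share the same name.
start_urls: the list of URLs the spider starts crawling from. The first pages fetched will therefore come from this list. Subsequent URLs are extracted from the data retrieved from the initial URLs.
parse(): a method of the spider. When called, the Response object produced once each initial URL has been downloaded is passed to it as its only argument. The method is responsible for parsing the response data, extracting data (producing items), and producing Request objects for URLs that need further processing.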

A typical spider file looks like this:
import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"  # unique identifier; this is the name passed when starting the spider
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save each downloaded page under the last path segment of its URL
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
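
The filename expression in parse() is easy to misread, so here is a small standalone check of what it evaluates to for the first start URL. Because the URL ends with a slash, the last element of the split is an empty string, which is why the example takes index -2.

```python
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

parts = url.split("/")
filename = parts[-2]   # parts[-1] is "" because the URL ends with "/"

print(parts[-1] == "", filename)  # True Books
```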

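5. Edit pipelines.py
pipelines.py processes the crawled data. The data can be stored in a database or in a file; you define the storage method yourself in this file.
The following example saves the crawled product reviews in JSON format:
# -*- coding: utf-8 -*-
import codecs
import json


class JsonWithEncodingCnblogsPipeline(object):
    def __init__(self):
        # Open the output file once, when the pipeline is created
        self.file = codecs.open('tutorial.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one JSON object per line; ensure_ascii=False keeps Chinese text readable
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider() on each pipeline when the spider finishes
        self.file.close()
Note that the class name is JsonWithEncodingCnblogsPipeline; it is used in settings.py.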
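
The ensure_ascii=False argument matters for Chinese review text. As a small standalone illustration (the sample record is hypothetical), compare the default output of json.dumps with the form the pipeline writes:

```python
import json

record = {"content": "很好", "score": 5}   # hypothetical sample item

escaped = json.dumps(record)                        # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)   # what the pipeline uses

print(escaped)    # {"content": "\u5f88\u597d", "score": 5}
print(readable)   # {"content": "很好", "score": 5}
```

Both strings decode to the same data with json.loads; the difference is only whether the file is readable when opened in a text editor.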

6. Edit settings.py and add the following two settings
ITEM_PIPELINES = {
'tutorial.pipelines.JsonWithEncodingCnblogsPipeline': 300,
}
LOG_LEVEL = 'INFO'
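
The integer assigned to each entry in ITEM_PIPELINES sets the order in which pipelines run: items pass through pipelines from the lowest value to the highest, and values are conventionally chosen in the 0 to 1000 range. The sketch below only demonstrates that ordering; the second pipeline name is invented for illustration and does not exist in this project.

```python
# Hypothetical settings: the second entry is invented for illustration.
ITEM_PIPELINES = {
    'tutorial.pipelines.JsonWithEncodingCnblogsPipeline': 300,
    'tutorial.pipelines.DropEmptyCommentPipeline': 100,
}

# Items pass through pipelines in ascending order of their value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With a single pipeline, as in this tutorial, the exact number has no effect.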

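7. Run the spider to start the crawl
In a terminal, change into the crawler project directory above, then run scrapy crawl followed by the spider name (the name defined in tutorial_spider.py):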

[root@bogon tutorial]# cd C:\Users\yanyan\tutorial
C:\Users\yanyan\tutorial> scrapy crawl comment

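8. View the results
The crawled data is saved in tutorial.json (the file name set in pipelines.py) under ..\tutorial. Open it in Sublime Text to see the JSON data.

9. If needed, the results can be converted to plain text; see the separate article on converting JSON data to text or SQL files with Python.
The source code can be downloaded here: https://github.com/jackgitgz/tutorialSpider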

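10. Extract the review text and score fields from the results and save them to txt and Excel files. The code is as follows:
# -*- coding: utf-8 -*-
import codecs
import json

# Read the JSON data (one JSON object per line)
data = []
# Raw string, so the backslashes in the Windows path are not treated as escapes
with codecs.open(r'C:\Users\yanyan\tutorial\tutorial.json', 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))
# No explicit f.close() is needed: the with block closes the file

# Save as a txt file
# file_object = codecs.open('comment.txt', 'w', "utf-8")
# for item in data:
#     text = "%s#_#%s\r\n" % (item['content'], item['score'])
#     file_object.write(text)
# file_object.close()

# Save as an Excel file
import xlwt
workbook = xlwt.Workbook()  # note the capital W in Workbook
table = workbook.add_sheet('sheet 1')
# table.write(row, column, value)
for row, item in enumerate(data):
    table.write(row, 0, item['content'])
    table.write(row, 1, item['score'])
workbook.save('comment.xls')  # xlwt writes the legacy .xls format, not .xlsx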
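
If xlwt is not available, the same two columns could be written as CSV using only the standard library, and Excel can open the result. This is a sketch under assumptions: the records below are hypothetical stand-ins for the parsed lines of tutorial.json, and to_csv is a helper name introduced here.

```python
import csv
import io

# Hypothetical records shaped like the parsed lines of tutorial.json.
data = [
    {"content": "good phone", "score": 5},
    {"content": "battery drains fast", "score": 2},
]


def to_csv(records):
    """Return the content and score columns as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for item in records:
        writer.writerow([item["content"], item["score"]])
    return buffer.getvalue()


csv_text = to_csv(data)
print(csv_text)
```

When writing csv_text to disk for Excel, opening the output file with encoding='utf-8-sig' typically helps Excel detect the encoding of Chinese text.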

References: http://www.jianshu.com/p/a8aad3bf4dc4

http://www.cnblogs.com/rwxwsblog/p/4567052.html

https://github.com/jackgitgz/CnblogsSpider/blob/master/json2txt.py

http://blog.csdn.net/xiaoquantouer/article/details/51840332

https://github.com/xiaoquantou/jd_spider
