
Scraping Product Information from JD.com with Scrapy


Software environment:

gevent (1.2.2)
greenlet (0.4.12)
lxml (4.1.1)
pymongo (3.6.0)
pyOpenSSL (17.5.0)
requests (2.18.4)
Scrapy (1.5.0)
SQLAlchemy (1.2.0)
Twisted (17.9.0)
wheel (0.30.0)

1. Create the crawler project
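The article does not show the project-creation command. Judging from the ITEM_PIPELINES setting later on, the project appears to be named MyScrapy (an assumption), so creating it would look like:

scrapy startproject MyScrapy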

2. Create the JD spider. Go into the project directory and run:

scrapy genspider jd www.jd.com

This creates a .py file named after the spider under the spiders directory: jd.py. This file is where you write the spider's request and response logic.

3. Configure jd.py

The URL pattern of JD's search page:
https://search.jd.com/Search?
Since the keyword may be Chinese, it has to be URL-encoded.
1. First, write a start_requests method that sends the initial request and hands the result to the callback parse_index; the response passed to the callback is of type <class 'scrapy.http.response.html.HtmlResponse'>.
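The article does not show the top of jd.py. A minimal sketch of the imports and the spider class declaration that the following methods rely on (the project name MyScrapy comes from the settings below; the keyword value is only an illustrative assumption):

import json
from urllib.parse import urlencode

import requests
import scrapy

from MyScrapy.items import JdItem


class JdSpider(scrapy.Spider):
    name = "jd"
    keyword = "macbook pro"  # assumed search keyword; set it to whatever you want to search for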
    def start_requests(self):
        # e.g. https://www.amazon.cn/s/ref=nb_sb_ss_i_1_6?field-keywords=macbook+pro
        # Build a URL matching the search pattern, wrap it in scrapy.Request,
        # and let the parse_index callback handle the response.
        url = "https://search.jd.com/Search?"
        # Note: no "=" goes after the "?" when appending the query string
        url += urlencode({"keyword": self.keyword, "enc": "utf-8"})
        yield scrapy.Request(url,
                             callback=self.parse_index,
                             )
2. parse_index extracts every product detail-page URL from the response, iterates over them to send a request for each, and uses parse_detail as the callback to process the results.
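The article omits the parse_index code itself. A minimal sketch, assuming the detail links can be picked out of the result list via the p-img blocks (the selector is an assumption and may need adjusting to the live page):

    def parse_index(self, response):
        """
        Receive the search-result response and request every product detail page.
        """
        urls = response.xpath('//div[@class="p-img"]/a/@href').extract()
        for url in urls:
            yield scrapy.Request(response.urljoin(url),
                                 callback=self.parse_detail)

parse_detail, shown next, then parses each detail page.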
    def parse_detail(self, response):
        """
        Callback for the requests sent from parse_index; parses the
        product detail-page response.
        :param response:
        :return:
        """
        jd_url = response.url
        sku = jd_url.split("/")[-1].strip(".html")
        # The price is fetched via JSONP; its request URL can be found among
        # the script entries in the browser's developer tools.
        price_url = "https://p.3.cn/prices/mgets?skuIds=J_" + sku
        response_price = requests.get(price_url)
        # extraParam={"originid":"1"}  skuIds=J_3726834
        # Shipping information is also requested via JSONP. I have not yet figured out
        # how its parameters are generated, so a fixed parameter is used here;
        # if anyone knows how to obtain it, please let me know.
        express_url = "https://c0.3.cn/stock?skuId=3726834&area=1_72_4137_0&cat=9987,653,655&extraParam={%22originid%22:%221%22}"
        response_express = requests.get(express_url)
        response_express = json.loads(response_express.text)["stock"]["serviceInfo"].split(">")[1].split("<")[0]
        title = response.xpath('//*[@class="sku-name"]/text()').extract_first().strip()
        price = json.loads(response_price.text)[0]["p"]
        delivery_method = response_express
        # Store the extracted data in an Item so it can be persisted later
        item = JdItem()
        item["title"] = title
        item["price"] = price
        item["delivery_method"] = delivery_method

        # Return the item; when the engine sees an Item it hands it to the pipelines
        return item

4. Configure items.py

import scrapy


class JdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    delivery_method = scrapy.Field()

5. Configure pipelines.py

from pymongo import MongoClient


class MongoPipeline(object):
    """
    Pipeline that saves items to MongoDB.
    """

    def __init__(self, db, collection, host, port, user, pwd):
        """
        Store the database connection settings.
        :param db: database name
        :param collection: collection (table) name
        :param host: the server IP
        :param port: the server port
        :param user: the username for login
        :param pwd: the password for login
        """
        self.db = db
        self.collection = collection
        self.host = host
        self.port = port
        self.user = user
        self.pwd = pwd

    @classmethod
    def from_crawler(cls, crawler):
        """
        This classmethod is used to read the configuration from settings.
        :param crawler:
        :return:
        """
        db = crawler.settings.get("DB")
        collection = crawler.settings.get("COLLECTION")
        host = crawler.settings.get("HOST")
        port = crawler.settings.get("PORT")
        user = crawler.settings.get("USER")
        pwd = crawler.settings.get("PWD")

        return cls(db, collection, host, port, user, pwd)

    def open_spider(self, spider):
        """
        Runs once when the spider starts.
        :param spider:
        :return:
        """
        # connect to the database
        self.client = MongoClient("mongodb://%s:%s@%s:%s" % (
            self.user,
            self.pwd,
            self.host,
            self.port
        ))

    def process_item(self, item, spider):
        """
        Store the data in the database.
        :param item:
        :param spider:
        :return:
        """
        # convert the item into a dict
        d = dict(item)
        # skip records that contain empty values
        if all(d.values()):
            # save to MongoDB
            self.client[self.db][self.collection].save(d)
        return item

        # raise DropItem() would discard the item so later pipelines never see it
        # raise DropItem()
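One thing the pipeline above never does is close the MongoDB connection. A small optional addition (not part of the original article) that releases it when the spider finishes:

    def close_spider(self, spider):
        """
        Runs once when the spider closes; release the MongoDB connection.
        """
        self.client.close()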

6. Configuration file (settings.py)

# database server
DB = "jd"
COLLECTION = "goods"
HOST = "127.0.0.1"
PORT = 27017
USER = "root"
PWD = "123"
ITEM_PIPELINES = {
   "MyScrapy.pipelines.MongoPipeline": 300,
}
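With everything in place, start the crawl from the project root using the spider name that was given to genspider:

scrapy crawl jd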
