Crawling product information from JD.com with Scrapy
阿新 • Published: 2018-01-25
Software environment:

```
gevent (1.2.2)
greenlet (0.4.12)
lxml (4.1.1)
pymongo (3.6.0)
pyOpenSSL (17.5.0)
requests (2.18.4)
Scrapy (1.5.0)
SQLAlchemy (1.2.0)
Twisted (17.9.0)
wheel (0.30.0)
```
1. Create the crawler project with `scrapy startproject` (the settings below assume the project is named MyScrapy)
2. Create the JD spider. Enter the project directory and run:
scrapy genspider jd www.jd.com
This creates a .py file under the spiders directory named after your spider: jd.py. This file is where you write the spider's request and response logic.
3. Configure jd.py
The JD search URL follows the pattern https://search.jd.com/Search?. Because the keyword may be Chinese, it has to be urlencoded. 1. First, write a start_requests method that sends the initial request and registers parse_index as the callback; the response passed to the callback is of type <class 'scrapy.http.response.html.HtmlResponse'>.
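The urlencode step can be checked on its own before wiring it into the spider. With a Chinese keyword such as 手機 (an example keyword, not from the post), it produces percent-encoded UTF-8 bytes:

```python
from urllib.parse import urlencode

# "手機" ("mobile phone") is an example keyword; any Chinese string works
url = 'https://search.jd.com/Search?'
url += urlencode({"keyword": "手機", "enc": "utf-8"})
print(url)  # https://search.jd.com/Search?keyword=%E6%89%8B%E6%A9%9F&enc=utf-8
```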
```python
# Imports jd.py needs; the items path follows the MyScrapy project
# name used in the settings below. The methods shown in this post
# belong to the spider class that genspider created in jd.py.
import json
from urllib.parse import urlencode

import requests
import scrapy

from MyScrapy.items import JdItem


def start_requests(self):
    # Build a search URL of the form:
    # https://search.jd.com/Search?keyword=...&enc=utf-8
    url = 'https://search.jd.com/Search?'
    # The keyword may be Chinese, so it has to be urlencoded
    url += urlencode({"keyword": self.keyword, "enc": "utf-8"})
    # Wrap it in scrapy.Request and register parse_index as the callback;
    # the response will be passed to it
    yield scrapy.Request(url,
                         callback=self.parse_index,
                         )
```

2. parse_index extracts every product detail-page URL from the response, loops over them sending a request for each, and registers parse_detail as the callback to handle the results.
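The body of parse_index is not shown above; its core job is just pulling the detail-page links out of the results page. That part can be sketched with the standard library alone (the HTML snippet and the URL pattern are assumptions about JD's markup, which is more complex and partly rendered by JavaScript):

```python
import re

# A trimmed search-results snippet (structure is an assumption,
# not real JD markup)
html = '''
<div class="p-img"><a href="//item.jd.com/3726834.html">item</a></div>
<div class="p-img"><a href="//item.jd.com/100012043978.html">item</a></div>
'''

# Pull every detail-page href out of the results page,
# then complete the protocol-relative URLs
urls = re.findall(r'href="(//item\.jd\.com/\d+\.html)"', html)
detail_urls = ["https:" + u for u in urls]
print(detail_urls)
```

In the spider itself, each of these URLs would be wrapped in a scrapy.Request with callback=self.parse_detail.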
```python
def parse_detail(self, response):
    """
    Callback for parse_index: receives the detail-page response and parses it
    :param response:
    :return:
    """
    jd_url = response.url
    # str.strip(".html") strips any of the characters . h t m l from both
    # ends, which happens to work here because JD SKUs are purely numeric
    sku = jd_url.split('/')[-1].strip(".html")
    # The price is fetched via JSONP; the request URL can be found among the
    # script entries in the browser's developer tools
    price_url = "https://p.3.cn/prices/mgets?skuIds=J_" + sku
    response_price = requests.get(price_url)
    # extraParam={"originid":"1"}  skuIds=J_3726834
    # Delivery info is also fetched via JSONP, but I haven't figured out how
    # its parameters are generated, so a fixed parameter is used here; if
    # anyone knows, please share
    express_url = "https://c0.3.cn/stock?skuId=3726834&area=1_72_4137_0&cat=9987,653,655&extraParam={%22originid%22:%221%22}"
    response_express = requests.get(express_url)
    response_express = json.loads(response_express.text)['stock']['serviceInfo'].split('>')[1].split('<')[0]
    title = response.xpath('//*[@class="sku-name"]/text()').extract_first().strip()
    price = json.loads(response_price.text)[0]['p']
    delivery_method = response_express
    # Save the fields we need into an Item for later storage
    # (items.py defines JdItem, so use that class)
    item = JdItem()
    item['title'] = title
    item['price'] = price
    item['delivery_method'] = delivery_method

    # Return the item; when the return value is an Item, the engine
    # detects it and hands it to the pipelines
    return item
```
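The SKU extraction and the two JSONP parses can be exercised offline with canned responses in the shape parse_detail expects (the field values below are made up, not real JD data):

```python
import json

# SKU extraction: str.strip(".html") removes the characters . h t m l
# from both ends, which works because JD SKUs are purely numeric
jd_url = "https://item.jd.com/3726834.html"
sku = jd_url.split('/')[-1].strip(".html")

# Canned price response (made-up values) in the list-of-dicts shape
# the p.3.cn endpoint returns
price_text = '[{"id": "J_3726834", "p": "6299.00", "m": "9999.00"}]'
price = json.loads(price_text)[0]['p']

# Canned stock response: serviceInfo holds an HTML fragment, and the
# text between the first ">" and the next "<" is the delivery method
express_text = '{"stock": {"serviceInfo": "<a>京東發貨</a>"}}'
info = json.loads(express_text)['stock']['serviceInfo']
delivery_method = info.split('>')[1].split('<')[0]

print(sku, price, delivery_method)  # 3726834 6299.00 京東發貨
```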
4. Configure items.py
```python
import scrapy


class JdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    delivery_method = scrapy.Field()
```
5. Configure pipelines.py
```python
from pymongo import MongoClient


class MongoPipeline(object):
    """
    Pipeline that saves items to MongoDB
    """

    def __init__(self, db, collection, host, port, user, pwd):
        """
        Store the connection parameters
        :param db: database name
        :param collection: collection (table) name
        :param host: the server's IP
        :param port: the server's port
        :param user: the username for login
        :param pwd: the password for login
        """
        self.db = db
        self.collection = collection
        self.host = host
        self.port = port
        self.user = user
        self.pwd = pwd

    @classmethod
    def from_crawler(cls, crawler):
        """
        This classmethod is used to read the configuration from settings
        :param crawler:
        :return:
        """
        db = crawler.settings.get('DB')
        collection = crawler.settings.get('COLLECTION')
        host = crawler.settings.get('HOST')
        port = crawler.settings.get('PORT')
        user = crawler.settings.get('USER')
        pwd = crawler.settings.get('PWD')

        return cls(db, collection, host, port, user, pwd)

    def open_spider(self, spider):
        """
        Runs once, when the spider starts
        :param spider:
        :return:
        """
        # Connect to the database
        self.client = MongoClient("mongodb://%s:%s@%s:%s" % (
            self.user,
            self.pwd,
            self.host,
            self.port
        ))

    def process_item(self, item, spider):
        """
        Store the item in the database
        :param item:
        :param spider:
        :return:
        """
        # Convert the item to a plain dict
        d = dict(item)
        # Records with empty values are not saved
        if all(d.values()):
            # Save into MongoDB
            self.client[self.db][self.collection].save(d)
        return item

        # To discard an item so later pipelines never see it:
        # raise DropItem()
```
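The empty-value filter relies on all() treating empty strings and None as falsy, so a record missing any field is skipped. A quick check (the sample dicts are made up):

```python
# all() is falsy-aware: an empty string, None or 0 among the values
# makes the record fail the check and be skipped by the pipeline
complete = {"title": "MacBook Pro", "price": "11499.00", "delivery_method": "京東發貨"}
partial = {"title": "MacBook Pro", "price": "", "delivery_method": None}

print(all(complete.values()))  # True
print(all(partial.values()))   # False
```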
6. Configuration file (settings.py)
```python
# database server
DB = "jd"
COLLECTION = "goods"
HOST = "127.0.0.1"
PORT = 27017
USER = "root"
PWD = "123"
ITEM_PIPELINES = {
    'MyScrapy.pipelines.MongoPipeline': 300,
}
```