
Python Week 2 (Day 11) — My Python Journey: Master Python Data Mining in One Month! (19) — Scrapy + Mongo


MongoDB 3.2 and later use the WiredTiger storage engine by default.

To switch the storage engine at startup:

  mongod --storageEngine mmapv1 --dbpath d:\data\db

This resolves the problem of MongoVue being unable to view documents.
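Equivalently, the engine can be pinned in a mongod.conf file instead of on the command line. A minimal sketch (the dbPath here is illustrative and should match your own data directory):

```yaml
storage:
  dbPath: "d:\\data\\db"
  engine: mmapv1
```

Start the server with `mongod --config mongod.conf` to pick up these settings.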

Project workflow (steps):

Prerequisites: install scrapy, pymongo, and MongoDB.

 1. Generate the project skeleton: scrapy startproject stack

 2. items.py

from scrapy import Item, Field


class StackItem(Item):
    title = Field()
    url = Field()

 3. Create the spider

from scrapy import Spider
from stack.items import StackItem


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = response.xpath('//div[@class="summary"]/h3')

        for question in questions:
            item = StackItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item

 4. Learn to use XPath selectors to extract the data
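The spider's extraction logic can be tried outside Scrapy. The sketch below mirrors the same XPath expressions using only the standard library (`xml.etree.ElementTree` supports a subset of XPath); the HTML snippet is a made-up example, not real Stack Overflow markup:

```python
# Stand-alone sketch of the XPath extraction in parse(), no Scrapy required.
import xml.etree.ElementTree as ET

html = """
<root>
  <div class="summary">
    <h3><a class="question-hyperlink" href="/questions/1">How do I sort a list?</a></h3>
  </div>
</root>
"""

root = ET.fromstring(html)
items = []
# Same shape as the spider: find each summary h3, then pull text and href.
for h3 in root.findall(".//div[@class='summary']/h3"):
    a = h3.find("a[@class='question-hyperlink']")
    items.append({"title": a.text, "url": a.get("href")})

print(items)
```

Scrapy's selectors accept full XPath 1.0 (including `/@href`), while ElementTree only handles this simpler subset, so attribute values are read with `.get()` here.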

 5. Store the data in MongoDB

  5.1 settings.py

ITEM_PIPELINES = {
    'stack.pipelines.MongoDBPipeline': 300,
}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "stackoverflow"
MONGODB_COLLECTION = "questions"

  5.2 pipelines.py

import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log


class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        for data in item:
            # Iterating an Item yields field names, so check the value.
            if not item[data]:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Question added to MongoDB database!",
                    level=log.DEBUG, spider=spider)

        return item
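The validation guard in process_item can be sketched with a plain dict, so it runs without Scrapy, pymongo, or a MongoDB server (the field values are made-up examples):

```python
# Stand-in for the pipeline's validation step: every field must be non-empty,
# mirroring the loop in process_item.
def validate(item):
    """Return True if every field has a truthy value."""
    return all(item[field] for field in item)

good = {"title": "How do I sort a list?", "url": "/questions/1"}
bad = {"title": "", "url": "/questions/2"}

print(validate(good))  # True
print(validate(bad))   # False
```

In the real pipeline a failed check raises DropItem, which tells Scrapy to discard the item instead of passing it on to later pipeline stages.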

 6. Launch the spider: main.py

from scrapy import cmdline

cmdline.execute('scrapy crawl stack'.split())

Result screenshot:

(screenshot not included)


    
