
1. Saving data crawled by Scrapy into Elasticsearch


First, create the ES mapping, i.e. define an empty index in Elasticsearch. The code below creates the lagou index when run:

from datetime import datetime

from elasticsearch_dsl import DocType, Date, Nested, Boolean, \
    analyzer, InnerDoc, Completion, Keyword, Text, Integer
from elasticsearch_dsl.connections import connections

# Connect to the local Elasticsearch server
connections.create_connection(hosts=["localhost"])


class LagouType(DocType):
    # url_object_id = Keyword()
    url = Keyword()
    title = Text(analyzer="ik_max_word")
    salary = Keyword()
    job_city = Keyword()
    work_years = Text(analyzer="ik_max_word")
    degree_need = Keyword()
    job_type = Text(analyzer="ik_max_word")
    publish_time = Date()
    tags = Text(analyzer="ik_max_word")
    job_advantage = Text(analyzer="ik_max_word")
    job_desc = Text(analyzer="ik_max_word")
    job_addr = Text(analyzer="ik_max_word")
    company_url = Keyword()
    company_name = Text(analyzer="ik_max_word")
    crawl_time = Date()
    # min_salary = Integer()
    # max_salary = Integer()

    class Meta:
        index = "lagou"      # index name
        doc_type = "jobs"    # type name


if __name__ == "__main__":
    LagouType.init()  # create the index and mapping in ES
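A quick way to verify that init() worked is to ask ES for the index and its mapping through the client that create_connection registered; a minimal check, assuming the local instance from above:

from elasticsearch_dsl.connections import connections

es = connections.get_connection()             # the raw elasticsearch-py client
print(es.indices.exists(index="lagou"))       # True once LagouType.init() has run
print(es.indices.get_mapping(index="lagou"))  # dump the generated field mappings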

Next, define the save-to-ES logic in items.py:

import scrapy

from lagou.models.es_type import LagouType


class LagouJobItem(scrapy.Item):
    url_object_id = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    salary = scrapy.Field()
    job_city = scrapy.Field()
    work_years = scrapy.Field()
    degree_need = scrapy.Field()
    job_type = scrapy.Field()
    publish_time = scrapy.Field()
    tags = scrapy.Field()
    job_advantage = scrapy.Field()
    job_desc = scrapy.Field()
    job_addr = scrapy.Field()
    company_url = scrapy.Field()
    company_name = scrapy.Field()
    crawl_time = scrapy.Field()
    min_salary = scrapy.Field()
    max_salary = scrapy.Field()

    def save_to_es(self):
        # Copy the item fields onto an ES document and index it
        lagou_type = LagouType()
        lagou_type.url = self["url"]
        lagou_type.title = self["title"]
        lagou_type.salary = self["salary"]
        lagou_type.job_city = self["job_city"]
        lagou_type.work_years = self["work_years"]
        lagou_type.degree_need = self["degree_need"]
        lagou_type.job_type = self["job_type"]
        lagou_type.publish_time = self["publish_time"]
        lagou_type.tags = self["tags"]
        lagou_type.job_advantage = self["job_advantage"]
        lagou_type.job_desc = self["job_desc"]
        lagou_type.job_addr = self["job_addr"]
        lagou_type.company_url = self["company_url"]
        lagou_type.company_name = self["company_name"]
        lagou_type.crawl_time = self["crawl_time"]
        lagou_type.meta.id = self["url_object_id"]  # use url_object_id as the ES document id
        lagou_type.save()
        return
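Note that save_to_es stores the item under meta.id = url_object_id, so the ES document id is derived from the URL: re-crawling the same page overwrites the existing document instead of creating a duplicate.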

Next, define the ES pipeline in the pipelines file:

class ElasticsearchPipline(object):

    def process_item(self, item, spider):
        item.save_to_es()
        return item
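This pipeline issues one request to ES per item. For large crawls it can be worth batching writes with the bulk helper instead; the sketch below is my own variant, not part of the original code, and the class name and batch size are arbitrary:

from elasticsearch.helpers import bulk
from elasticsearch_dsl.connections import connections


class ElasticsearchBulkPipeline(object):
    """Hypothetical variant: buffer items and flush them to ES in batches."""

    def __init__(self):
        self.buffer = []

    def process_item(self, item, spider):
        # Build a raw bulk action, reusing the index/type names from the mapping above
        self.buffer.append({
            "_index": "lagou",
            "_type": "jobs",
            "_id": item["url_object_id"],
            "_source": dict(item),
        })
        if len(self.buffer) >= 100:  # arbitrary batch size
            self._flush()
        return item

    def close_spider(self, spider):
        # Flush whatever is left when the crawl ends
        if self.buffer:
            self._flush()

    def _flush(self):
        bulk(connections.get_connection(), self.buffer)
        self.buffer = []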

Then configure settings.py by adding the pipeline to ITEM_PIPELINES:

ITEM_PIPELINES = {
    "lagou.pipelines.ElasticsearchPipline": 300,
}

With that, every item the spider yields is written into Elasticsearch.
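To confirm that documents actually landed in the index, you can query through the same LagouType class; for example (assuming the module path lagou.models.es_type used earlier):

from lagou.models.es_type import LagouType

s = LagouType.search().query("match", title="python")  # full-text match on the analyzed title field
response = s.execute()
print(response.hits.total)  # number of matching documents
for hit in response:
    print(hit.title, hit.salary, hit.job_city)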

Detailed notes:

Elasticsearch officially provides a Python package for working with the search engine, elasticsearch-dsl-py. It plays the same role as an ORM framework like SQLAlchemy does for databases: instead of writing raw query commands, we operate on a Python class.

elasticsearch-dsl-py download

Download: https://github.com/elastic/elasticsearch-dsl-py

Documentation: http://elasticsearch-dsl.readthedocs.io/en/latest/

First, install the elasticsearch-dsl-py module.
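The package is published on PyPI as elasticsearch-dsl; pick the release line that matches your ES server version (for example the 5.x line for Elasticsearch 5):

pip install "elasticsearch-dsl>=5.0.0,<6.0.0"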

elasticsearch-dsl module usage notes

create_connection(hosts=['127.0.0.1']): connects to the Elasticsearch server; several hosts can be passed.

class Meta: sets the index name and the type (table) name.

DocTypeSubclass.init(): creates the index, the type, and the field mappings.

instance.save(): writes the data into Elasticsearch.

from elasticsearch_dsl import DocType, Text, Keyword, Date
from elasticsearch_dsl.connections import connections  # import the method for connecting to the Elasticsearch server

connections.create_connection(hosts=['127.0.0.1'])  # connect to the local server


class lagouType(DocType):  # define a class inheriting from DocType
    # Text fields are analyzed, so a Chinese analyzer must be specified; ik_max_word is a Chinese analyzer
    title = Text(analyzer="ik_max_word")  # field name = field type; Text is a string type that is analyzed into an inverted index
    description = Text(analyzer="ik_max_word")
    keywords = Text(analyzer="ik_max_word")
    url = Keyword()  # Keyword is a plain string type that is not analyzed
    riqi = Date()    # Date is a date type

    class Meta:  # Meta is a fixed convention
        index = "lagou"      # index name (comparable to a database name)
        doc_type = "jobs"    # type name (comparable to a table name)


if __name__ == "__main__":  # only runs when this file is executed directly, not when it is imported
    lagouType.init()  # create the index, type, and field mappings in Elasticsearch
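Once init() has run, writing a document is just a matter of instantiating the class and calling save(); a minimal example using the fields defined above (all values are placeholders):

from datetime import datetime

job = lagouType(
    meta={"id": 1},  # explicit document id; omit it to let ES generate one
    title="python developer",
    description="build and maintain the crawler system",
    keywords="python,scrapy,elasticsearch",
    url="http://example.com/job/1",
    riqi=datetime.now(),
)
job.save()  # writes the document into the lagou index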
