1. Saving data crawled by Scrapy to Elasticsearch (ES)
First create the ES mapping, that is, create an empty index in ES. Running the code below creates the lagou index in ES.
from datetime import datetime
from elasticsearch_dsl import (DocType, Date, Nested, Boolean, analyzer,
                               InnerDoc, Completion, Keyword, Text, Integer)
from elasticsearch_dsl.connections import connections

# Connect to the local Elasticsearch server
connections.create_connection(hosts=["localhost"])


class LagouType(DocType):
    # url_object_id = Keyword()
    url = Keyword()
    title = Text(analyzer="ik_max_word")
    salary = Keyword()
    job_city = Keyword()
    work_years = Text(analyzer="ik_max_word")
    degree_need = Keyword()
    job_type = Text(analyzer="ik_max_word")
    publish_time = Date()
    tags = Text(analyzer="ik_max_word")
    job_advantage = Text(analyzer="ik_max_word")
    job_desc = Text(analyzer="ik_max_word")
    job_addr = Text(analyzer="ik_max_word")
    company_url = Keyword()
    company_name = Text(analyzer="ik_max_word")
    crawl_time = Date()
    # min_salary = Integer()
    # max_salary = Integer()

    class Meta:
        index = "lagou"
        doc_type = "jobs"


if __name__ == "__main__":
    # Create the index and its mapping in ES
    LagouType.init()
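For orientation, the mapping that init() registers should look roughly like the dictionary below. This is a hand-written approximation based on the field types declared above, assuming an Elasticsearch version that still uses document types (hence the "jobs" level). It is not output captured from a real cluster, and only three of the fields are shown. The helper field_types is hypothetical.

```python
import json

# Hand-written approximation of the mapping for a few LagouType fields.
# Assumption: an Elasticsearch version with mapping types, so the
# properties sit under the "jobs" doc_type.
lagou_mapping = {
    "mappings": {
        "jobs": {
            "properties": {
                "url": {"type": "keyword"},
                "title": {"type": "text", "analyzer": "ik_max_word"},
                "publish_time": {"type": "date"},
            }
        }
    }
}


def field_types(mapping, doc_type):
    """Return {field name: field type} for one doc_type in a mapping body."""
    props = mapping["mappings"][doc_type]["properties"]
    return {name: spec["type"] for name, spec in props.items()}


if __name__ == "__main__":
    print(json.dumps(lagou_mapping, indent=2))
    print(field_types(lagou_mapping, "jobs"))
```

Keyword fields are stored as a single term, while Text fields carry the analyzer that splits them into terms at index time.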
Next, in the items file, define the item together with the code that saves it to ES:
import scrapy

from lagou.models.es_type import LagouType


class LagouJobItem(scrapy.Item):
    url_object_id = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    salary = scrapy.Field()
    job_city = scrapy.Field()
    work_years = scrapy.Field()
    degree_need = scrapy.Field()
    job_type = scrapy.Field()
    publish_time = scrapy.Field()
    tags = scrapy.Field()
    job_advantage = scrapy.Field()
    job_desc = scrapy.Field()
    job_addr = scrapy.Field()
    company_url = scrapy.Field()
    company_name = scrapy.Field()
    crawl_time = scrapy.Field()
    min_salary = scrapy.Field()
    max_salary = scrapy.Field()

    def save_to_es(self):
        # Copy the item's fields onto the ES document
        lagou_type = LagouType()
        lagou_type.url = self["url"]
        lagou_type.title = self["title"]
        lagou_type.salary = self["salary"]
        lagou_type.job_city = self["job_city"]
        lagou_type.work_years = self["work_years"]
        lagou_type.degree_need = self["degree_need"]
        lagou_type.job_type = self["job_type"]
        lagou_type.publish_time = self["publish_time"]
        lagou_type.tags = self["tags"]
        lagou_type.job_advantage = self["job_advantage"]
        lagou_type.job_desc = self["job_desc"]
        lagou_type.job_addr = self["job_addr"]
        lagou_type.company_url = self["company_url"]
        lagou_type.company_name = self["company_name"]
        lagou_type.crawl_time = self["crawl_time"]
        # Use url_object_id as the ES document ID
        lagou_type.meta.id = self["url_object_id"]

        lagou_type.save()
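save_to_es() uses url_object_id as the document ID, but this section does not show where that value comes from. One common choice, assumed here rather than taken from the source, is an MD5 hex digest of the URL: the same URL always maps to the same ID, so re-crawling a page overwrites the existing document instead of creating a duplicate. The helper name get_md5 and the sample URL are hypothetical.

```python
import hashlib


def get_md5(url):
    """Return a 32-character hex digest usable as a stable document ID."""
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


if __name__ == "__main__":
    job_url = "https://www.lagou.com/jobs/123456.html"  # made-up example URL
    print(get_md5(job_url))
```

In a spider, the result would be stored in the item's url_object_id field before the item reaches the pipeline.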
Next, define a pipeline in the pipelines file that saves items to ES:
class ElasticsearchPipline(object):
    def process_item(self, item, spider):
        item.save_to_es()
        return item
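If a project has several item classes and not all of them define save_to_es(), the pipeline above would raise AttributeError. A slightly more defensive variant, offered only as a sketch, checks for the method first. SafeElasticsearchPipeline and FakeItem are hypothetical names; FakeItem is a stand-in used to exercise the pipeline without Scrapy or a running Elasticsearch.

```python
class SafeElasticsearchPipeline(object):
    """Variant of the pipeline that skips items without save_to_es()."""

    def process_item(self, item, spider):
        save = getattr(item, "save_to_es", None)
        if callable(save):
            save()
        return item  # always hand the item on to the next pipeline


class FakeItem(dict):
    """Stand-in item that records whether save_to_es() was called."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.saved = False

    def save_to_es(self):
        self.saved = True


if __name__ == "__main__":
    pipeline = SafeElasticsearchPipeline()
    item = pipeline.process_item(FakeItem(title="Python engineer"), spider=None)
    print(item.saved)
    plain = pipeline.process_item({"title": "no save method"}, spider=None)
    print(plain)
```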
Finally, register the pipeline in settings by adding it to ITEM_PIPELINES:
ITEM_PIPELINES = {
    'lagou.pipelines.ElasticsearchPipline': 300,
}
With this in place, the crawled data is saved to ES.
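The number 300 is the pipeline's priority: Scrapy runs pipelines in ascending order of this value, and values are conventionally chosen between 0 and 1000. The sketch below only imitates that ordering with a plain sort. The second entry, lagou.pipelines.CleanItemPipeline, is a hypothetical pipeline added for illustration, as is the helper run_order.

```python
ITEM_PIPELINES = {
    "lagou.pipelines.ElasticsearchPipline": 300,
    "lagou.pipelines.CleanItemPipeline": 100,  # hypothetical, not in the source
}


def run_order(pipelines):
    """Return pipeline paths sorted by priority, lowest value first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    for path in run_order(ITEM_PIPELINES):
        print(path)
```

Giving the ES pipeline a higher number than any cleaning pipeline means items are saved only after they have been cleaned.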
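Detailed explanation:
Elasticsearch also provides an official Python package for working with the search engine. It behaves like an ORM, much as SQLAlchemy does for databases, so instead of writing raw Elasticsearch commands we use the elasticsearch-dsl-py module and simply work with a Python class.
Getting elasticsearch-dsl-py
Source: https://github.com/elastic/elasticsearch-dsl-py
Documentation: http://elasticsearch-dsl.readthedocs.io/en/latest/
First install the elasticsearch-dsl-py module.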
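1. Using the elasticsearch-dsl module
create_connection(hosts=['127.0.0.1']): connects to the Elasticsearch server; several hosts can be listed
class Meta: sets the index name and the type (table) name
<index class>.init(): creates the index, the type, and the fields
<index class instance>.save(): writes the data to Elasticsearch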
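from elasticsearch_dsl import DocType, Date, Keyword, Text  # base class and field types
from elasticsearch_dsl.connections import connections  # helper for connecting to the Elasticsearch server

connections.create_connection(hosts=['127.0.0.1'])  # connect to the local server


class lagouType(DocType):  # custom class that inherits from DocType
    # Text fields are tokenized, so they need a Chinese analyzer; ik_max_word is one
    title = Text(analyzer="ik_max_word")  # field name = field type; Text is a string that is tokenized into an inverted index
    description = Text(analyzer="ik_max_word")
    keywords = Text(analyzer="ik_max_word")
    url = Keyword()  # Keyword is a plain string that is not tokenized
    riqi = Date()  # Date is a date type ("riqi" means "date")

    class Meta:  # the name Meta is fixed
        index = "lagou"  # index name (comparable to a database name)
        doc_type = 'jobs'  # type name (comparable to a table name)


if __name__ == "__main__":  # run only when this file is executed directly, not when it is imported
    lagouType.init()  # create the index, type, fields and so on in Elasticsearch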
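The difference between Text and Keyword is easier to see with a toy inverted index. The sketch below is plain Python and not how Elasticsearch is implemented; whitespace splitting stands in for the ik_max_word analyzer, and the function name build_index and the sample documents are made up.

```python
from collections import defaultdict


def build_index(docs, analyzed):
    """Map each term to the set of document IDs containing it.

    analyzed=True imitates a Text field (value split into terms);
    analyzed=False imitates a Keyword field (whole value is one term).
    """
    index = defaultdict(set)
    for doc_id, value in docs.items():
        terms = value.lower().split() if analyzed else [value]
        for term in terms:
            index[term].add(doc_id)
    return index


docs = {1: "Python backend engineer", 2: "Java backend engineer"}  # made-up data

text_index = build_index(docs, analyzed=True)
keyword_index = build_index(docs, analyzed=False)

if __name__ == "__main__":
    # A single word finds both documents in the Text-style index...
    print(sorted(text_index["backend"]))
    # ...but nothing in the Keyword-style index, which needs the exact value.
    print(sorted(keyword_index["backend"]))
    print(sorted(keyword_index["Python backend engineer"]))
```

This is why fields such as title use Text, where partial matches are wanted, while url uses Keyword, where only an exact match makes sense.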