
Implementing a Search Engine in Python with Elasticsearch

Elasticsearch is a search server built on Lucene. It provides a distributed, multi-tenant full-text search engine exposed through a RESTful web interface. Elasticsearch is developed in Java and released as open source under the Apache License. As one of the most popular enterprise search engines, it is designed for cloud computing: it offers near-real-time search and is stable, reliable, fast, and easy to install and use.
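Because everything goes through the RESTful interface, you can talk to Elasticsearch with any HTTP client. A minimal sketch, assuming a local node on the default port 9200 and the alldata index that this article builds later:

    import requests

    # Full-text query against the _search endpoint of the alldata index
    resp = requests.get(
        'http://127.0.0.1:9200/alldata/_search',
        json={'query': {'match': {'title': 'keyword'}}},
    )
    print(resp.json()['hits']['total'])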

1. Preparation

  • elasticsearch-rtf installed (a quick connectivity check follows this list) >>> installation tutorial
  • elasticsearch-head installed >>> installation tutorial
  • a working crawler project
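Before writing any code, it is worth confirming that the node is reachable. A quick check with the official Python client, assuming the default 127.0.0.1:9200:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['127.0.0.1'])
    # ping() returns True when the node answers
    print(es.ping())
    # info() reports the cluster name and version
    print(es.info()['version']['number'])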

2. Populating the Data

To build a search engine, the index first needs a large amount of data; with an empty index, it can hardly be called a search engine at all. So the first step is to crawl a large amount of data. I wrote a crawler for a novel site, and novel search will serve as the example.

  • Write the model.py file; once it is finished, call the init() function to create the mapping for the ES index.

    # coding: utf-8
    from elasticsearch_dsl import DocType, Completion, Text, Integer
    from elasticsearch_dsl.connections import connections
    from elasticsearch_dsl.analysis import CustomAnalyzer
    
    # 1. Create the ES connection
    connections.create_connection(hosts=['127.0.0.1'])
    
    # 2. Custom analyzer: elasticsearch_dsl validates analyzers on Completion
    # fields, so we wrap ik_max_word and return an empty analysis definition
    class MyAnalyzer(CustomAnalyzer):
        def get_analysis_definition(self):
            return {}
    
    # Build the analyzer object; the lowercase filter ignores letter case
    ik_analyzer = MyAnalyzer('ik_max_word', filter=['lowercase'])
    
    # 3. Define the data model
    class NovelModel(DocType):
        # 3.1 Plain fields
        title = Text(analyzer='ik_max_word')
        author = Text(analyzer='ik_max_word')
        classify = Text()
        rate = Text()
        collect = Integer()
        number = Text()
        time = Text()
        click_week = Integer()
        click_month = Integer()
        click_all = Integer()
        collect_week = Integer()
        collect_month = Integer()
        collect_all = Integer()
        abstract = Text()
        picture = Text()
        download_url = Text()
        # 3.2 Search-suggestion field
        suggest = Completion(analyzer=ik_analyzer)
        # 3.3 Meta information
        class Meta:
            # index: the index name (think "database")
            index = 'alldata'
            # doc_type: the type name (think "table")
            doc_type = 'novel'
    
    if __name__ == '__main__':
        # Create the index mapping in ES
        NovelModel.init()
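After running the script once, you can confirm that the mapping was created (a sketch using the connection registered above):

    from elasticsearch_dsl.connections import connections

    es = connections.get_connection()
    # Shows the field definitions that NovelModel.init() wrote to the index
    print(es.indices.get_mapping(index='alldata'))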
    
  • Write a Pipeline to store the data

        Since a crawler project may contain more than one spider, and each spider's Item is different, the save logic lives in each Item class. Whenever an Item reaches the Pipeline, the Pipeline simply calls the Item's own save method, so each kind of Item is handled appropriately without the Pipeline needing to know about it.

    # pipelines.py
    class ToEsPipeline(object):
        def process_item(self, item, spider):
            # Each Item implements its own save_to_es()
            item.save_to_es()
            return item
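For the pipeline to run at all, it must be enabled in the Scrapy project's settings.py (the package name myproject below is a placeholder for your own project):

    # settings.py
    ITEM_PIPELINES = {
        # 'myproject' is a placeholder for your project package name
        'myproject.pipelines.ToEsPipeline': 300,
    }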
    
  • Write the Item

    import scrapy
    from elasticsearch_dsl.connections import connections
    from .es_model import NovelModel
    
    # 1. Create the connection and keep the connection object
    es = connections.create_connection(hosts=['http://39.107.255.196'])
    
    # 2. Build the search-suggestion word list
    def process_suggest(index, *args):
        '''
        :param index: the ES index (think "database")
        :param args: (text, weight) tuples whose text should be analyzed
        :return: the suggestion list, with duplicate words removed
        '''
        # Words already claimed by a higher-weight text
        use_words = set()
        # The search-suggestion list to return
        suggest = []
        for text, weight in args:
            # text: the text to analyze
            # weight: the suggestion weight
            # Call the ES _analyze API to tokenize the text
            words = es.indices.analyze(
                # the ES index (database)
                index=index,
                analyzer='ik_max_word',
                # extra query parameter: the token filter
                params={
                    'filter': ['lowercase'],
                },
                body={
                    'text': text
                }
            )
            # Comprehension plus set() removes duplicate tokens
            analyzer_words = set([dic['token'] for dic in words['tokens']])
            new_words = analyzer_words - use_words
            # Append only the words that have not been used yet
            suggest.append({'input': list(new_words), 'weight': weight})
            # Accumulate the used words for the next iteration
            use_words |= analyzer_words
    
        return suggest
    
    # 3. The Item
    class MyItem(scrapy.Item):
        novel_classify = scrapy.Field()
        novel_title = scrapy.Field()
        novel_author = scrapy.Field()
        novel_rate = scrapy.Field()
        novel_collect = scrapy.Field()
        novel_number = scrapy.Field()
        novel_time = scrapy.Field()
        click_all = scrapy.Field()
        click_month = scrapy.Field()
        click_week = scrapy.Field()
        collect_all = scrapy.Field()
        collect_month = scrapy.Field()
        collect_week = scrapy.Field()
        novel_abstract = scrapy.Field()
        novel_picture = scrapy.Field()
        novel_download = scrapy.Field()
    
        # 3.1 The save method called by the pipeline
        def save_to_es(self):
            # Create the NovelModel document object
            novel = NovelModel()
            # Assign the plain fields
            novel.title = self['novel_title']
            novel.author = self['novel_author']
            novel.classify = self['novel_classify']
            novel.rate = self['novel_rate']
            novel.collect = self['novel_collect']
            novel.number = self['novel_number']
            novel.time = self['novel_time']
            novel.click_week = self['click_week']
            novel.click_month = self['click_month']
            novel.click_all = self['click_all']
            novel.collect_week = self['collect_week']
            novel.collect_month = self['collect_month']
            novel.collect_all = self['collect_all']
            novel.abstract = self['novel_abstract']
            novel.picture = self['novel_picture']
            novel.download_url = self['novel_download']
            # Build the search suggestions: title weighted 10, author weighted 8
            novel.suggest = process_suggest(NovelModel._doc_type.index, (novel.title, 10), (novel.author, 8))
            # Write the document to ES
            novel.save()
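For intuition, the value stored in novel.suggest ends up shaped roughly like this (the tokens shown are illustrative; the real ones come from the ik_max_word analyzer):

    # Illustrative shape of novel.suggest
    [
        {'input': ['full title', 'title'], 'weight': 10},  # tokens from the title
        {'input': ['author'], 'weight': 8},                # tokens from the author
    ]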
    

3. The Django Project

The Django project also uses the model.py file from the Scrapy project, so copy it into the Django project. The views below implement the index page, the result page, and the search-suggestion endpoint.
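The three views assume routing along these lines; the paths and names are my assumption rather than part of the original project (the name 'index' matches the redirect used in the result view):

    # urls.py (hypothetical)
    from django.conf.urls import url
    from . import views

    urlpatterns = [
        url(r'^$', views.index, name='index'),
        url(r'^result/$', views.result, name='result'),
        url(r'^suggest/$', views.suggest, name='suggest'),
    ]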

import math
from redis import Redis
from urllib import parse
from datetime import datetime
from django.shortcuts import render, redirect
from django.http import JsonResponse
from elasticsearch_dsl.connections import connections

from .es_models.es_types import NovelModel

rds = Redis(host='127.0.0.1', port=6379)  # Redis stores the hot-search ranking
es = connections.create_connection(hosts=['127.0.0.1'])

def index(request):
    # Define the searchable data types for the nav bar
    navs = [
        {'type': 'novel', 'title': 'Novel'},
        {'type': 'movie', 'title': 'Movie'},
        {'type': 'job', 'title': 'Job'},
        {'type': 'news', 'title': 'News'},
    ]

    content = {
        'navs': navs,
        'search_type': 'novel'
    }

    if request.method == 'GET':
        return render(request, 'index.html', content)


def result(request):
    if request.method == 'GET':
        # Read the keyword and the search type from the query string
        keyword = request.GET.get('kw')
        s_type = request.GET.get('s_type')
        # Default to page 1 when no page number is given
        page_num = request.GET.get('pn', 1)
        # Without a keyword, redirect back to the index page
        if not keyword:
            return redirect('index')

        # Count this keyword in the hot-search ranking (redis-py < 3.0
        # signature; redis-py >= 3.0 uses zincrby(name, amount, value))
        rds.zincrby('hotkey', keyword)
        # The five hottest keywords, highest score first
        hot_top5 = rds.zrevrange('hotkey', 0, 4)

        # Keep the five most recent keywords in a 'history' cookie
        history = request.COOKIES.get('history', None)
        cookie_str = ''
        if history:
            cookies = history.split(',')
            if parse.quote(keyword) in cookies:
                cookies.remove(parse.quote(keyword))
            cookies.insert(0, parse.quote(keyword))
            if len(cookies) > 5:
                cookies.pop()
            cookie_str = ','.join(cookies)
        else:
            cookies = []
            cookie_str = parse.quote(keyword)

        # Dispatch on the search type
        if s_type == 'novel':
            # 1. the index to search
            index = 'alldata'
            # 2. the doc type name
            doc_type = 'novel'
            # 3. the fields to match against
            fields = ['title', 'abstract']
            start_time = datetime.now()
            rs = es.search(
                index=index,
                doc_type=doc_type,
                body={
                    "query": {
                        "multi_match": {
                            "query": keyword,
                            "fields": fields
                        }
                    },
                    "from": (int(page_num) - 1) * 10,
                    "size": 10,
                    'highlight': {
                        'pre_tags': ['<span class="keyWord">'],
                        "post_tags": ['</span>'],
                        "fields": {
                            "title": {},
                            "bstract": {}
                        }
                    }
                }
            )
            use_time = (datetime.now() - start_time).total_seconds()
            hits_list = []
            for hit in rs['hits']['hits']:
                h_dic = {}
                # Prefer the highlighted fragment; fall back to the raw field
                highlight = hit.get('highlight', {})
                if 'title' in highlight:
                    h_dic['title'] = highlight['title'][0]
                else:
                    h_dic['title'] = hit['_source']['title']
                if 'abstract' in highlight:
                    h_dic['abstract'] = highlight['abstract'][0]
                else:
                    h_dic['abstract'] = hit['_source']['abstract']
                h_dic['detail_url'] = hit['_source']['download_url'][0]

                hits_list.append(h_dic)

            navs = [
                {'type': 'novel', 'title': 'Novel'},
                {'type': 'job', 'title': 'Job'},
                {'type': 'movie', 'title': 'Movie'},
                {'type': 'news', 'title': 'News'},
            ]

            # Total number of matching documents
            total = rs['hits']['total']
            # Number of result pages, rounded up
            page_nums = math.ceil(total / 10)

            # Build a sliding window of up to ten page links
            page_num = int(page_num)
            if page_nums <= 10:
                pages = range(1, page_nums + 1)
            elif page_num - 4 <= 0:
                pages = range(1, 11)
            elif page_num + 5 >= page_nums:
                pages = range(page_nums - 9, page_nums + 1)
            else:
                pages = range(page_num - 4, page_num + 6)


            content = {
                'hits': hits_list,
                'kw': keyword,
                'use_time': use_time,
                'total': total,
                'page_nums': page_nums,
                'navs': navs,
                'search_type': s_type,
                'pages': pages,
                'history': parse.unquote(cookie_str).split(','),
                'hot_top5': hot_top5
            }
        # Render the result page and persist the search-history cookie
        response = render(request, 'result.html', content)
        response.set_cookie('history', cookie_str)

        return response


def suggest(request):
    if request.method == 'GET':
        # Read the typed text and the search type
        s = request.GET.get('s', None)
        s_type = request.GET.get('s_type')
        content = {}
        if s:
            # Query ES for suggestions matching the keyword and type
            datas = get_suggest(s, s_type)
            content['status'] = 0
            content['datas'] = datas
            content['s_type'] = s_type
            if len(datas) == 0:
                content['status'] = -1
        else:
            content['status'] = -1

        return JsonResponse(content)


# Fetch search suggestions from ES
def get_suggest(keyword, s_type):
    '''
    :param keyword: the search keyword
    :param s_type: the search type
    :return: the list of suggestions
    '''
    # Create a Search object for the matching model
    if s_type == 'novel':
        search = NovelModel.search()
    elif s_type == 'job':
        pass
    # suggest() is the search-suggestion interface:
    # 1. a custom key under which the results are returned
    # 2. the search keyword
    result = search.suggest(
        'r_suggest',
        keyword,
        completion={
            'field': 'suggest',
            'fuzzy': {
                'fuzziness': 2
            },
            'size': 5
        }
    )
    # execute_suggest() returns a dict-like response
    s = result.execute_suggest()
    fields = {'novel': 'title'}
    # Collect the suggested values into a result list
    datas = []
    for dic in s['r_suggest'][0]['options']:
        sug = dic._source[fields[s_type]]
        datas.append(sug)

    # Return the search suggestions
    return datas
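With the routing sketched earlier, the suggestion endpoint can be exercised directly (assuming the Django dev server on 127.0.0.1:8000; the parameter names follow the suggest view above):

    import requests

    # Ask the suggest view to complete a partial keyword
    resp = requests.get(
        'http://127.0.0.1:8000/suggest/',
        params={'s': 'partial keyword', 's_type': 'novel'},
    )
    # Expected shape: {'status': 0, 'datas': [...], 's_type': 'novel'}
    print(resp.json())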