A Few Things About On-Site Search

阿新 • Published: 2019-01-19

- Preface
- Demo
  - Case 1
  - Case 2
- Summary

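Preface

I work in Python these days, so it was time to iterate on my earlier approach. A quick search online turned up quite a few relevant libraries, pylucene among them, but by comparison whoosh struck me as the better choice. So that is what I'll use today.

Installing it is simple.

pip install whoosh

That's all it takes.

Goal: build "on-site search" for my own blog, to make up a little for the shortcomings of CSDN's built-in search.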

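Modularization

Lately I've grown more and more fond of splitting tasks into modules. Each piece of functionality is easier to manage on its own, integration testing is simpler when the pieces come together, and adding features or refactoring is easier too.

For the goal above I designed a few small modules, explained one by one below.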

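Login module

The login module is more or less required: fetching a post's full details needs a logged-in session behind it, otherwise no data comes back.

I wrote a small CSDN simulated-login example before. At the time it could:

Simulate login
Upvote and downvote posts
Post comments
Fetch blogger details

To keep people with bad intentions from misusing that code, I won't post all of it here. For technical questions, feel free to message me privately or leave a comment below the post.

The simulated-login part is included below.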

import requests
from bs4 import BeautifulSoup


class Login(object):
    """
    Log in to CSDN and return the authenticated session that the
    blog-details module needs. Requires your own username and password.
    """

    def __init__(self):
        # Common headers for the login requests.
        self.headers = {
            'Host': 'passport.csdn.net',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }

    def login(self, username, password):
        if username and password:
            self.username = username
            self.password = password
        else:
            raise ValueError('Need your username and password!')

        loginurl = 'https://passport.csdn.net/account/login'
        # Fetch the login page to obtain the 'lt' token used by the web flow.
        self.session = requests.Session()
        response = self.session.get(url=loginurl, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        self.token = soup.find('input', {'name': 'lt'})['value']

        # Assemble the form data posted when logging in.
        payload = {
            'username': self.username,
            'password': self.password,
            'lt': self.token,
            'execution': soup.find('input', {'name': 'execution'})['value'],
            '_eventId': 'submit'
        }
        response = self.session.post(url=loginurl, data=payload, headers=self.headers)

        # Hand back the session only if the POST returned HTTP 200.
        return self.session if response.status_code == 200 else None
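
The token step in the login flow comes down to reading hidden form fields from the login page and merging them with the credentials. The sketch below shows that step in isolation using only the standard library, so it can run without a network connection. The `HiddenInputParser` and `build_payload` names and the sample form markup are made up for illustration; the real page may look different.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of every <input> tag in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        if name:
            self.fields[name] = attrs.get("value", "")


def build_payload(html, username, password):
    """Merge the form's 'lt' and 'execution' fields with the credentials."""
    parser = HiddenInputParser()
    parser.feed(html)
    return {
        "username": username,
        "password": password,
        "lt": parser.fields["lt"],
        "execution": parser.fields["execution"],
        "_eventId": "submit",
    }


# Hypothetical form markup, for illustration only.
SAMPLE_FORM = """
<form id="fm1" method="post">
  <input type="text" name="username" />
  <input type="password" name="password" />
  <input type="hidden" name="lt" value="LT-123-abc" />
  <input type="hidden" name="execution" value="e1s1" />
  <input type="hidden" name="_eventId" value="submit" />
</form>
"""

payload = build_payload(SAMPLE_FORM, "alice", "secret")
print(payload["lt"], payload["execution"])
```

A missing `lt` or `execution` field raises `KeyError` here, which is a louder failure than posting an incomplete form.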

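Blog scanning module

This module does not need a logged-in session. It scans all of the blogger's posts and collects the URL of each one, because the next module uses those URLs to fetch the post details.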

import re

import requests
from bs4 import BeautifulSoup


class BlogScanner(object):
    """
    Scan a blogger's list pages and collect the URL of every post.
    """

    def __init__(self, domain):
        self.username = domain
        self.rooturl = 'http://blog.csdn.net'
        self.bloglinks = []
        self.headers = {
            'Host': 'blog.csdn.net',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }

    def scan(self):
        # Read the page count from the pager on the blog's home page.
        response = requests.get(url=self.rooturl + "/" + self.username, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        pagecontainer = soup.find('div', {'class': 'pagelist'})
        pages = re.findall(r'(\d+)', pagecontainer.find('span').get_text())[-1]

        # Construct the list-page URLs, e.g. http://blog.csdn.net/Marksinoberg/article/list/2
        for index in range(1, int(pages) + 1):
            # Collect the post links on each list page.
            listurl = 'http://blog.csdn.net/{}/article/list/{}'.format(self.username, index)
            response = requests.get(url=listurl, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            try:
                alinks = soup.find_all('span', {'class': 'link_title'})
                for alink in alinks:
                    link = alink.find('a').attrs['href']
                    link = self.rooturl + link
                    self.bloglinks.append(link)
            except Exception as e:
                # Report the problem and move on to the next list page.
                print('Something unexpected happened!\n{}'.format(e))
                continue

        return self.bloglinks
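
The page-count step can be tried on its own: take the last number in the pager text, then build one list URL per page. This is a minimal sketch; `last_page_number`, `list_page_urls`, and the sample pager string are assumptions for illustration, since the actual pager text on the site may differ.

```python
import re


def last_page_number(pager_text):
    """Return the last integer found in the pager text, or 1 if none."""
    numbers = re.findall(r"\d+", pager_text)
    return int(numbers[-1]) if numbers else 1


def list_page_urls(username, pager_text, root="http://blog.csdn.net"):
    """Build the URL of every list page for the given blogger."""
    pages = last_page_number(pager_text)
    return ["{}/{}/article/list/{}".format(root, username, i)
            for i in range(1, pages + 1)]


# Hypothetical pager text: "75 posts, 5 pages in total".
sample_pager = "75條 共5頁"
urls = list_page_urls("Marksinoberg", sample_pager)
print(len(urls), urls[1])
```

Falling back to a single page when no number is found keeps the scan working for blogs too small to have a pager.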

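Blog details module

As for post details, I think CSDN did a really nice job here, and the data comes back as JSON. Without further ado: a logged-in request returns the full details of a post.

That makes the plan clear: fetch the title, URL, tags, summary description, and body of each post. The code: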

import json


class BlogDetails(object):
    """
    Fetch a post's details through the Markdown editor endpoint.
    'url': post URL
    'title': post title
    'tags': tags attached to the post
    'description': summary description of the post
    'content': Markdown source of the post
    """

    def __init__(self, session, blogurl):
        self.headers = {
            'Referer': 'http://write.blog.csdn.net/mdeditor',
            'Host': 'write.blog.csdn.net',
            'X-Requested-With': 'XMLHttpRequest',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }
        # Build the endpoint URL from the article id and the username, e.g.
        # http://blog.csdn.net/marksinoberg/article/details/70432419
        parts = blogurl.split('/')
        username, article_id = parts[3], parts[-1]
        self.blogurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(article_id, username)
        self.session = session

    def getSource(self):
        # Return the post's fields as a dict, or None if the request fails.
        try:
            response = self.session.get(url=self.blogurl, headers=self.headers)
            data = json.loads(response.text)['data']
            return {
                'url': data['url'],
                'title': data['title'],
                'tags': data['tags'],
                'description': data['description'],
                'content': data['markdowncontent'],
            }
        except Exception as e:
            print("API request failed! Details: {}".format(e))
            return None
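Search module

The search module is today's centerpiece. It is built on whoosh, a really thoughtful library with detailed, easy-to-follow documentation. If I can manage with my clumsy English, you certainly can too.

The default text analyzer targets English, so to treat Chinese properly the text has to go through Chinese word segmentation. I copied a tokenizer from the web, though its results are not great.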

import jieba
from whoosh.analysis import Token, Tokenizer
from whoosh.compat import text_type


class ChineseTokenizer(Tokenizer):
    def __call__(self, value, positions=False, chars=False, keeporiginal=False,
                 removestops=True, start_pos=0, start_char=0, mode='', **kwargs):
        assert isinstance(value, text_type), "%r is not unicode" % value
        t = Token(positions=positions, chars=chars, removestops=removestops, mode=mode, **kwargs)
        # Segment the Chinese text with jieba. tokenize() also yields each
        # word's start/end offsets, so repeated words get their own spans.
        for pos, (w, start, end) in enumerate(jieba.tokenize(value)):
            t.original = t.text = w
            t.boost = 1.0
            if positions:
                # Position is the token's index in the stream.
                t.pos = start_pos + pos
            if chars:
                t.startchar = start_char + start
                t.endchar = start_char + end
            yield t


def ChineseAnalyzer():
    return ChineseTokenizer()
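
Token offsets are easy to get wrong when a word occurs more than once in the text. The sketch below compares two ways of locating segmented words: looking each word up with `str.find`, which always reports the first occurrence, and tracking a running cursor, which gives every repeat its own span. The word list is hand-written and stands in for a segmenter's output; both function names are made up for this example.

```python
def spans_with_find(text, words):
    """Naive offsets: str.find always returns the FIRST occurrence."""
    return [(w, text.find(w), text.find(w) + len(w)) for w in words]


def spans_with_cursor(text, words):
    """Offsets tracked with a running cursor, so repeats get their own span."""
    spans, cursor = [], 0
    for w in words:
        start = text.index(w, cursor)  # search only from the cursor onward
        end = start + len(w)
        spans.append((w, start, end))
        cursor = end
    return spans


text = "搜索引擎和搜索"
# Hand-written segmentation, standing in for a real segmenter's output.
words = ["搜索", "引擎", "和", "搜索"]

print(spans_with_find(text, words)[-1])    # ('搜索', 0, 2)
print(spans_with_cursor(text, words)[-1])  # ('搜索', 5, 7)
```

Correct character offsets matter mostly for features that map hits back onto the text, such as highlighting.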


import os

from whoosh.fields import ID, KEYWORD, TEXT, Schema
from whoosh.index import create_in
from whoosh.qparser import QueryParser


class Searcher(object):
    """
    First: define a schema suited to this system; it is simply hard-coded.
            'url': post URL
            'title': post title
            'tags': tags attached to the post
            'description': summary description of the post
            'content': Markdown source of the post
    Second: add documents (blog posts).
    Third: run the user's query string and return the best-scoring posts.
    """
    def __init__(self):
        # Define the schema.
        self.schema = Schema(url=ID(stored=True),
                             title=TEXT(stored=True),
                             tags=KEYWORD(commas=True),
                             description=TEXT(stored=True),
                             content=TEXT(analyzer=ChineseAnalyzer()))
        # Create the directory that holds the index files.
        self.indexdir = "indexdir"
        if not os.path.exists(self.indexdir):
            os.mkdir(self.indexdir)
        # create_in() builds a new, empty index here on every run.
        self.indexer = create_in(self.indexdir, schema=self.schema)

    def addblog(self, blog):
        writer = self.indexer.writer()
        # Write the post's details into the index.
        writer.add_document(url=blog['url'],
                            title=blog['title'],
                            tags=blog['tags'],
                            description=blog['description'],
                            content=blog['content'])
        writer.commit()

    def search(self, querystring):
        with self.indexer.searcher() as searcher:
            query = QueryParser('content', self.schema).parse(querystring)
            results = searcher.search(query)
            # Copy the stored fields out before the searcher closes;
            # the hits cannot be read once the block exits.
            return [hit.fields() for hit in results]
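
To make the split between indexing and searching concrete, here is a toy inverted index in plain Python. It is only a conceptual sketch, not how whoosh is implemented: there is no scoring, no analyzer beyond lowercasing and whitespace splitting, and the `TinyIndex` class is invented for this example.

```python
from collections import defaultdict


class TinyIndex(object):
    """Map each term to the set of post URLs whose content contains it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of URLs
        self.titles = {}                  # URL -> stored title

    def add(self, url, title, content):
        self.titles[url] = title
        # Normalize at index time: lowercase, split on whitespace.
        for term in content.lower().split():
            self.postings[term].add(url)

    def search(self, query):
        # Normalize the query the same way, then require every term.
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted((url, self.titles[url]) for url in hits)


index = TinyIndex()
index.add("u1", "DBHelper in Python", "a DBHelper class written in Python")
index.add("u2", "Whoosh notes", "full text search with Whoosh and Python")

print(index.search("DbHelper"))
print(index.search("python"))
```

Applying the same normalization to documents and queries is why a query such as `DbHelper` can match text that says `DBHelper`.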

Demo

That's about it. Now let's see how it runs.

Case 1

First, a search for the keyword DBHelper. Indexing gets slow when there are many posts, so I only crawl the first few.

# coding: utf8

# @Author: 郭 璞
# @File: TestAll.py
# @Time: 2017/5/12
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description:
from whooshlearn.csdn import Login, BlogScanner, BlogDetails, Searcher

login = Login()
session = login.login(username="Username", password="password")
print(session)

scanner = BlogScanner(domain="Marksinoberg")
blogs = scanner.scan()
print(blogs[0:3])

blogdetails = BlogDetails(session=session, blogurl=blogs[0])
blog = blogdetails.getSource()
print(blog['url'])
print(blog['description'])
print(blog['tags'])

# Test whoosh through the Searcher wrapper.
searcher = Searcher()
for counter, item in enumerate(blogs[0:7], start=1):
    print("Processing article {}".format(counter))
    details = BlogDetails(session=session, blogurl=item).getSource()
    if details:  # getSource() returns nothing when the request fails
        searcher.addblog(details)
# searcher.addblog(blog)
print(searcher.search('DbHelper'))
# searcher.search('Python')

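The output of the run:
[Figure: search results for the keyword DBHelper]

As you can see, only the first two posts on my blog are about DBHelper, so those two documents were hit. Not bad.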

Case 2

Now let's try another keyword, for example Python.

# coding: utf8

# @Author: 郭 璞
# @File: TestAll.py
# @Time: 2017/5/12
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description:
from whooshlearn.csdn import Login, BlogScanner, BlogDetails, Searcher

login = Login()
session = login.login(username="username", password="password")
print(session)

scanner = BlogScanner(domain="Marksinoberg")
blogs = scanner.scan()
print(blogs[0:3])

blogdetails = BlogDetails(session=session, blogurl=blogs[0])
blog = blogdetails.getSource()
print(blog['url'])
print(blog['description'])
print(blog['tags'])

# Test whoosh through the Searcher wrapper.
searcher = Searcher()
for counter, item in enumerate(blogs[0:10], start=1):
    print("Processing article {}".format(counter))
    details = BlogDetails(session=session, blogurl=item).getSource()
    if details:  # getSource() returns nothing when the request fails
        searcher.addblog(details)
# searcher.addblog(blog)
# searcher.search('DbHelper')
print(searcher.search('Python'))

Again, here is how the run turned out.
[Figure: search results for the keyword Python]

Four records were hit, which is an acceptable hit rate.

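Summary

To wrap up: on-site search with whoosh still needs tuning in many places before it matches text with higher precision. QueryParser in particular has a lot more worth digging into.

Highlighting matches in the results is also easy to do; the official documentation covers it in detail.

The last issue is Chinese text. For now I don't have a good way to improve segmentation quality and the hit rate.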
