Building full-text search over tens of millions of rows with PHP, SCWS Chinese word segmentation, Sphinx, and MySQL
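Sphinx is a full-text search engine developed by Andrew Aksyonoff of Russia. It aims to give other applications fast, space-efficient, highly relevant full-text search. Sphinx integrates easily with SQL databases and scripting languages. It ships with built-in data source support for MySQL and PostgreSQL, and it can also read XML in a specific format from standard input.
Indexing is fast: building an index over 1 million records takes only 3 to 4 minutes, an index over 10 million records can finish within 50 minutes, and a delta index holding only the newest 100,000 records rebuilds in a few dozen seconds.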
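The main features of Sphinx are:
a) Fast indexing (peak performance of up to 10 MB/s on a modern CPU);
b) Fast searching (average response time under 0.1 s per query on 2 to 4 GB of text);
d) A strong relevance algorithm: a composite ranking method based on phrase proximity and statistics (BM25);
e) Distributed search;
f) Phrase search;
g) Document excerpt (snippet) generation;
h) Can serve searches as a MySQL storage engine;
i) Multiple query modes, including boolean, phrase, and word proximity;
j) Multiple full-text fields per document (up to 32);
k) Multiple additional attributes per document (for example, group information and timestamps);
l) Stop word support;
MySQL's MyISAM engine does offer full-text indexes, but their performance leaves a lot to be desired.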
Getting started
Environment: CentOS 6.5 + PHP 5.6 + Apache + MySQL
1. Install the dependencies
- yum -y install make gcc g++ gcc-c++ libtool autoconf automake imake php-devel mysql-devel libxml2-devel expat-devel
2. Install Sphinx
- yum install expat expat-devel
- wget -c http://sphinxsearch.com/files/sphinx-2.0.7-release.tar.gz
- tar zxvf sphinx-2.0.7-release.tar.gz
- cd sphinx-2.0.7-release
- ./configure --prefix=/usr/local/sphinx --with-mysql --with-libexpat --enable-id64
- make && make install
3. Install libsphinxclient, which the PHP extension depends on
- cd api/libsphinxclient
- ./configure --prefix=/usr/local/sphinx/libsphinxclient
- make && make install
4. Install the Sphinx PHP extension. My PHP is 5.6, which needs sphinx-1.3.3.tgz; for PHP versions below 5.4, sphinx-1.3.0.tgz can be used.
- wget -c http://pecl.php.net/get/sphinx-1.3.3.tgz
- tar zxvf sphinx-1.3.3.tgz
- cd sphinx-1.3.3
- phpize
- ./configure --with-sphinx=/usr/local/sphinx/libsphinxclient/ --with-php-config=/usr/bin/php-config
- make && make install
- # On success, the output ends with:
- #   Installing shared extensions: /usr/lib64/php/modules/
- echo "[Sphinx]" >> /etc/php.ini
- echo "extension = sphinx.so" >> /etc/php.ini
- # Restart Apache
- service httpd restart
5. Create the test data
- CREATE TABLE IF NOT EXISTS `items` (
- `id` int(11) NOT NULL AUTO_INCREMENT,
- `title` varchar(255) NOT NULL,
- `content` text NOT NULL,
- `created` datetime NOT NULL,
- PRIMARY KEY (`id`)
- ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='全文檢索測試的數據表' AUTO_INCREMENT=11;
- INSERT INTO `items` (`id`, `title`, `content`, `created`) VALUES
- (1, 'linux mysql集群安裝', 'MySQL Cluster 是MySQL 適合於分布式計算環境的高實用、可拓展、高性能、高冗余版本', '2016-09-07 00:00:00'),
- (2, 'mysql主從復制', 'mysql主從備份(復制)的基本原理 mysql支持單向、異步復制,復制過程中一個服務器充當主服務器,而一個或多個其它服務器充當從服務器', '2016-09-06 00:00:00'),
- (3, 'hello', 'can you search me?', '2016-09-05 00:00:00'),
- (4, 'mysql', 'mysql is the best database?', '2016-09-03 00:00:00'),
- (5, 'mysql索引', '關於MySQL索引的好處,如果正確合理設計並且使用索引的MySQL是一輛蘭博基尼的話,那麽沒有設計和使用索引的MySQL就是一個人力三輪車', '2016-09-01 00:00:00'),
- (6, '集群', '關於MySQL索引的好處,如果正確合理設計並且使用索引的MySQL是一輛蘭博基尼的話,那麽沒有設計和使用索引的MySQL就是一個人力三輪車', '0000-00-00 00:00:00'),
- (9, '復制原理', 'redis也有復制', '0000-00-00 00:00:00'),
- (10, 'redis集群', '集群技術是構建高性能網站架構的重要手段,試想在網站承受高並發訪問壓力的同時,還需要從海量數據中查詢出滿足條件的數據,並快速響應,我們必然想到的是將數據進行切片,把數據根據某種規則放入多個不同的服務器節點,來降低單節點服務器的壓力', '0000-00-00 00:00:00');
- CREATE TABLE IF NOT EXISTS `sph_counter` (
- `counter_id` int(11) NOT NULL,
- `max_doc_id` int(11) NOT NULL,
- PRIMARY KEY (`counter_id`)
- ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COMMENT='增量索引標示的計數表';
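The setup below uses a "Main + Delta" (main index plus delta index) strategy, together with Sphinx's built-in unigram (single-character) segmentation.
6. Configure Sphinx. Remember to change the data source settings (host, user, password, database) to match your own.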
- vi /usr/local/sphinx/etc/sphinx.conf
- source items {
- type = mysql
- sql_host = localhost
- sql_user = root
- sql_pass = 123456
- sql_db = sphinx_items
- sql_query_pre = SET NAMES utf8
- sql_query_pre = SET SESSION query_cache_type = OFF
- sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM items
- sql_query_range = SELECT MIN(id), MAX(id) FROM items \
- WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
- sql_range_step = 1000
- sql_ranged_throttle = 1000
- sql_query = SELECT id, title, content, created, 0 as deleted FROM items \
- WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1) \
- AND id >= $start AND id <= $end
- sql_attr_timestamp = created
- sql_attr_bool = deleted
- }
- source items_delta : items {
- sql_query_pre = SET NAMES utf8
- sql_query_range = SELECT MIN(id), MAX(id) FROM items \
- WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
- sql_query = SELECT id, title, content, created, 0 as deleted FROM items \
- WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 ) \
- AND id >= $start AND id <= $end
- sql_query_post_index = set @max_doc_id :=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
- sql_query_post_index = REPLACE INTO sph_counter SELECT 2, IF($maxid, $maxid, @max_doc_id)
- }
- # Main index
- index items {
- source = items
- path = /usr/local/sphinx/var/data/items
- docinfo = extern
- morphology = none
- min_word_len = 1
- min_prefix_len = 0
- html_strip = 1
- html_remove_elements = style, script
- ngram_len = 1
- ngram_chars = U+3000..U+2FA1F
- charset_type = utf-8
- charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
- preopen = 1
- min_infix_len = 1
- }
- # Delta index
- index items_delta : items {
- source = items_delta
- path = /usr/local/sphinx/var/data/items-delta
- }
- # Distributed index
- index master {
- type = distributed
- local = items
- local = items_delta
- }
- indexer {
- mem_limit = 256M
- }
- searchd {
- listen = 9312
- listen = 9306:mysql41 #Used for SphinxQL
- log = /usr/local/sphinx/var/log/searchd.log
- query_log = /usr/local/sphinx/var/log/query.log
- compat_sphinxql_magics = 0
- attr_flush_period = 600
- mva_updates_pool = 16M
- read_timeout = 5
- max_children = 0
- dist_threads = 2
- pid_file = /usr/local/sphinx/var/log/searchd.pid
- max_matches = 1000
- seamless_rotate = 1
- preopen_indexes = 1
- unlink_old = 1
- workers = threads # for RT to work
- binlog_path = /usr/local/sphinx/var/data
- }
Save the file and exit.
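The two source blocks in this configuration split the table around the value stored in sph_counter: the main source selects rows with id <= max_doc_id, and the delta source selects rows with id > max_doc_id. The sketch below only imitates that comparison on a list of ids to make the split visible. `partition_ids` is a hypothetical helper written for this illustration, not something Sphinx provides.

```shell
#!/bin/bash
# Illustration only: mimic the WHERE clauses of the "items" and
# "items_delta" sources. partition_ids is a made-up helper.
partition_ids() {
  local max_doc_id=$1
  shift
  local id main="" delta=""
  for id in "$@"; do
    if [ "$id" -le "$max_doc_id" ]; then
      main="$main $id"      # main source:  id <= max_doc_id
    else
      delta="$delta $id"    # delta source: id >  max_doc_id
    fi
  done
  echo "main:$main"
  echo "delta:$delta"
}

# The counter row says 10; rows 11 and 12 arrived after the last main build.
partition_ids 10 1 2 9 10 11 12
```

With the sample call, ids 1, 2, 9 and 10 land in the main group and 11 and 12 in the delta group, which is why a freshly built delta index over the test data is empty.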
7. Build the Sphinx indexes
- # The first run must build all indexes (run from /usr/local/sphinx/bin):
- ./indexer -c /usr/local/sphinx/etc/sphinx.conf --all
- Sphinx 2.0.7-id64-release (r3759)
- Copyright (c) 2001-2012, Andrew Aksyonoff
- Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
- using config file '/usr/local/sphinx/etc/sphinx.conf'...
- indexing index 'items'...
- collected 8 docs, 0.0 MB
- sorted 0.0 Mhits, 100.0% done
- total 8 docs, 1121 bytes
- total 1.017 sec, 1101 bytes/sec, 7.86 docs/sec
- indexing index 'items_delta'...
- collected 0 docs, 0.0 MB
- total 0 docs, 0 bytes
- total 1.007 sec, 0 bytes/sec, 0.00 docs/sec
- skipping non-plain index 'master'...
- total 4 reads, 0.000 sec, 0.7 kb/call avg, 0.0 msec/call avg
- total 14 writes, 0.001 sec, 0.5 kb/call avg, 0.1 msec/call avg
- # Start searchd
- ./searchd -c /usr/local/sphinx/etc/sphinx.conf
- Sphinx 2.0.7-id64-release (r3759)
- Copyright (c) 2001-2012, Andrew Aksyonoff
- Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
- using config file '/usr/local/sphinx/etc/sphinx.conf'...
- listening on all interfaces, port=9312
- listening on all interfaces, port=9306
- precaching index 'items'
- precaching index 'items_delta'
- rotating index 'items_delta': success
- precached 2 indexes in 0.012 sec
- # Check the process
- ps -ef | grep searchd
- root 30431 1 0 23:59 ? 00:00:00 ./searchd -c /usr/local/sphinx/etc/sphinx.conf
- root 30432 30431 0 23:59 ? 00:00:00 ./searchd -c /usr/local/sphinx/etc/sphinx.conf
- root 30437 1490 0 23:59 pts/0 00:00:00 grep searchd
- # Stop searchd:
- ./searchd -c /usr/local/sphinx/etc/sphinx.conf --stop
- # Show searchd status:
- ./searchd -c /usr/local/sphinx/etc/sphinx.conf --status
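For scripts, grepping the `ps` output is awkward. A lighter check is to read the pid file that the configuration above points searchd at (`pid_file = /usr/local/sphinx/var/log/searchd.pid`) and test whether that process is alive. This is a minimal sketch: `searchd_running` is a hypothetical helper, and it assumes the check runs as root or as the same user as searchd, since `kill -0` fails on processes owned by someone else.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the PID stored in a pid file belongs
# to a live process that the current user is allowed to signal.
searchd_running() {
  local pid_file=$1 pid
  [ -r "$pid_file" ] || return 1             # no pid file: treat as not running
  pid=$(head -n 1 "$pid_file" 2>/dev/null)
  case $pid in
    ''|*[!0-9]*) return 1 ;;                 # empty or non-numeric content
  esac
  kill -0 "$pid" 2>/dev/null                 # signal 0 only tests for existence
}

if searchd_running /usr/local/sphinx/var/log/searchd.pid; then
  echo "searchd is running"
else
  echo "searchd is not running"
fi
```

A stale pid file whose number has been reused by another process would still pass this check, so treat it as a quick probe rather than proof.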
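Index updates and usage notes
The delta index is refreshed every N minutes. Typically the indexes are merged once a night while load is low, and the delta index is rebuilt at the same time. If the main index does not hold much data, you can of course simply rebuild the main index instead.
When searching through the API, query both the main index and the delta index to get near-real-time results. The Sphinx configuration in this article puts both into the distributed index master, so querying "master" alone returns every match, including the newest data.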
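As a quick way to try the distributed index from the command line, the sketch below builds a SphinxQL statement against it. `sphinxql_match` is a hypothetical helper; it escapes backslashes and single quotes so the keywords stay inside the string literal, and it makes no attempt to escape operators of the extended query syntax. Sending the statement to searchd assumes the `listen = 9306:mysql41` line from the configuration is active.

```shell
#!/bin/bash
# Hypothetical helper: print a SphinxQL full-text query for an index.
sphinxql_match() {
  local index=$1 keywords=$2 limit=${3:-10}
  # Escape backslashes first, then single quotes.
  keywords=$(printf '%s' "$keywords" | sed -e 's/\\/\\\\/g' -e "s/'/\\\\'/g")
  printf "SELECT * FROM %s WHERE MATCH('%s') LIMIT %d;\n" "$index" "$keywords" "$limit"
}

sphinxql_match master 'mysql集群' 10
# To run the statement against searchd:
#   sphinxql_match master 'mysql集群' 10 | mysql -h127.0.0.1 -P9306
```

Because "master" lists both items and items_delta as local indexes, the same statement covers old and new rows without naming either index.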
Index updates and merges can be handled by cron jobs:
- crontab -e
- */1 * * * * /usr/local/sphinx/shell/delta_index_update.sh
- 0 3 * * * /usr/local/sphinx/shell/merge_daily_index.sh
- crontab -l
Example shell scripts for the cron jobs:
delta_index_update.sh:
- #!/bin/bash
- /usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta > /dev/null 2>&1
merge_daily_index.sh:
- #!/bin/bash
- # cron runs with a minimal PATH, so use the full path (the same one delta_index_update.sh uses)
- indexer=/usr/local/sphinx/bin/indexer
- mysql=`which mysql`
- QUERY="use sphinx_items;select max_doc_id from sph_counter where counter_id = 2 limit 1;"
- index_counter=$($mysql -h192.168.1.198 -uroot -p123456 -sN -e "$QUERY")
- #merge "main + delta" indexes
- $indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate --merge items items_delta --merge-dst-range deleted 0 0 >> /usr/local/sphinx/var/index_merge.log 2>&1
- if [ "$?" -eq 0 ]; then
- ##update sphinx counter
- if [ -n "$index_counter" ]; then
- $mysql -h192.168.1.198 -uroot -p123456 -Dsphinx_items -e "REPLACE INTO sph_counter VALUES (1, '$index_counter')"
- fi
- ##rebuild delta index to avoid confusion with main index
- $indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta >> /usr/local/sphinx/var/rebuild_deltaindex.log 2>&1
- fi
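With the crontab shown earlier, the every-minute delta update also fires at 03:00, at the same moment as the nightly merge, so the two scripts can end up running indexer against items_delta at once. One way to avoid that is a shared lock that both scripts take before calling indexer. The sketch below uses `mkdir`, which either creates the directory or fails atomically when it already exists. The helper names and the lock path are assumptions made for this example.

```shell
#!/bin/bash
# Hypothetical lock helpers built on mkdir.
acquire_lock() {
  mkdir "$1" 2>/dev/null
}
release_lock() {
  rmdir "$1" 2>/dev/null
}

# Assumed location; any directory writable by the cron user works.
LOCK_DIR=${LOCK_DIR:-/tmp/sphinx_indexer.lock}

if acquire_lock "$LOCK_DIR"; then
  trap 'release_lock "$LOCK_DIR"' EXIT   # release the lock when the script ends
  echo "lock acquired, safe to run indexer"
  # /usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta
else
  echo "another indexing job is running, skipping this run"
fi
```

If a script is killed hard enough that the trap never runs, the lock directory stays behind and has to be removed by hand, so it is worth logging the skipped runs.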
8. Install SCWS, the Chinese word segmenter for PHP. Pay attention to the extension version and your PHP version.
- wget -c http://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2
- tar jxvf scws-1.2.3.tar.bz2
- cd scws-1.2.3
- ./configure --prefix=/usr/local/scws
- make && make install
9. Install the SCWS PHP extension:
- cd ./phpext
- phpize
- ./configure --with-scws=/usr/local/scws
- make && make install
- echo "[scws]" >> /etc/php.ini
- echo "extension = scws.so" >> /etc/php.ini
- echo "scws.default.charset = utf-8" >> /etc/php.ini
- echo "scws.default.fpath = /usr/local/scws/etc/" >> /etc/php.ini
10. Install the dictionary:
- wget http://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2
- tar jxvf scws-dict-chs-utf8.tar.bz2 -C /usr/local/scws/etc/
- chown www:www /usr/local/scws/etc/dict.utf8.xdb
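11. A PHP test example using Sphinx + SCWS
The Sphinx source tree ships API clients for several languages, one of which is sphinxapi.php.
The test below, however, uses the Sphinx PHP extension; see the Sphinx installation steps earlier in this article.
The search class used for testing, Search.php. Change the connection details in getDBConnection() to your own.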
- <?php
- class Search {
- /**
- * @var SphinxClient
- **/
- protected $client;
- /**
- * @var string
- **/
- protected $keywords;
- /**
- * @var array
- **/
- protected $options = array();
- /**
- * @var resource
- **/
- private static $dbconnection = null;
- /**
- * Constructor
- **/
- public function __construct($options = array()) {
- $defaults = array(
- 'query_mode' => SPH_MATCH_EXTENDED2,
- 'sort_mode' => SPH_SORT_EXTENDED,
- 'ranking_mode' => SPH_RANK_PROXIMITY_BM25,
- 'field_weights' => array(),
- 'max_matches' => 1000,
- 'snippet_enabled' => true,
- 'snippet_index' => 'items',
- 'snippet_fields' => array(),
- );
- $this->options = array_merge($defaults, $options);
- $this->client = new SphinxClient();
- //$this->client->setServer("192.168.1.198", 9312);
- $this->client->setMatchMode($this->options['query_mode']);
- if ($this->options['field_weights'] !== array()) {
- $this->client->setFieldWeights($this->options['field_weights']);
- }
- /*
- if ( in_array($this->options['query_mode'], [SPH_MATCH_EXTENDED2,SPH_MATCH_EXTENDED]) ) {
- $this->client->setRankingMode($this->options['ranking_mode']);
- }
- */
- }
- /**
- * Query
- *
- * @param string $keywords
- * @param integer $offset
- * @param integer $limit
- * @param string $index
- * @return array
- **/
- public function query($keywords, $offset = 0, $limit = 10, $index = '*') {
- $this->keywords = $keywords;
- $max_matches = $limit > $this->options['max_matches'] ? $limit : $this->options['max_matches'];
- $this->client->setLimits($offset, $limit, $max_matches);
- $query_results = $this->client->query($keywords, $index);
- if ($query_results === false) {
- $this->log('error:' . $this->client->getLastError());
- }
- $res = [];
- if ( empty($query_results['matches']) ) {
- return $res;
- }
- $res['total'] = $query_results['total'];
- $res['total_found'] = $query_results['total_found'];
- $res['time'] = $query_results['time'];
- $doc_ids = array_keys($query_results['matches']);
- unset($query_results);
- $res['data'] = $this->fetch_data($doc_ids);
- if ($this->options['snippet_enabled']) {
- $this->buildExcerptRows($res['data']);
- }
- return $res;
- }
- /**
- * custom sorting
- *
- * @param string $sortBy
- * @param int $mode
- * @return bool
- **/
- public function setSortBy($sortBy = '', $mode = 0) {
- if ($sortBy) {
- $mode = $mode ?: $this->options['sort_mode'];
- $this->client->setSortMode($mode, $sortBy);
- } else {
- $this->client->setSortMode(SPH_SORT_RELEVANCE);
- }
- }
- /**
- * fetch data based on matched doc_ids
- *
- * @param array $doc_ids
- * @return array
- **/
- protected function fetch_data($doc_ids) {
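- $ids = implode(',', array_map('intval', $doc_ids));
- // ORDER BY FIELD keeps the rows in the order Sphinx returned them
- $queries = self::getDBConnection()->query("SELECT * FROM items WHERE id IN ($ids) ORDER BY FIELD(id, $ids)", PDO::FETCH_ASSOC);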
- return iterator_to_array($queries);
- }
- /**
- * build excerpts for data
- *
- * @param array $rows
- * @return array
- **/
- protected function buildExcerptRows(&$rows) {
- $options = array(
- 'before_match' => '<b style="color:red">',
- 'after_match' => '</b>',
- 'chunk_separator' => '...',
- 'limit' => 256,
- 'around' => 3,
- 'exact_phrase' => false,
- 'single_passage' => true,
- 'limit_words' => 5,
- );
- $scount = count($this->options['snippet_fields']);
- foreach ($rows as &$row) {
- foreach ($row as $fk => $item) {
- if (!is_string($item) || ($scount && !in_array($fk, $this->options['snippet_fields'])) ) continue;
- $item = preg_replace('/[\r\t\n]+/', '', strip_tags($item));
- $res = $this->client->buildExcerpts(array($item), $this->options['snippet_index'], $this->keywords, $options);
- $row[$fk] = $res === false ? $item : $res[0];
- }
- }
- return $rows;
- }
- /**
- * database connection
- *
- * @return resource
- **/
- private static function getDBConnection() {
- // charset=utf8 keeps the Chinese text intact on the connection
- $dsn = 'mysql:host=192.168.1.198;dbname=sphinx_items;charset=utf8';
- $user = 'root';
- $pass = '123456';
- if (!self::$dbconnection) {
- try {
- self::$dbconnection = new PDO($dsn, $user, $pass);
- } catch (PDOException $e) {
- die('Connection failed: ' . $e->getMessage());
- }
- }
- return self::$dbconnection;
- }
- /**
- * Chinese words segmentation
- *
- **/
- public function wordSplit($keywords) {
- $fpath = ini_get('scws.default.fpath');
- $so = scws_new();
- $so->set_charset('utf-8');
- $so->add_dict($fpath . '/dict.utf8.xdb');
- //$so->add_dict($fpath .'/custom_dict.txt', SCWS_XDICT_TXT);
- $so->set_rule($fpath . '/rules.utf8.ini');
- $so->set_ignore(true);
- $so->set_multi(false);
- $so->set_duality(false);
- $so->send_text($keywords);
- $words = [];
- // get_result() returns one batch of words at a time and false once the text is exhausted
- while ($results = $so->get_result()) {
- foreach ($results as $res) {
- $words[] = '(' . $res['word'] . ')';
- }
- }
- $so->close();
- // also search for the original phrase as a whole
- $words[] = '(' . $keywords . ')';
- return join('|', $words);
- }
- /**
- * get current sphinx client
- *
- * @return resource
- **/
- public function getClient() {
- return $this->client;
- }
- /**
- * log error
- **/
- public function log($msg) {
- // log errors here
- //echo $msg;
- }
- /**
- * magic methods
- **/
- public function __call($method, $args) {
- $rc = new ReflectionClass('SphinxClient');
- if ( !$rc->hasMethod($method) ) {
- throw new Exception('invalid method :' . $method);
- }
- return call_user_func_array(array($this->client, $method), $args);
- }
- }
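The string that wordSplit() hands to Sphinx is plain text: each segmented word is wrapped in parentheses, the original phrase is appended the same way, and everything is joined with the OR operator "|" of the extended query syntax. The sketch below reproduces only that string assembly so its shape can be checked from the shell. `build_or_query` is a hypothetical stand-in, and the word list passed to it is supplied by hand rather than produced by SCWS.

```shell
#!/bin/bash
# Hypothetical stand-in for the string assembly in wordSplit():
# wrap each term in parentheses and join the terms with "|".
build_or_query() {
  local term query=""
  for term in "$@"; do
    query="${query:+$query|}($term)"
  done
  printf '%s\n' "$query"
}

# Segmented word(s) first, the original phrase last, as wordSplit() does.
build_or_query mysql 'mysql集群'
```

The sample call prints (mysql)|(mysql集群), the same string that the comment in test.php reports for the input "mysql集群". Which words actually appear before the full phrase depends on the SCWS dictionary and rule files.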
Test file test.php:
- <?php
- require('Search.php');
- $s = new Search([
- 'snippet_fields' => ['title', 'content'],
- 'field_weights' => ['title' => 20, 'content' => 10],
- ]);
- $s->setSortMode(SPH_SORT_EXTENDED, 'created desc,@weight desc');
- //$s->setSortBy('created desc,@weight desc');
- $words = $s->wordSplit("mysql集群"); // segment the keywords first; result: (mysql)|(mysql集群)
- //print_r($words);exit;
- $res = $s->query($words, 0, 10, 'master');
- echo '<pre>';print_r($res);
Test result: (the screenshot was not preserved)
12. Testing SphinxQL
To use SphinxQL, searchd must be configured with an additional listening port for it (see the `listen = 9306:mysql41` line in the configuration above).
- mysql -h127.0.0.1 -P9306 -uroot -p
- Enter password:
- Welcome to the MySQL monitor. Commands end with ; or \g.
- Your MySQL connection id is 1
- Server version: 2.0.7-id64-release (r3759)
- Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.
- Oracle is a registered trademark of Oracle Corporation and/or its
- affiliates. Other names may be trademarks of their respective
- owners.
- Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
- mysql> show global variables;
- +----------------------+---------+
- | Variable_name | Value |
- +----------------------+---------+
- | autocommit | 1 |
- | collation_connection | libc_ci |
- | query_log_format | plain |
- | log_level | info |
- +----------------------+---------+
- 4 rows in set (0.00 sec)
- mysql> desc items;
- +---------+-----------+
- | Field | Type |
- +---------+-----------+
- | id | bigint |
- | title | field |
- | content | field |
- | created | timestamp |
- | deleted | bool |
- +---------+-----------+
- 5 rows in set (0.00 sec)
- mysql> select * from master where match ('mysql集群') limit 10;
- +------+---------+---------+
- | id | created | deleted |
- +------+---------+---------+
- | 1 | 2016 | 0 |
- | 6 | 0 | 0 |
- +------+---------+---------+
- 2 rows in set (0.00 sec)
- mysql> show meta;
- +---------------+-------+
- | Variable_name | Value |
- +---------------+-------+
- | total | 2 |
- | total_found | 2 |
- | time | 0.006 |
- | keyword[0] | mysql |
- | docs[0] | 5 |
- | hits[0] | 15 |
- | keyword[1] | 集 |
- | docs[1] | 3 |
- | hits[1] | 4 |
- | keyword[2] | 群 |
- | docs[2] | 3 |
- | hits[2] | 4 |
- +---------------+-------+
- 12 rows in set (0.00 sec)
- mysql>