
Building a crawler search engine on a Mac (nutch+elasticsearch was a failed attempt; switching to scrapy+elasticsearch)


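1. Introduction

My project needs a crawler that can also provide personalized information retrieval and push notifications, so I looked at various crawler frameworks. The most appealing one was this:

Building a search engine with Nutch + MongoDB + ElasticSearch + Kibana

The original English article is at: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/

I considered setting the system up with Docker for testing.

The Docker images come from: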

https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html

https://store.docker.com/community/images/pure/nutch-mongo

However, downloading the Docker images was far too slow, so I gave up on Docker!

Setting JAVA_HOME on the Mac:

vi ~/.bash_profile

export JAVA_HOME=$(/usr/libexec/java_home)
export PATH=$JAVA_HOME/bin:$PATH
export CLASS_PATH=$JAVA_HOME/lib

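2. Installing Mongo

On the Mac, install it directly with brew; the latest version at the time of writing is 3.4.7.

After installation, create the /data/db directory and start the service with mongod.

To test, connect with the mongo command and run show dbs to list the databases.

brew install mongodb
sudo mkdir -p /data/db
sudo chown -R <your-username> /data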

mongod

3. Installing es + kibana

Download es; the latest version is 5.5.1. URL: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.1.tar.gz

Edit the configuration:

$ vim config/elasticsearch.yml
cluster.name: my-application
node.name: "node-1"
node.master: true
node.data: true
path.data: /opt/elasticsearch/data
network.bind_host: 127.0.0.1
network.publish_host: 127.0.0.1
network.host: 127.0.0.1

Run: bin/elasticsearch

Open in a browser: http://localhost:9200

Download Kibana; the latest version is 5.5.1 (use the Mac build).

Run: bin/kibana

Open in a browser: http://localhost:5601
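Before moving on to Nutch, it can help to confirm that the services started so far are actually listening. The following is only a minimal sketch: it assumes mongod, Elasticsearch and Kibana all run on localhost with their default ports (27017, 9200, 5601), and the names SERVICES, port_open and check_services are made up for this example.

```python
import socket

# Assumed defaults: every service runs locally on its standard port.
SERVICES = {
    "mongod": ("localhost", 27017),
    "elasticsearch": ("localhost", 9200),
    "kibana": ("localhost", 5601),
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


def check_services(services=SERVICES):
    """Map each service name to True/False depending on whether its port accepts connections."""
    return {name: port_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'DOWN'}")
```

An open port only shows that something is listening there; opening http://localhost:9200 in the browser remains the real check that Elasticsearch answers.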

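4. Installing Apache Nutch

Download Apache Nutch 2.3.1 (src.tar.gz): http://nutch.apache.org/downloads.html

Set the environment variable (run this from the extracted Nutch source directory): export NUTCH_HOME=$(pwd)

Edit the configuration: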

$ cat conf/nutch-site.xml
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>

Uncomment the MongoDB-related entries. In $NUTCH_HOME/ivy/ivy.xml:

<dependency org="org.apache.gora" name="gora-mongodb" rev="0.5" conf="*->default" />

In $NUTCH_HOME/conf/gora.properties:

############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch

Important! The elastic plugin needs to be updated! The plugin originally depends on Elasticsearch 1.4.1, while the latest version is now 5.5.1. Change it as follows:

cd src/plugin/indexer-elastic/
vi ivy.xml

...
<dependencies>
  <dependency org="org.elasticsearch" name="elasticsearch" rev="5.5.1" conf="*->default" />
</dependencies>
...

ant -f ./build-ivy.xml

Run ls lib to check the downloaded jar versions, then update the version numbers in plugin.xml to match:

<library name="HdrHistogram-2.1.9.jar"/>
<library name="elasticsearch-5.5.1.jar"/>
<library name="hppc-0.7.1.jar"/>
<library name="jackson-core-2.8.6.jar"/>
<library name="jackson-dataformat-cbor-2.8.6.jar"/>
<library name="jackson-dataformat-smile-2.8.6.jar"/>
<library name="jackson-dataformat-yaml-2.8.6.jar"/>
<library name="jna-4.4.0.jar"/>
<library name="joda-time-2.9.5.jar"/>
<library name="jopt-simple-5.0.2.jar"/>
<library name="log4j-api-2.8.2.jar"/>
<library name="lucene-analyzers-common-6.6.0.jar"/>
<library name="lucene-backward-codecs-6.6.0.jar"/>
<library name="lucene-core-6.6.0.jar"/>
<library name="lucene-grouping-6.6.0.jar"/>
<library name="lucene-highlighter-6.6.0.jar"/>
<library name="lucene-join-6.6.0.jar"/>
<library name="lucene-memory-6.6.0.jar"/>
<library name="lucene-misc-6.6.0.jar"/>
<library name="lucene-queries-6.6.0.jar"/>
<library name="lucene-queryparser-6.6.0.jar"/>
<library name="lucene-sandbox-6.6.0.jar"/>
<library name="lucene-spatial-6.6.0.jar"/>
<library name="lucene-spatial-extras-6.6.0.jar"/>
<library name="lucene-spatial3d-6.6.0.jar"/>
<library name="lucene-suggest-6.6.0.jar"/>
<library name="securesm-1.1.jar"/>
<library name="snakeyaml-1.15.jar"/>
<library name="t-digest-3.0.jar"/>
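Typing this list by hand is error-prone, so the ls lib step can be scripted. The following is a minimal sketch, assuming it is run from src/plugin/indexer-elastic after ant -f ./build-ivy.xml has filled the lib directory; the function name library_entries is made up for this example.

```python
from pathlib import Path


def library_entries(lib_dir):
    """Return one <library name="..."/> line per .jar file in lib_dir, sorted by file name."""
    jars = sorted(p.name for p in Path(lib_dir).glob("*.jar"))
    return [f'<library name="{name}"/>' for name in jars]


if __name__ == "__main__":
    # Assumption: the current directory is src/plugin/indexer-elastic.
    print("\n".join(library_entries("lib")))
```

The output is meant to be pasted over the old dependency entries in plugin.xml; any other entries in that file are left for manual editing.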

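However, there was an even bigger pitfall: the plugin's own code fails with errors! I'm done tinkering with it, so I gave up!

Start the build: ant runtime (it ran for 33 minutes!)

Conclusion

1. nutch 2.x and elasticsearch 5.x are not well compatible for now; I did not want to keep tinkering, so I gave up.

2. Next time I will try a new architecture: scrapy + scrapy-redis + mongodb + elasticsearch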
