Nutch 2.2.1 Installation and Deployment

Enough has changed from Nutch 2.1 to Nutch 2.2 to warrant an update to the installation instructions. These instructions assume Ubuntu 12.04 and Java 7 installed and JAVA_HOME configured.

Install MySQL Server and MySQL Client using the Ubuntu Software Center or by running sudo apt-get install mysql-server mysql-client at the command line.

Because MySQL defaults to the latin1 character set, open the configuration file with sudo vi /etc/mysql/my.cnf and add the following under [mysqld]:

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M

The innodb options help work around MySQL's small limit on primary key size. The character set and collation settings are needed to handle Unicode correctly. The max_allowed_packet setting is optional and only necessary when very large values have to be sent to the server. Restart MySQL (for example with sudo service mysql restart, or by rebooting the machine) for the changes to take effect.
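To confirm that the file actually contains the settings listed above, a small check like the following can help. It is a minimal sketch and not part of the original instructions: check_mycnf is a hypothetical helper name, and it only looks for the exact key spellings used here (MySQL also accepts underscore variants such as character_set_server, which this check would report as missing).

```shell
# check_mycnf FILE
# Hypothetical helper: prints each expected key that has no "key = value"
# line in FILE and returns non-zero if any key is missing.
check_mycnf() {
  local file="$1" status=0 key
  for key in innodb_file_format innodb_file_per_table innodb_large_prefix \
             character-set-server collation-server; do
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
      echo "missing: ${key}"
      status=1
    fi
  done
  return "$status"
}

# Example (path from the text above):
#   check_mycnf /etc/mysql/my.cnf && echo "all settings present"
```

The check does not verify that the lines sit under [mysqld] or that the values are correct, so it is only a quick guard against typos in the key names.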

Check to make sure MySQL is running by typing sudo netstat -tap | grep mysql and you should see something like

tcp 0 0 localhost:mysql *:* LISTEN

We need to set up the nutch database manually because the schema currently generated by Nutch/Gora/MySQL defaults to latin1. Log into MySQL at the command line with the MySQL user and password you set up earlier:

mysql -u xxxxx -p

Then, at the mysql prompt, type the following:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

Press Enter, then type

use nutch;

Press Enter again, then copy and paste the following as a single statement:

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` longtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`prevModifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

Then press Enter. The MySQL database for Nutch is now set up.

Set up Nutch 2.2 by downloading the apache-nutch-2.2-src.tar.gz version from http://www.apache.org/dyn/closer.cgi/nutch/. Untar the contents of the downloaded file to a folder that we will refer to from here on as ${APACHE_NUTCH_HOME}. I prefer to use it with Eclipse, so I untar it in the Eclipse workspace, but this is not necessary.

From inside the nutch folder ensure the MySQL dependency for Nutch is available by editing the following in ${APACHE_NUTCH_HOME}/ivy/ivy.xml

change
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
to
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

and uncomment the gora-sql
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />

and uncomment the mysql connector
<!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Edit the ${APACHE_NUTCH_HOME}/conf/gora.properties file either deleting or commenting out the Default SqlStore Properties using #. Then add the MySQL properties below replacing xxxxx with the user and password you set up when installing MySQL earlier.

###############################
# MySQL properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
gora.sqlstore.jdbc.password=xxxxx

Edit the ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml file, changing the length of the primarykey from 512 to 767 in both places.
<primarykey column="id" length="767"/>
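The same edit can be scripted instead of made by hand. The sketch below assumes that the primarykey elements in the file are written with straight quotes and literally contain length="512"; set_primarykey_length is a hypothetical helper name, not something shipped with Nutch.

```shell
# set_primarykey_length FILE
# Hypothetical helper: rewrites length="512" to length="767" on every
# <primarykey .../> element in FILE, leaving all other lines untouched.
set_primarykey_length() {
  local file="$1" tmp
  tmp="$(mktemp)" || return 1
  sed 's/\(<primarykey[^>]*length="\)512"/\1767"/' "$file" > "$tmp" || return 1
  cat "$tmp" > "$file"   # write back in place, keeping the file's permissions
  rm -f "$tmp"
}

# Example (path from the text above):
#   set_primarykey_length "${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml"
#   grep -n '<primarykey' "${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml"
```

Running the grep afterwards should show both primarykey lines with length 767. Other elements that happen to use length="512" are not changed, because the pattern only matches inside a primarykey tag.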

Configure ${APACHE_NUTCH_HOME}/conf/nutch-site.xml by entering a name in the value field under http.agent.name. It can be anything but cannot be left blank. Add additional languages if you want (I have added Japanese, ja-jp, below) and set utf-8 as the default encoding as well. You must specify SqlStore as the data store class.

<property>
<name>http.agent.name</name>
<value>YourNutchSpider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

Install ant using the Ubuntu software center or sudo apt-get install ant at the command line.

From the command line, cd to your Nutch folder.

If you are using Eclipse, type ant eclipse. When that is finished, start up Eclipse and go to File -> Import -> Existing Projects into Workspace -> Browse and add ${APACHE_NUTCH_HOME}. Go to the new project in the Eclipse project explorer and scroll down until you find build.xml. Right click on build.xml and select Run As -> 1 Ant Build. This may take a little while to compile.

If you are not using Eclipse, cd to ${APACHE_NUTCH_HOME} and simply type ant runtime. This may take a few minutes to compile.

Start your first crawl by typing the lines below at the terminal (replace 'http://nutch.apache.org/' with whatever site you want to crawl).

Inject a URL into the DB:

cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt

Start crawling (you will want to create your own script later, but for now type the following commands manually to see what is happening):

bin/nutch inject urls


bin/nutch generate -topN 20
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb

Repeat the last four commands (generate, fetch, parse and updatedb) again.
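Once the individual steps are familiar, the repetition can be wrapped in a small loop. This is only a sketch of such a script, assuming it is run from ${APACHE_NUTCH_HOME}/runtime/local; crawl_rounds and the NUTCH variable are hypothetical names introduced here, and the four commands are exactly the ones shown above.

```shell
# NUTCH is the launcher to call. Run from ${APACHE_NUTCH_HOME}/runtime/local,
# or set NUTCH=echo to print the commands instead of running them.
NUTCH="${NUTCH:-bin/nutch}"

# crawl_rounds ROUNDS [TOPN]
# Hypothetical helper: repeats generate/fetch/parse/updatedb ROUNDS times and
# stops at the first command that fails.
crawl_rounds() {
  local rounds="$1" topn="${2:-20}" i=1
  while [ "$i" -le "$rounds" ]; do
    echo "== round ${i} of ${rounds} =="
    "$NUTCH" generate -topN "$topn" || return 1
    "$NUTCH" fetch -all || return 1
    "$NUTCH" parse -all || return 1
    "$NUTCH" updatedb || return 1
    i=$((i + 1))
  done
}

# Example: two more rounds with the same topN as above.
#   crawl_rounds 2 20
```

Setting NUTCH=echo first gives a dry run that prints each command without touching the database, which is a cheap way to check the loop before a long crawl.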

For the generate command, topN is the maximum number of links you want to fetch and parse in each round. The first time there is only one URL (the one we injected from seed.txt), but after that there are many more. Note, however, that Nutch keeps track of all links it encounters in the webpage table. It only limits the number it actually processes to topN, so don't be surprised to see many more rows in the webpage table than the topN limit would suggest.

Check your crawl results by looking at the webpage table in the nutch database.

mysql -u xxxxx -p
use nutch;
SELECT * FROM nutch.webpage;

You should see the results of your crawl (around 320 rows). The columns will be hard to read at the command line, so you may want to install MySQL Workbench via sudo apt-get install mysql-workbench and use that for viewing the data instead. You may also want to run the SQL command select * from webpage where status = 2; to limit the rows from the webpage table to only the URLs that were actually parsed.

You can easily add more URLs to crawl by adding them by hand to seed.txt and then running bin/nutch inject urls again.
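Adding seeds can also be scripted so that the same URL is not listed twice. The helper below is a hypothetical sketch (add_seed is not a Nutch command), and http://example.org/ is only a placeholder URL.

```shell
# add_seed URL [FILE]
# Hypothetical helper: appends URL to the seed file (default urls/seed.txt)
# unless an identical line is already there.
add_seed() {
  local url="$1" file="${2:-urls/seed.txt}"
  mkdir -p "$(dirname "$file")"
  touch "$file"
  grep -qxF -- "$url" "$file" || printf '%s\n' "$url" >> "$file"
}

# Example, run from ${APACHE_NUTCH_HOME}/runtime/local:
#   add_seed 'http://example.org/'
#   bin/nutch inject urls
```

The comparison is an exact line match, so http://example.org and http://example.org/ would be treated as two different seeds.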

Set up and index with Solr

If you are using Nutch 2.2 at this time you are on the bleeding edge and probably want the latest version of Solr 4 as well. Untar it to $HOME/apache-solr-4.X.X-XXXX. This folder will now be referred to as ${APACHE_SOLR_HOME}.
Download this link and use it to replace ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.

From the terminal start solr:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar

You can check this is running by opening http://localhost:8983/solr in your web browser. Select collection1 from the core selector.

Leave that terminal running and from a different terminal type the following:
cd ${APACHE_NUTCH_HOME}/runtime/local/
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex

You can now run queries in Solr against your crawled content. Open http://localhost:8983/solr/#/collection1/query and, assuming you have crawled nutch.apache.org, enter content:nutch in the input box titled "q" to run a search. The results should list matching pages from your crawl.
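The same query can also be sent from the terminal through Solr's HTTP select handler. The sketch below only builds the request URL; solr_select_url is a hypothetical helper name, the base URL and core name are the ones used in this walkthrough, and the query is passed through without URL encoding, so it is only suitable for simple queries without spaces or special characters.

```shell
# solr_select_url QUERY [BASE] [CORE]
# Hypothetical helper: prints the URL of Solr's /select handler for QUERY.
# BASE and CORE default to the values used in the text above.
solr_select_url() {
  local query="$1" base="${2:-http://localhost:8983/solr}" core="${3:-collection1}"
  printf '%s/%s/select?q=%s&wt=json&indent=true\n' "${base%/}" "$core" "$query"
}

# Example (needs curl and a running Solr):
#   curl -s "$(solr_select_url 'content:nutch')"
```

The wt=json and indent=true parameters ask Solr for indented JSON, which is easier to read in a terminal than the default XML.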

There remains a lot to configure to get a good web search going but you are at least started.
