Solr Indexing (from the official Solr documentation)

阿新 • Published: 2019-02-02

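Solr can index data from many different sources, including XML files, CSV files, database tables, and files in common formats such as Microsoft Word and PDF.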

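There are three common ways to load data into a Solr index:

1. Use the Solr Cell framework to extract content from binary or structured files, such as Microsoft Word and PDF.

2. Send HTTP requests to the Solr server to upload XML files.

3. Use Solr's Java API.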

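A Solr index contains one or more documents, and a document contains multiple fields; a field may be empty. One field is usually designated as the unique ID (similar to a primary key in a database), although such a field is not strictly required.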

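Field names normally correspond to fields defined in schema.xml, and each field is analyzed according to the steps defined in that file. If a field is not defined in schema.xml, it is either ignored or mapped to a dynamic field defined in schema.xml.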
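The fallback from explicit fields to dynamic fields can be pictured with a small sketch. This only illustrates the matching idea and is not Solr's implementation; the field names and the patterns `*_s` and `*_txt` are assumptions chosen for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical schema contents, chosen only for illustration.
EXPLICIT_FIELDS = {"id", "title", "isbn"}
DYNAMIC_PATTERNS = ["*_s", "*_txt"]


def resolve_field(name):
    """Return how an incoming field name would be matched, or None if unmatched."""
    if name in EXPLICIT_FIELDS:
        return "explicit:" + name
    for pattern in DYNAMIC_PATTERNS:
        if fnmatchcase(name, pattern):
            return "dynamic:" + pattern
    return None  # neither an explicit field nor a dynamic pattern matched


for field in ("title", "author_s", "mystery"):
    print(field, "->", resolve_field(field))
```

A name that matches nothing returns None here, which stands in for the "ignored" case described above.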

1. Index handlers

An index handler is a type of request handler used to add, delete, and update documents. To import large numbers of documents, you can also use Tika or the Data Import Handler (for structured data).

XML: set the header Content-type: application/xml or Content-type: text/xml.

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" \
--data-binary '
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="dd">796.35</field>
<field name="isbn">0002166313</field>
<field name="yearpub">1982</field>
<field name="publisher">Collins</field>
</doc>
</add>'
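The same XML update message can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_add_payload` is hypothetical, and the URL simply reuses the `my_collection` example above. The request is constructed but not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_payload(fields):
    """Serialize a dict of field values into an <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="utf-8")


payload = build_add_payload({
    "authors": "Patrick Eagar",
    "subject": "Sports",
    "isbn": "0002166313",
})

# The request is only built here; pass it to urllib.request.urlopen()
# to send it to a running Solr instance.
request = urllib.request.Request(
    "http://localhost:8983/solr/my_collection/update",
    data=payload,
    headers={"Content-Type": "text/xml"},
    method="POST",
)
print(payload.decode("utf-8"))
```

Building the message with an XML library rather than string concatenation means field values containing characters such as `<` or `&` are escaped automatically.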

JSON: set the header Content-Type: application/json or Content-Type: text/json.

curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/my_collection/update' --data-binary '
{
"add": {
"doc": {
"id": "DOC1",
"my_boosted_field": { /* use a map with boost/value for a boosted field
*/
"boost": 2.3,
"value": "test"
},
"my_multivalued_field": [ "aaa", "bbb" ] /* Can use an array for a
multi-valued field */
}
},
"add": {
"commitWithin": 5000, /* commit this document within 5 seconds */
"overwrite": false, /* do not check for existing documents with the
same uniqueKey */
"boost": 3.45, /* a document boost */
"doc": {
"f1": "v1", /* Can use repeated keys for a multi-valued field
*/
"f1": "v2"
}
},
"commit": {},
"optimize": { "waitSearcher":false },
"delete": { "id":"ID" }, /* delete by ID */
"delete": { "query":"QUERY" } /* delete by query */
}'

----------------------------------------------------------------------------------------------------------------------

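Note: JSON does not allow comments, but duplicate names are permitted. The comments in the example above are for illustration only and should be removed from a real request.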
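A comment-free update message can be generated from ordinary data structures. The sketch below is one way to do that with Python's `json` module; the field names are taken from the example above. Because a Python dict cannot hold duplicate keys, the multi-valued field is expressed as an array and only a single `add` command is shown.

```python
import json

update_message = {
    "add": {
        "commitWithin": 5000,  # ask Solr to commit within 5 seconds
        "doc": {
            "id": "DOC1",
            "my_multivalued_field": ["aaa", "bbb"],  # array for a multi-valued field
        },
    },
    "commit": {},
}

# The serialized body contains no comments, so it is valid JSON.
body = json.dumps(update_message).encode("utf-8")
print(body.decode("utf-8"))
```

The resulting bytes can be used as the `--data-binary` payload of the curl command, or as the body of any other HTTP client request.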

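2. Solr Cell using Apache Tika: Apache Tika uses existing parser libraries to detect and extract metadata and structured content from documents in various formats (for example HTML, PDF, and DOC).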

Solr Cell is exposed through the ExtractingRequestHandler.
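As a rough sketch of what a request to this handler can look like, the snippet below builds a POST that carries the raw file content in the request body. It assumes the handler is registered at `/update/extract` (the path used in Solr's sample configurations) and that the `literal.id` and `commit` parameters are used to set the document ID and commit immediately; the helper name and the sample content are hypothetical, and the request is not actually sent.

```python
import urllib.parse
import urllib.request


def build_extract_request(base_url, doc_id, content, content_type):
    """Build (but do not send) a POST to the /update/extract handler."""
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return urllib.request.Request(
        base_url + "/update/extract?" + params,
        data=content,
        headers={"Content-Type": content_type},
        method="POST",
    )


request = build_extract_request(
    "http://localhost:8983/solr/my_collection",  # assumed collection URL
    "doc1",
    b"<html><body><p>Hello Solr Cell</p></body></html>",
    "text/html",
)
print(request.full_url)
```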

3. DataImportHandler (key topic)

The Data Import Handler (DIH) provides a mechanism for importing content from a data store and
indexing it. In addition to relational databases, DIH can index content from HTTP based data sources such as
RSS and ATOM feeds, e-mail repositories, and structured XML where an XPath processor is used to generate
fields.

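Concepts and terminology

DataSource: the location of the data, specified as a URL.

Entity: an entity is processed to generate a set of documents, each containing multiple fields, which Solr then indexes. For a relational database, an entity is a view or a table, which is processed by one or more SQL statements to produce a set of rows (one row corresponds to one document) with one or more columns (one column corresponds to one field).

Processor: an entity processor extracts data from a data source, transforms it, and adds it to the index as documents. Custom entity processors can be written to extend or replace the supplied ones.

Transformer: a transformer can modify a field, create a new field, or generate multiple rows (and therefore documents) from a single row. Transformers are mainly used for tasks such as reformatting dates and stripping HTML. Custom transformers can be written using the publicly available interface.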
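The relationship between these concepts (rows become documents, columns become fields, transformers rewrite rows along the way) can be sketched in a few lines. This is a conceptual illustration only and does not use the DIH API; the function names, column names, and date format are all assumptions made for the example.

```python
from datetime import datetime


def date_transformer(row):
    """Illustrative transformer: rewrite a date column into ISO 8601 form."""
    changed = dict(row)
    parsed = datetime.strptime(row["LAST_MODIFIED"], "%d/%m/%Y")
    changed["LAST_MODIFIED"] = parsed.strftime("%Y-%m-%dT00:00:00Z")
    return changed


def rows_to_documents(rows, column_to_field, transformers=()):
    """Turn each row into one document, mapping each column to one field."""
    documents = []
    for row in rows:
        for transform in transformers:
            row = transform(row)
        documents.append({field: row[col] for col, field in column_to_field.items()})
    return documents


rows = [
    {"ID": 1, "NAME": "Widget", "LAST_MODIFIED": "02/02/2019"},
    {"ID": 2, "NAME": "Gadget", "LAST_MODIFIED": "15/01/2019"},
]
mapping = {"ID": "id", "NAME": "name", "LAST_MODIFIED": "last_modified"}
docs = rows_to_documents(rows, mapping, [date_transformer])
for doc in docs:
    print(doc)
```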

Configuration:

solrconfig.xml:

<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/path/to/my/DIHconfigfile.xml</str>
</lst>
</requestHandler>

--------------------------------------------------------------------------------------------

(example/example-DIH/solr/db/conf/db-data-config.xml).

<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver"
url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa" password="secret"/>

<document>
<entity name="item" query="select * from item"
deltaQuery="select id from item where last_modified >
'${dataimporter.last_index_time}'">
<field column="NAME" name="name" />

<entity name="feature"
query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID from FEATURE where last_modified >
'${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">
<field name="features" column="DESCRIPTION" />
</entity>
<entity name="item_category"
query="select CATEGORY_ID from item_category where
ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where
last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where
ID=${item_category.ITEM_ID}">
<entity name="category"
query="select DESCRIPTION from category where ID =
'${item_category.CATEGORY_ID}'"

deltaQuery="select ID from category where last_modified >
'${dataimporter.last_index_time}'"
parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category
where CATEGORY_ID=${category.ID}">
<field column="description" name="cat" />
</entity>
</entity>

</entity>
</document>
</dataConfig>
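The delta queries above rely on placeholders such as `${dataimporter.last_index_time}` being replaced with actual values before the SQL runs. The sketch below mimics that substitution with a regular expression so the effect is easy to see; it is an illustration under the assumption of simple textual replacement, not DIH's own resolver, and the timestamp is made up.

```python
import re


def resolve(template, variables):
    """Replace each ${name} placeholder with its value from `variables`."""
    def lookup(match):
        return str(variables[match.group(1)])
    return re.sub(r"\$\{([^}]+)\}", lookup, template)


delta_query = (
    "select id from item where last_modified > "
    "'${dataimporter.last_index_time}'"
)
# Hypothetical timestamp standing in for the stored last import time.
resolved = resolve(delta_query, {"dataimporter.last_index_time": "2019-02-01 12:00:00"})
print(resolved)
```

Child entities use the same mechanism to refer to the current parent row, as in `${item.ID}`.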

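a. Data Import Handler commands

DIH commands are sent to Solr as HTTP requests.

Common commands:

abort: stops an operation that is in progress.

delta-import: performs an incremental import with change detection. It supports the same clean, commit, optimize, and debug parameters as the full-import command. (Delta import is supported by the SqlEntityProcessor.)

full-import: the operation returns immediately and runs in a new thread, and the status attribute in the response shows busy while it is running. The start time of the operation is stored in conf/dataimport.properties so that later delta-import runs can use it.

reload-config: reloads the configuration file after it has been changed, without restarting Solr.

status: returns statistics on the number of documents created and deleted, queries run, rows fetched, the current status, and so on.

show-config: returns the current configuration.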
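Since every command is an HTTP request to the `/dataimport` handler configured earlier, the request URLs can be put together with a small helper. This is a sketch with a hypothetical function name and an assumed collection URL; it only builds the URLs and does not contact Solr.

```python
import urllib.parse


def dih_url(core_url, command, **params):
    """Build the URL for a Data Import Handler command with optional parameters."""
    query = {"command": command}
    query.update(params)
    return core_url + "/dataimport?" + urllib.parse.urlencode(query)


base = "http://localhost:8983/solr/my_collection"  # assumed collection URL
print(dih_url(base, "full-import", clean="true", commit="true"))
print(dih_url(base, "delta-import"))
print(dih_url(base, "status"))
```

Because full-import returns immediately, a typical client would issue it once and then poll the status URL until the handler is no longer busy.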

---------------------------------------------------------------

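Property Writer:

Data sources:

You can define a custom data source by extending org.apache.solr.handler.dataimport.DataSource.

Available data source types:

ContentStreamDataSource: used with an EntityProcessor that works with a DataSource<Reader>.

FieldReaderDataSource: for cases where a database field contains XML; the XML is processed with XPathEntityProcessor, using a configuration that combines the JDBC and FieldReader data sources.

FileDataSource: the same as URLDataSource, except that it fetches data from files on disk. Attributes: basePath, encoding.

JdbcDataSource: the default data source, used together with SqlEntityProcessor.

URLDataSource: used with XPathEntityProcessor.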

---------------------------------------------------------------------------------------

Entity Processors

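Attributes:

dataSource

name: required. Uniquely identifies the entity.

processor: defaults to SqlEntityProcessor; required when the data source is not a relational database.

onError: abort | skip | continue. Determines whether to abort the import, skip the current document, or ignore the error and continue.

preImportDeleteQuery: a query used to clean up the index before a full import.

postImportDeleteQuery: like preImportDeleteQuery, but executed after the import completes.

rootEntity

transformer: optional.

cacheImpl: for example SortedMapBackedCache

cacheKey

cacheLookup

where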

The SqlEntityProcessor

Attributes:

query: required.

deltaQuery

parentDeltaQuery

deletedPkQuery

deltaImportQuery

The XPathEntityProcessor

<dataConfig>
<dataSource type="HttpDataSource" />
<document>
<entity name="slashdot"
pk="link"
url="http://rss.slashdot.org/Slashdot/slashdot"
processor="XPathEntityProcessor"

forEach="/RDF/channel | /RDF/item"
transformer="DateFormatTransformer">
<field column="source" xpath="/RDF/channel/title" commonField="true" />
<field column="source-link" xpath="/RDF/channel/link" commonField="true"/>
<field column="subject" xpath="/RDF/channel/subject" commonField="true" />
<field column="title" xpath="/RDF/item/title" />
<field column="link" xpath="/RDF/item/link" />
<field column="description" xpath="/RDF/item/description" />
<field column="creator" xpath="/RDF/item/creator" />
<field column="item-subject" xpath="/RDF/item/subject" />
<field column="date" xpath="/RDF/item/date"
dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
<field column="slash-department" xpath="/RDF/item/department" />
<field column="slash-section" xpath="/RDF/item/section" />
<field column="slash-comments" xpath="/RDF/item/comments" />
</entity>
</document>
</dataConfig>
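To make the roles of `forEach`, `xpath`, and `commonField` more concrete, the sketch below walks a tiny hand-written feed and produces one record per item, copying the channel-level values into each record. It is a simplified illustration using Python's `xml.etree` and not the XPathEntityProcessor itself: the sample feed is invented, only a few fields are extracted, and no separate record is emitted for the channel element.

```python
import xml.etree.ElementTree as ET

# Invented sample data shaped like the feed in the configuration above.
SAMPLE_FEED = """
<RDF>
  <channel><title>Example Feed</title><link>http://example.com/</link></channel>
  <item><title>First story</title><link>http://example.com/1</link></item>
  <item><title>Second story</title><link>http://example.com/2</link></item>
</RDF>
"""


def extract_records(xml_text):
    """Emit one record per <item>, carrying channel values into each record."""
    root = ET.fromstring(xml_text.strip())
    common = {}   # plays the role of fields marked commonField="true"
    records = []
    for element in root:
        if element.tag == "channel":
            common["source"] = element.findtext("title")
            common["source-link"] = element.findtext("link")
        elif element.tag == "item":
            record = dict(common)
            record["title"] = element.findtext("title")
            record["link"] = element.findtext("link")
            records.append(record)
    return records


records = extract_records(SAMPLE_FEED)
for record in records:
    print(record)
```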

The MailEntityProcessor

The TikaEntityProcessor
