Oracle Fuzzy Queries (5.3 Understanding Full-Text Indexes: How Full-Text Indexing Works): A Study of Oracle Full-Text Search (Complete) [Main Text]

Reference: a Baidu Wenku document.

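1. Preparation

1.1 Check and configure database roles

First check whether the database contains the CTXSYS user and the CTXAPP role. If they are missing, the database was created without the interMedia text feature, and you must modify the database to install it. In a default installation the ctxsys account is locked, so it has to be enabled first.

By default the ctxsys account is locked and its password is expired, so log in to EM as sys and change the status and password of the ctxsys user.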
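The same check and unlock can also be done from SQL*Plus instead of EM. The following is a minimal sketch, assuming a session connected as sys with SYSDBA privileges; the password is only a placeholder.

```sql
-- Run as sys (SYSDBA).
-- Is the Oracle Text schema installed, and what state is the account in?
SELECT username, account_status FROM dba_users WHERE username = 'CTXSYS';
SELECT role FROM dba_roles WHERE role = 'CTXAPP';

-- Unlock the account and set a new password (placeholder value).
ALTER USER ctxsys ACCOUNT UNLOCK;
ALTER USER ctxsys IDENTIFIED BY "ChangeMe_123";
```

If the first query returns no row, Oracle Text is not installed and unlocking is not the issue.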

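1.2 Granting privileges

The test user is the existing foo user, and the examples use the T_DOCNEWS table in that schema.

First log in as sys with DBA privileges and grant the resource and connect roles to foo: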

GRANT resource, connect  to foo;

Then log in as ctxsys and grant the Oracle Text privileges to foo:

GRANT ctxapp TO foo;

GRANT execute ON ctxsys.ctx_cls TO foo;

GRANT execute ON ctxsys.ctx_ddl TO foo;

GRANT execute ON ctxsys.ctx_doc TO foo;

GRANT execute ON ctxsys.ctx_output TO foo;

GRANT execute ON ctxsys.ctx_query TO foo;

GRANT execute ON ctxsys.ctx_report TO foo;

GRANT execute ON ctxsys.ctx_thes TO foo;

GRANT execute ON ctxsys.ctx_ulexer TO foo;

View the system's default Oracle Text preferences:

Select pre_name, pre_object from ctx_preferences;

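2. How Oracle Text indexes work

An Oracle Text index breaks the text into tokens. For example, www.taobao.com is converted into the tokens www, taobao and com.

Oracle 10g supports four index types: context, ctxcat, ctxrule and ctxxpath.

2.1 Context index

Oracle Text converts every word into a token. A context index is an inverted index: each token maps to the documents that contain it. For example, the word dog might have an entry such as:

DOG    DOC1 DOC3 DOC5

This means that dog occurs in doc1, doc3 and doc5. Once the index has been built (assume the table is mytable and the index is myindex), the system automatically creates the supporting objects DR$MYINDEX$I, DR$MYINDEX$K, DR$MYINDEX$R and DR$MYINDEX$X alongside MYTABLE. A context index is not synchronized automatically after DML; you have to synchronize it manually with ctx_ddl.sync_index.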
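To see what "not synchronized" means in practice, you can look at the pending queue, and you can also ask Oracle Text to synchronize at commit time. This is a sketch only: mytable, its text column and myindex are the illustrative names used above, not objects created elsewhere in this article.

```sql
-- Rows changed by DML but not yet synchronized into a context index
SELECT pnd_index_name, pnd_rowid, pnd_timestamp
  FROM ctx_user_pending;

-- Alternative to manual syncing: synchronize whenever a transaction commits.
CREATE INDEX myindex ON mytable(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('sync (on commit)');
```

Synchronizing on every commit keeps queries current but tends to fragment the index more quickly, so it is a trade-off rather than a free improvement.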

Example:

Create table docs (id number primary key, text varchar2(200));

Insert into docs values(1, '<html>california is a state in the us.</html>');

Insert into docs values(2, '<html>paris is a city in france.</html>');

Insert into docs values(3, '<html>france is in europe.</html>');

Commit;

/

-- Create the context index

Create index idx_docs on docs(text)

indextype is ctxsys.context parameters

('filter ctxsys.null_filter section group ctxsys.html_section_group');
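Once the index exists, the tokens described earlier can be inspected directly. The sketch below assumes the index was created in the current schema, so that its token table is named dr$idx_docs$i.

```sql
-- One row per token (or token fragment); token_count is the number of
-- documents recorded in that row.
SELECT token_text, token_count
  FROM dr$idx_docs$i
 ORDER BY token_text;
```

This is useful for checking how the lexer split a piece of text and which words were dropped as stopwords.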

-- Query

Column text format a40;     -- truncate the displayed string to 40 characters

Select id, text from docs where contains(text, 'france') > 0;

id text

---------- -------------------------------

3 <html>france is in europe.</html>

2 <html>paris is a city in france.</html>

-- Insert more rows

Insert into docs values(4, '<html>los angeles is a city in california.</html>');

Insert into docs values(5, '<html>mexico city is big.</html>');

commit;

Select id, text from docs where contains(text, 'city') > 0; -- the newly inserted rows are not found

id text

--------------------------------------------

2 <html>paris is a city in france.</html>

-- Synchronize the index

begin

ctx_ddl.sync_index('idx_docs', '2m');  -- synchronize using 2M of memory

end;

/

-- Query

Column text format a50;

Select id, text from docs where contains(text, 'city') > 0; -- the new rows are now found

id text

-----------------------------------------------

5 <html>mexico city is big.</html>

4 <html>los angeles is a city in california.</html>

2 <html>paris is a city in france.</html>

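-- OR operator

Select id, text from docs where contains(text, 'city or state') > 0;

-- AND operator

Select id, text from docs where contains(text, 'city and state') > 0;

-- Note: in CONTAINS, adjacent words without an operator form a phrase query, so this searches for the phrase "city state" rather than city AND state

Select id, text from docs where contains(text, 'city state') > 0;

-- SCORE returns the relevance score; the higher the value, the better the match

SELECT SCORE(1), id, text FROM docs WHERE CONTAINS(text, 'oracle', 1) > 0;

A context index is not synchronized automatically, so it has to be synchronized manually after DML. The query operator that goes with a context index is CONTAINS.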
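Since SCORE is mostly used for ranking, a typical query sorts by it. A small sketch against the docs table from this section; the label 1 only ties SCORE(1) to the matching CONTAINS call.

```sql
SELECT SCORE(1) AS relevance, id, text
  FROM docs
 WHERE CONTAINS(text, 'city OR state', 1) > 0
 ORDER BY SCORE(1) DESC;
```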

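2.2 Ctxcat index

Used for mixed queries over several columns.

With ctxcat you can use an index set to bundle the structured columns that are frequently queried together with the text column. For example, when searching for a product name you may also need to filter on production date, price, description and so on; you can add those columns to the index set. Oracle wraps these queries in the catsearch operator, which makes the full-text search more efficient. For transactions with tight real-time requirements, the fact that a context index is not synchronized automatically is clearly a problem, whereas a ctxcat index is synchronized automatically.

Example: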

Create table auction(Item_id number,Title varchar2(100),Category_id number,Price number,Bid_close date);

Insert into auction values(1, 'nikon camera', 1, 400, DATE '2002-10-24');

Insert into auction values(2, 'olympus camera', 1, 300, DATE '2002-10-25');

Insert into auction values(3, 'pentax camera', 1, 200, DATE '2002-10-26');

Insert into auction values(4, 'canon camera', 1, 250, DATE '2002-10-27');

Commit;

/

-- Decide on your query conditions (this is important)

--Determine that all queries search the title column for item descriptions

-- Create the index set

begin

ctx_ddl.create_index_set('auction_iset');

ctx_ddl.add_index('auction_iset','price'); /* sub-index a*/

end;

-- Create the index

Create index auction_titlex on auction(title) indextype is ctxsys.ctxcat

parameters ('index set auction_iset');

Column title format a40;

Select title, price from auction where catsearch(title, 'camera', 'order by price')> 0;

Title price

--------------- ----------

Pentax camera 200

Canon camera 250

Olympus camera 300

Nikon camera 400

Insert into auction values(5, 'aigo camera', 1, 10, DATE '2002-10-27');

Insert into auction values(6, 'len camera', 1, 23, DATE '2002-10-27');

commit;

/

-- Test whether the index was synchronized automatically

Select title, price from auction where catsearch(title, 'camera',

'price <= 100')>0;

Title price

--------------- ----------

aigo camera 10

len camera 23

Adding several sub-indexes to the index set:

begin

ctx_ddl.drop_index_set('auction_iset');

ctx_ddl.create_index_set('auction_iset');

ctx_ddl.add_index('auction_iset','price'); /* sub-index A */

ctx_ddl.add_index('auction_iset','price, bid_close'); /* sub-index B */

end;

drop index auction_titlex;

Create index auction_titlex on auction(title) indextype is ctxsys.ctxcat

parameters ('index set auction_iset');

SELECT * FROM auction WHERE CATSEARCH(title, 'camera','price = 200 order by bid_close')>0;

SELECT * FROM auction WHERE CATSEARCH(title, 'camera','order by price, bid_close')>0;

After any DML, a ctxcat index is synchronized automatically, with no manual step. The query operator that goes with a ctxcat index is catsearch.

Syntax:

Catsearch(

[schema.]column,

Text_query varchar2,

Structured_query varchar2)

Return number;

Example:

catsearch(text, 'dog', 'foo > 15')

catsearch(text, 'dog', 'bar = ''SMITH''')

catsearch(text, 'dog', 'foo between 1 and 15')

catsearch(text, 'dog', 'foo = 1 and abc = 123')
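The fragments above only show the operator call. As a complete statement against the auction table from this section, it might look like the sketch below; it assumes the auction_iset index set with the price sub-index is in place.

```sql
-- Text part: titles containing nikon or canon ("|" is OR in the CATSEARCH query grammar).
-- Structured part: filter and sort on price, which sub-index A covers.
SELECT item_id, title, price
  FROM auction
 WHERE CATSEARCH(title, 'nikon | canon',
                 'price BETWEEN 200 AND 450 ORDER BY price') > 0;
```

The structured clause can only reference columns that appear in a sub-index of the index set; other columns have to be filtered in the ordinary WHERE clause.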

2.3 Ctxrule index

The function of a classification application is to perform some action based on document content.

These actions can include assigning a category id to a document or sending the document to a user.

The result is classification of a document.

Example:

Create table queries (query_id number,query_string varchar2(80));

insert into queries values (1, 'oracle');

insert into queries values (2, 'larry or ellison');

insert into queries values (3, 'oracle and text');

insert into queries values (4, 'market share');

commit;

Create index queryx on queries(query_string) indextype is ctxsys.ctxrule;

Column query_string format a35;

Select query_id,query_string from queries

where matches(query_string,

'oracle announced that its market share in databases

increased over the last year.')>0;

query_id query_string

---------- -----------------------------------

1 oracle

4 market share

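Here the index is built on the stored queries, and MATCHES returns the queries that the given sentence satisfies.

2.4 Ctxxpath index

Create this index when you need to speed up existsNode() queries on an XMLType column.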

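3. The internal indexing pipeline

3.1 Datastore preference

The datastore stage retrieves the data from where it is stored (for example web pages, database large objects or the local file system) and passes it as a data stream to the next stage. The datastore types are Direct_datastore, Multi_column_datastore, Detail_datastore, File_datastore, Url_datastore, User_datastore and Nested_datastore.

3.1.1 Direct_datastore

Indexes data stored in the database, in a single column. It has no attributes.

Supported types: char, varchar, varchar2, blob, clob, bfile, or xmltype.

Example: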

Create table mytable(id number primary key, docs clob);

Insert into mytable values(111555,'this text will be indexed');

Insert into mytable values(111556,'this is a direct_datastore example');

Commit;

-- Create the index with the direct datastore

Create index myindex on mytable(docs)

indextype is ctxsys.context

parameters ('datastore ctxsys.default_datastore');

Select * from mytable where contains(docs, 'text') > 0;

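3.1.2 Multi_column_datastore

Use this when the data to be indexed is spread over several columns.

The column list is limited to 500 bytes.

Number and date columns are supported; they are converted to text before indexing.

Raw and blob columns are directly concatenated as binary data.

Long, long raw, nchar, nclob and nested table columns are not supported.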

Create table mytable1(id number primary key, doc1 varchar2(400),doc2 clob,doc3

clob);

Insert into mytable1 values(1,'this text will be indexed','following example creates amulti-column ','denotes that the bar column ');

Insert into mytable1 values(2,'this is a direct_datastore example','use this datastore when your text is stored in more than one column','the system concatenates the text columns');

Commit;

/

-- Create the multi_column_datastore preference

Begin

Ctx_ddl.create_preference('my_multi', 'multi_column_datastore');

Ctx_ddl.set_attribute('my_multi', 'columns', 'doc1, doc2, doc3');

End;

-- Create the index

Create index idx_mytable on mytable1(doc1) indextype is ctxsys.context

parameters('datastore my_multi');

Select * from mytable1 where contains(doc1,'direct datastore')>0;

Select * from mytable1 where contains(doc1,'example creates')>0;

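Note: for English text the search term has to be a word that is actually indexed. For example,

Select * from mytable1 where contains(doc1,' more than one column ')>0;

returns the second row, but searching for more on its own returns nothing, because more is not treated as a meaningful word in that sentence (it is most likely filtered out as a stopword).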

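-- Update only a column that is not the index column, then check whether the change is found

Update mytable1 set doc2='adladlhadad this datastore when your text is stored test' where id=2;

Begin

Ctx_ddl.sync_index('idx_mytable');

End;

/

Select * from mytable1 where contains(doc1,'adladlhadad')>0; -- no rows

Update mytable1 set doc1='this is a direct_datastore example' where id=2; -- update the index column

Begin

Ctx_ddl.sync_index('idx_mytable'); -- synchronize the index

End;

/

Select * from mytable1 where contains(doc1,'adladlhadad')>0; -- the change to doc2 is now found

A multi-column full-text index can be created on any one of the columns, but the column named in the query must be the column named when the index was created. Oracle only considers the indexed data changed when the index column itself is modified; if you modify only the other columns, synchronizing the index will not pick up the change.

In other words, whenever you update one of the other columns, also update the index column in the same statement, and then synchronize the index once to see the effect.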
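One way to follow this rule without changing any data is to assign the index column to itself in the same UPDATE. This is a sketch under the assumption that naming doc1 in the SET clause is enough for Oracle Text to queue the row, which is how this technique is normally described.

```sql
UPDATE mytable1
   SET doc2 = 'adladlhadad this datastore when your text is stored test',
       doc1 = doc1   -- same value, but marks the index column as updated
 WHERE id = 2;
COMMIT;

BEGIN
  ctx_ddl.sync_index('idx_mytable');
END;
/

SELECT id FROM mytable1 WHERE CONTAINS(doc1, 'adladlhadad') > 0;
```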

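3.1.3 Detail_datastore

Use this for master-detail tables (original: use the detail_datastore type for text stored directly in the database in detail tables, with the indexed text column located in the master table).

Because the text that is actually indexed lives in the detail table, it does not matter much which master-table column you create the index on; once chosen, however, that column must be the one named in the query.

The content of the indexed master-table column itself is not included in the index.

The DETAIL_DATASTORE attributes are binary, detail_table, detail_key, detail_lineno and detail_text.

Example:

-- Create the master table

create table my_master

(article_id number primary key,author varchar2(30),title varchar2(50),body varchar2(1));

-- Create the detail table

create table my_detail

(article_id number, seq number, text varchar2(4000),

constraint fr_id foreign key (ARTICLE_ID) references my_master (ARTICLE_ID));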

-- Sample data

insert into my_master values(1,'Tom','expert on and on',1);

insert into my_master values(2,'Tom','Expert Oracle Database Architecture',2);

commit;

insert into my_detail values(1,1,'Oracle will find the undo information for this transaction

either in the cached

undo segment blocks (most likely) or on disk ');

insert into my_detail values(1,2,'if they have been flushed (more likely for very large

transactions).');

insert into my_detail values(1,3,'LGWR is writing to a different device, then there is no

contention for

redo logs');

insert into my_detail values(2,1,'Many other databases treat the log files as');

insert into my_detail values(2,2,'For those systems, the act of rolling back can be

disastrous');

commit;

-- Create the detail datastore preference

begin

ctx_ddl.create_preference('my_detail_pref', 'DETAIL_DATASTORE');

ctx_ddl.set_attribute('my_detail_pref', 'binary', 'true');

ctx_ddl.set_attribute('my_detail_pref', 'detail_table', 'my_detail');

ctx_ddl.set_attribute('my_detail_pref', 'detail_key', 'article_id');

ctx_ddl.set_attribute('my_detail_pref', 'detail_lineno', 'seq');

ctx_ddl.set_attribute('my_detail_pref', 'detail_text', 'text');

end;

-- Create the index

CREATE INDEX myindex123 on my_master(body) indextype is ctxsys.context

parameters('datastore my_detail_pref');

select * from my_master where contains(body,'databases')>0;

-- Update only the detail table and check whether the change is found

update my_detail set text='undo is generated as a result of the DELETE, blocks are modified,

and redo is sent over to

the redo log buffer' where article_id=2 and seq=1;

begin

ctx_ddl.sync_index('myindex123','2m'); -- synchronize the index

end;

/

select * from my_master where contains(body,'result of the DELETE')>0; -- the update is not found

-- After updating the detail table, update the master table as well

update my_master set body=3 where body=2;

begin

ctx_ddl.sync_index('myindex123','2m');

end;

/

select * from my_master where contains(body,'result of the DELETE')>0; -- the row is found

If you update the indexed column in the detail table, you must also update the index column in the master table so that Oracle recognizes that the indexed data has changed (this can be done with a trigger).
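A trigger of the kind mentioned here could look like the following sketch. The trigger name is made up for illustration, and it covers inserts and updates only; deletes from my_detail would need a similar trigger that uses :OLD.article_id.

```sql
CREATE OR REPLACE TRIGGER my_detail_touch_master
AFTER INSERT OR UPDATE ON my_detail
FOR EACH ROW
BEGIN
  -- Assigning body to itself changes no data, but it counts as an update
  -- of the index column, so the master row is queued for the next sync.
  UPDATE my_master
     SET body = body
   WHERE article_id = :NEW.article_id;
END;
/
```

With the trigger in place, changing my_detail and then calling ctx_ddl.sync_index('myindex123') should be enough; the manual update of my_master is no longer needed.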

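3.1.4 File_datastore

Use this to index files on the local server (original: The FILE_DATASTORE type is used for text stored in files accessed through the local file system.)

Multiple paths: on Unix separate them with colons, as in path1:path2:pathn; on Windows separate them with semicolons.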

create table mytable3(id number primary key, docs varchar2(2000));

insert into mytable3 values(111555,'1.txt');

insert into mytable3 values(111556,'1.doc');

commit;

-- Create the file datastore preference

begin

ctx_ddl.create_preference('COMMON_DIR2','FILE_DATASTORE');

ctx_ddl.set_attribute('COMMON_DIR2','PATH','D:\search');

end;

-- Create the index

create index myindex3 on mytable3(docs) indextype is ctxsys.context parameters ('datastore COMMON_DIR2');

select * from mytable3 where contains(docs,'word')>0; -- query

-- So far tested with doc and txt files

3.1.5 Url_datastore

Use this to index content on the internet; the database only needs to store the corresponding URLs.

Example:

create table urls(id number primary key, docs varchar2(2000));

insert into urls values(111555,'http://context.us.oracle.com');

insert into urls values(111556,'http://www.sun.com');

insert into urls values(111557,'http://www.itpub.net');

insert into urls values(111558,'http://www.ixdba.com');

commit;

/

-- Create the url datastore preference

begin

ctx_ddl.create_preference('URL_PREF','URL_DATASTORE');

ctx_ddl.set_attribute('URL_PREF','Timeout','300');

end;

-- Create the index

create index datastores_text on urls (docs) indextype is ctxsys.context parameters

( 'Datastore URL_PREF' );

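select * from urls where contains(docs,'Aix')>0;

If a URL does not exist, Oracle does not raise an error; the query simply finds no data for it.

Oracle stores only the URL of the indexed document. If the document itself changes, you must tell Oracle that the indexed data has changed by modifying the index column (the URL column).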

3.1.6.User_datastore

Use the USER_DATASTORE type to define stored procedures that synthesize documents during

indexing. For example, a user procedure might synthesize author, date, and text columns into one

document to have the author and date information be part of the indexed text.
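The author/date/text case described above can be sketched as follows. The articles table, the articles_doc procedure and the articles_ds preference are hypothetical names introduced for this illustration; on releases before 10g the procedure has to be owned by ctxsys.

```sql
-- Hypothetical table: author, publication date and body are separate columns.
CREATE TABLE articles (
  id       NUMBER PRIMARY KEY,
  author   VARCHAR2(80),
  pub_date DATE,
  body     VARCHAR2(4000)
);

-- Oracle Text calls this procedure once per row and indexes what it writes to tlob.
CREATE OR REPLACE PROCEDURE articles_doc(rid IN ROWID, tlob IN OUT NOCOPY CLOB) IS
  v_doc VARCHAR2(32767);
BEGIN
  SELECT author || ' ' || TO_CHAR(pub_date, 'YYYY-MM-DD') || ' ' || body
    INTO v_doc
    FROM articles
   WHERE rowid = rid;
  DBMS_LOB.WRITEAPPEND(tlob, LENGTH(v_doc), v_doc);
END;
/

BEGIN
  ctx_ddl.create_preference('articles_ds', 'USER_DATASTORE');
  ctx_ddl.set_attribute('articles_ds', 'PROCEDURE', 'articles_doc');
  ctx_ddl.set_attribute('articles_ds', 'OUTPUT_TYPE', 'CLOB');
END;
/

CREATE INDEX articles_idx ON articles(body)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('datastore articles_ds');
```

As with the multi-column datastore, only changes to the index column (body here) cause a row to be re-indexed.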

3.1.7 Nested_datastore

Full-text indexing also supports data stored in nested tables.

3.1.8 Reference scripts

-- Create the index with the direct datastore

Create index myindex on mytable(docs)

indextype is ctxsys.context

parameters ('datastore ctxsys.default_datastore');

-- Create the multi_column_datastore

Begin

Ctx_ddl.create_preference('my_multi', 'multi_column_datastore');

Ctx_ddl.set_attribute('my_multi', 'columns', 'doc1, doc2, doc3');

End;

Create index idx_mytable on mytable1(doc1) indextype is ctxsys.context

parameters('datastore my_multi');

-- Create the file_datastore

begin

ctx_ddl.create_preference('COMMON_DIR','FILE_DATASTORE');

ctx_ddl.set_attribute('COMMON_DIR','PATH','/opt/tmp');

end;

create index myindex on mytable1(docs) indextype is ctxsys.context parameters ('datastore

COMMON_DIR');

-- Create the url_datastore

begin

ctx_ddl.create_preference('URL_PREF','URL_DATASTORE');

ctx_ddl.set_attribute('URL_PREF','Timeout','300');

end;

create index datastores_text on urls (docs) indextype is ctxsys.context parameters

( 'Datastore URL_PREF' );

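3.2 Filter preference

The filter converts data in various file formats into plain text. The other components of the indexing pipeline can only process plain text and do not understand formats such as Microsoft Word or Excel. The filter types are charset_filter, inso_filter (named auto_filter from Oracle 10g on), null_filter, mail_filter, user_filter and procedure_filter. (Filters convert document formats into text that the database can index.)

3.2.1 CHARSET_FILTER

Converts documents from a non-database character set to the database character set (original: Use the CHARSET_FILTER to convert documents from a non-database character set to the character set used by the database)

Example: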

create table hdocs ( id number primary key, fmt varchar2(10), cset varchar2(20),

text varchar2(80)

);

begin

ctx_ddl.create_preference('cs_filter', 'CHARSET_FILTER');

ctx_ddl.set_attribute('cs_filter', 'charset', 'UTF8');

end;

/

insert into hdocs values(1, 'text', 'WE8ISO8859P1', '/docs/iso.txt');

insert into hdocs values (2, 'text', 'UTF8', '/docs/utf8.txt');

commit;

create index hdocsx on hdocs(text) indextype is ctxsys.context

parameters ('datastore ctxsys.file_datastore

filter cs_filter

format column fmt

charset column cset');

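3.2.2 NULL_FILTER

The default; no filtering is performed.

Oracle does not recommend auto_filter for html, xml and plain text; it recommends using null_filter together with a section group type instead.

-- Create the index with the null filter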

create index myindex on docs(htmlfile) indextype is ctxsys.context

parameters('filter ctxsys.null_filter section group ctxsys.html_section_group');

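The default filter depends on the type of the indexed column and on the datastore type. For data stored in the database in varchar2, char and clob columns, Oracle automatically selects null_filter; if the datastore is set to file_datastore, Oracle selects auto_filter as the default.

3.2.3 AUTO_FILTER

A general-purpose filter suitable for most documents, including PDF and MS Word. The filter also automatically recognizes plain-text, HTML, XHTML, SGML and XML documents.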

Create table my_filter (id number, docs varchar2(1000));

Insert into my_filter values (1, 'Expert Oracle Database Architecture.pdf');

Insert into my_filter values (2, '1.txt');

Insert into my_filter values (3, '2.doc');

commit;

/

-- Create the file datastore preference

Begin

ctx_ddl.create_preference('test_filter', 'file_datastore');

ctx_ddl.set_attribute('test_filter', 'path', '/opt/tmp');

End;

-- Index error view

select * from CTX_USER_INDEX_ERRORS;

-- Create the index with the auto filter

Create index idx_m_filter on my_filter (docs) indextype is ctxsys.context

parameters ('datastore test_filter filter ctxsys.auto_filter');

select * from my_filter where contains(docs,'oracle')>0;

AUTO_FILTER recognizes most document formats automatically, but you can also specify the document type explicitly through a format column, whose values are text, binary or ignore. Documents marked binary go through auto_filter, documents marked text use null_filter, and documents marked ignore are not indexed.

create table hdocs (id number primary key,fmt varchar2(10),text varchar2(80));

insert into hdocs values(1, 'binary', '/docs/myword.doc');

insert into hdocs values (2, 'text', '/docs/index.html');

insert into hdocs values (3, 'ignore', '/docs/1.txt');

commit;

create index hdocsx on hdocs(text) indextype is ctxsys.context

parameters ('datastore ctxsys.file_datastore filter ctxsys.auto_filter format column

fmt');

3.2.4 MAIL_FILTER

Use mail_filter to convert RFC-822 and RFC-2045 messages into indexable text.

Restrictions:

The document must be US-ASCII.

Lines must not be longer than 1024 bytes.

The document must be syntactically valid with regard to RFC-822.

3.2.5 USER_FILTER

Use the USER_FILTER type to specify an external filter for filtering documents in a column

3.2.6 PROCEDURE_FILTER

Use the PROCEDURE_FILTER type to filter your documents with a stored procedure. The stored procedure is called

each time a document needs to be filtered.

3.2.7 Reference scripts

-- Create the index with the null filter

create index myindex on docs(htmlfile) indextype is ctxsys.context

parameters('filter ctxsys.null_filter section group ctxsys.html_section_group');

-- Create the index with the auto filter

Create index idx_m_filter on my_filter (docs) indextype is ctxsys.context

parameters ('datastore test_filter filter ctxsys.auto_filter');

Filter errors are recorded in the CTX_USER_INDEX_ERRORS view.
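When a file cannot be filtered, index creation still succeeds and the failing rows are simply left out, so it is worth checking this view after every build or sync. A short sketch:

```sql
-- err_textkey holds the rowid of the document that failed
SELECT err_index_name, err_timestamp, err_textkey, err_text
  FROM ctx_user_index_errors
 ORDER BY err_timestamp DESC;
```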

3.3 Lexer preference

The lexer preference of Oracle full-text search handles the different languages. Plain English uses basic_lexer, while Chinese can use chinese_vgram_lexer or chinese_lexer.

3.3.1 Basic_lexer

basic_lexer supports whitespace-delimited languages such as English, German, Dutch, Norwegian and Swedish (original: Use the BASIC_LEXER type to identify tokens for creating Text indexes for English and all other supported whitespace-delimited languages.)

Create table my_lex (id number, docs varchar2(1000));

Insert into my_lex values (1, 'this is a example for the basic_lexer');

Insert into my_lex values (2, 'he following example sets Printjoin characters ');

Insert into my_lex values (3, 'To create the INDEX with no_theme indexing and with printjoins characters');

Insert into my_lex values (4, '中華人民共和國');

Insert into my_lex values (5, '中國淘寶軟體');

Insert into my_lex values (6, '測試basic_lexer 是否支援中文');

Commit;

/

-- Create the basic_lexer preference

begin

ctx_ddl.create_preference('mylex', 'BASIC_LEXER');

ctx_ddl.set_attribute ('mylex', 'printjoins', '_-'); -- keep the _ and - characters

ctx_ddl.set_attribute ( 'mylex', 'index_themes', 'NO');

ctx_ddl.set_attribute ( 'mylex', 'index_text', 'YES');

ctx_ddl.set_attribute ('mylex','mixed_case','yes'); -- case-sensitive

end;

create index indx_m_lex on my_lex(docs) indextype is ctxsys.context parameters('lexer

mylex');

Select id from my_lex where contains(docs, 'no_theme') > 0;

select docs from my_lex where contains(docs,'中國')>0;

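3.3.2 Multi_lexer

Supports documents in several languages; for example, you can use this lexer for a column that holds English, German and Japanese documents (original: Use MULTI_LEXER to index text columns that contain documents of different languages. For example, you can use this lexer to index a text column that stores English, German, and Japanese documents.) Create an index with a multi_lexer preference and name a language column that identifies the language of each row. Oracle matches the content of the language column against the language identifiers given in add_sub_lexer; if one matches, that sub-lexer is used for the row, and if none matches, the default sub-lexer is used. Note that the client's nls_language setting may affect which lexer is chosen.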

Select * from v$nls_parameters where parameter = 'NLS_LANGUAGE';

alter session set nls_language='simplified chinese';

alter session set nls_language='american';

Example:

create table globaldoc ( doc_id number primary key,lang varchar2(3),text clob);

-- Create the multi_lexer preference

begin

ctx_ddl.create_preference('english_lexer','basic_lexer');

ctx_ddl.set_attribute('english_lexer','index_themes','yes');

ctx_ddl.set_attribute('english_lexer','theme_language','english');

ctx_ddl.create_preference('german_lexer','basic_lexer');

ctx_ddl.set_attribute('german_lexer','composite','german');

ctx_ddl.set_attribute('german_lexer','mixed_case','yes');

ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');

ctx_ddl.create_preference('japanese_lexer','japanese_vgram_lexer');

ctx_ddl.create_preference('global_lexer', 'multi_lexer');

ctx_ddl.add_sub_lexer('global_lexer','default','english_lexer');

ctx_ddl.add_sub_lexer('global_lexer','german','german_lexer','ger');

ctx_ddl.add_sub_lexer('global_lexer','japanese','japanese_lexer','jpn');

end;

create index globalx on globaldoc(text) indextype is ctxsys.context

parameters ('lexer global_lexer language column lang');

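3.3.3 chinese_vgram_lexer and chinese_lexer

basic_lexer only recognizes tokens delimited by spaces, punctuation and line breaks, so to index Chinese content you must use chinese_vgram_lexer or chinese_lexer.

Compared with chinese_vgram_lexer, chinese_lexer has the following advantages:

It generates a smaller index.

It gives better query response time.

It generates tokens that are closer to real words, which improves query precision.

It supports stopwords.

Because chinese_lexer uses a different algorithm to generate tokens, building the index takes longer than with chinese_vgram_lexer.

Character sets: al32utf8, zhs16cgb231280, zhs16gbk, zhs32gb18030, zht32euc, zht16big5, zht32tris, zht16mswin950, zht16hkscs and utf8 are supported.

-- Create the Chinese lexer preferences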

Begin

ctx_ddl.create_preference('my_chinese_vgram_lexer', 'chinese_vgram_lexer');

ctx_ddl.create_preference('my_chinese_lexer', 'chinese_lexer');

End;

-- chinese_vgram_lexer

Create index ind_m_lex1 on my_lex(docs) indextype is ctxsys.context Parameters ('lexer foo.my_chinese_vgram_lexer');

Select * from my_lex t where contains(docs, '中國') > 0;

-- chinese_lexer

drop   index ind_m_lex1 force; 

Create index ind_m_lex2 on my_lex(docs) indextype is ctxsys.context

Parameters ('lexer foo.my_chinese_lexer');

Select * from my_lex t where contains(docs, '中國') > 0;

3.3.4 User_lexer

Use USER_LEXER to plug in your own language-specific lexing solution. This enables you to

define lexers for languages that are not supported by Oracle Text. It also enables you to define a

new lexer for a language that is supported but whose lexer is inappropriate for your application.

3.3.5 Default_lexer

If Chinese was specified as the language when the database was created, default_lexer is chinese_vgram_lexer; if it was English, default_lexer is basic_lexer.
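Which lexer type the system default actually maps to on a given database can be read from the preference dictionary. A minimal sketch, run as a user with the ctxapp role:

```sql
-- pre_object shows the underlying type, e.g. BASIC_LEXER or CHINESE_VGRAM_LEXER
SELECT pre_owner, pre_name, pre_object
  FROM ctx_preferences
 WHERE pre_name = 'DEFAULT_LEXER';
```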

3.3.6 Query_procedure

This callback stored procedure is called by Oracle Text as needed to tokenize words in the query.

A space-delimited group of characters (excluding the query operators) in the query will be

identified by Oracle Text as a word.

3.3.7 Reference scripts

-- Create the basic_lexer preference

begin

ctx_ddl.create_preference('mylex', 'BASIC_LEXER');

ctx_ddl.set_attribute ('mylex', 'printjoins', '_-'); -- keep the _ and - characters

ctx_ddl.set_attribute ('mylex','mixed_case','yes'); -- case-sensitive

end;

create index indx_m_lex on my_lex(docs) indextype is ctxsys.context parameters('lexer

mylex');

-- Create chinese_vgram_lexer or chinese_lexer

Begin

ctx_ddl.create_preference('my_chinese_vgram_lexer', 'chinese_vgram_lexer');

ctx_ddl.create_preference('my_chinese_lexer', 'chinese_lexer');

End;

-- chinese_vgram_lexer

Create index ind_m_lex1 on my_lex(docs) indextype is ctxsys.context

Parameters ('lexer foo.my_chinese_vgram_lexer');

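3.4 Section Group preference

Section groups support queries on documents with internal structure (such as html and xml documents). You can restrict a query to one part of the document, for example to the title in head. Besides the content that is displayed, html, xml and similarly structured documents contain a large number of markup tags that control the structure, and you usually do not want those tags indexed; handling them is one of the main functions of a section group (original: In order to issue WITHIN queries on document sections, you must create a section group before you define your sections)

3.4.1 Null_section_group

The system default; no section processing is performed.

Example: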

Create table my_sec (id number, docs varchar2(100));

Insert into my_sec values (1, 'a simple section group, test null_section_group attribute.');

Insert into my_sec values (2, 'this record one, can be query in nornal');

Insert into my_sec values (4, 'this record

are tested for

the query in paragraph');

Commit;

/

-- Use the null_section_group

Create index ind_m_sec on my_sec(docs) indextype is ctxsys.context

parameters ('section group ctxsys.null_section_group');

Select * from my_sec where contains(docs, 'record and query') > 0;

-- The sentence or paragraph special section must be defined first, otherwise the query fails

Select * from my_sec where contains(docs, '(record and query) within sentence') > 0;

Begin

ctx_ddl.create_section_group('test_null', 'null_section_group');

ctx_ddl.add_special_section('test_null', 'sentence');

ctx_ddl.add_special_section('test_null', 'paragraph');

End;

drop index ind_m_sec;

Create index ind_m_sec on my_sec(docs) indextype is ctxsys.context

parameters ('section group test_null');

Select * from my_sec where contains(docs, '(record and query) within sentence') > 0;

Select * from my_sec where contains(docs, '(record and query) within paragraph') > 0;

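3.4.2 Basic_section_group

basic_section_group is the most basic type that actually supports section searching, but it only supports documents whose sections start with <tag> and end with </tag>.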

Create table my_sec1 (id number, docs varchar2(1000));

Insert into my_sec1 values (1, '<heading>title</heading>

<context>this is the contents of the example.

Use this example to test the basic_section_group.</context>');

Insert into my_sec1 values (2, '<heading>example</heading>

<context>this line incluing the word title too.</context>');

Commit;

/

Create index ind_my_sec1 on my_sec1(docs) indextype is ctxsys.context;

Select * from my_sec1 where contains (docs, 'heading') > 0;

-- Define the basic_section_group

Begin

Ctx_ddl.create_section_group('test_basic', 'basic_section_group');

End;

drop index ind_my_sec1;

Create index ind_my_sec1 on my_sec1(docs) indextype is ctxsys.context

parameters ('section group test_basic');

Select * from my_sec1 where contains (docs, 'heading') > 0;

Select * from my_sec1 where contains (docs, 'context') > 0;

Select * from my_sec1 where contains (docs, 'use') > 0;

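The other main function of section searching is to restrict the scope of a query. The documents above have two parts, a title and a body; the title uses the <heading> tag and the body uses the <context> tag. We can add a zone section to the basic_section_group so that a query runs only within one part of the document.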

Drop index ind_my_sec1;

Begin

ctx_ddl.add_zone_section('test_basic', 'head', 'heading');

End;

Create index ind_my_sec1 on my_sec1(docs) indextype is ctxsys.context

parameters ('section group test_basic');

Select * from my_sec1 where contains (docs, 'title') > 0;

-- Search within head

Select * from my_sec1 where contains (docs, 'title within head') > 0;

3.4.3 Html_section_group

Html documents are often written in non-standard ways, so Oracle recommends html_section_group for better recognition.

-- Define the html_section_group

begin

ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');

end;

create index myindex on docs(htmlfile) indextype is ctxsys.context

parameters('filter ctxsys.null_filter section group htmgroup');

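For both field sections and zone sections, the tag that identifies the section is case-sensitive; its case must match the original document.

3.4.4 Xml_section_group

Xml documents are stricter and more regular in format than html documents, which is why xml_section_group offers more functionality than html_section_group.

Example: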

Create table my_sec2 (id number, docs varchar2(1000));

Insert into my_sec2 values (1, 'context.xml');

commit;

/

-- Define the xml_section_group

Begin

ctx_ddl.create_preference('test_file', 'file_datastore');

ctx_ddl.set_attribute('test_file', 'path', '/opt/tmp');

ctx_ddl.create_section_group('test_html', 'html_section_group');

ctx_ddl.create_section_group('test_xml', 'xml_section_group');

End;

-- Add the attribute section before creating the index (the third argument has the form tag@attribute)

Begin

ctx_ddl.add_attr_section('test_xml', 'name', '[email protected]');

End;

/

Create index ind_t_docs on my_sec2 (docs) indextype is ctxsys.context

parameters('datastore test_file filter ctxsys.null_filter section group test_xml');

Select * from my_sec2 where contains (docs, 'complete within name') > 0;

3.4.5 Auto_section_group

An enhanced version of xml_section_group. With xml_section_group you have to add the sections you need yourself, whereas with auto_section_group Oracle adds the sections and attribute sections automatically.

3.4.6 Path_section_group

Very similar to auto_section_group. path_section_group adds the haspath and inpath operators, but it does not support add_stop_section.
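The two operators can be tried with a small well-formed XML sample. The table my_sec3, the index ind_my_sec3 and the section group test_path below are hypothetical names used only for this sketch.

```sql
CREATE TABLE my_sec3 (id NUMBER PRIMARY KEY, docs VARCHAR2(1000));
INSERT INTO my_sec3 VALUES (1, '<book><heading>title</heading><body>plain body text</body></book>');
INSERT INTO my_sec3 VALUES (2, '<book><heading>example</heading><body>the word title again</body></book>');
COMMIT;

BEGIN
  ctx_ddl.create_section_group('test_path', 'PATH_SECTION_GROUP');
END;
/

CREATE INDEX ind_my_sec3 ON my_sec3(docs) INDEXTYPE IS ctxsys.context
  PARAMETERS ('section group test_path');

-- Rows where "title" occurs inside /book/heading
SELECT id FROM my_sec3 WHERE CONTAINS(docs, 'title INPATH (/book/heading)') > 0;

-- Rows that contain a /book/body element at all
SELECT id FROM my_sec3 WHERE CONTAINS(docs, 'HASPATH (/book/body)') > 0;
```

No sections have to be declared: the path section group records the element structure while indexing, which is what makes path expressions possible at query time.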

3.4.7 Reference scripts

-- Create the index with the null_section_group

Create index ind_m_sec on my_sec(docs) indextype is ctxsys.context

parameters ('section group ctxsys.null_section_group');

-- Create the basic_section_group

Begin

Ctx_ddl.create_section_group('test_basic', 'basic_section_group');

End;

Begin

ctx_ddl.add_zone_section('test_basic', 'head', 'heading'); -- define the zone section used for section queries

End;

Create index ind_my_sec1 on my_sec1(docs) indextype is ctxsys.context

parameters ('section group test_basic');

-- Create the html_section_group

begin

ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');

end;

create index myindex on docs(htmlfile) indextype is ctxsys.context

parameters('filter ctxsys.null_filter section group htmgroup');

-- Create the xml_section_group

Begin

ctx_ddl.create_section_group('test_xml', 'xml_section_group');

End;

Create index ind_t_docs on my_sec2 (docs) indextype is ctxsys.context

parameters('filter ctxsys.null_filter section group test_xml');

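3.5 Storage preference

Oracle full-text search generates a series of auxiliary tables, named dr$ + index name + $ + a suffix that identifies the table's purpose. Because Oracle generates these tables automatically, there is normally no way to choose where they are stored. To let you specify the tablespace and storage parameters for the auxiliary tables of a text index (use the storage preference to specify tablespace and creation parameters for tables associated with a text index), Oracle provides a single storage type, basic_storage.

With a full-text index myindex created on the mytable1 table, the system automatically creates the supporting objects DR$MYINDEX$I, DR$MYINDEX$K, DR$MYINDEX$R and DR$MYINDEX$X alongside MYTABLE1.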
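The generated objects and their types can be listed from the data dictionary, which is an easy way to confirm that a storage preference took effect. A sketch, assuming the index is named myindex and lives in the current schema:

```sql
SELECT object_name, object_type
  FROM user_objects
 WHERE object_name LIKE 'DR$MYINDEX$%'
 ORDER BY object_type, object_name;

-- Tablespace actually used by each auxiliary table
SELECT table_name, tablespace_name
  FROM user_tables
 WHERE table_name LIKE 'DR$MYINDEX$%';
```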

Reference script

-- Create the basic storage preference

Begin

Ctx_ddl.create_preference('mystore', 'basic_storage'); -- create the storage preference

Ctx_ddl.set_attribute('mystore', -- set the attributes

'i_table_clause',

'tablespace foo storage (initial 1k)');

Ctx_ddl.set_attribute('mystore',

'k_table_clause',

'tablespace foo storage (initial 1k)');

Ctx_ddl.set_attribute('mystore',

'r_table_clause',

'tablespace users storage (initial 1k) lob

(data) store as (disable storage in row cache)');

Ctx_ddl.set_attribute('mystore',

'n_table_clause',

'tablespace foo storage (initial 1k)');

Ctx_ddl.set_attribute('mystore',

'i_index_clause',

'tablespace foo storage (initial 1k) compress 2');

Ctx_ddl.set_attribute('mystore',

'p_table_clause',

'tablespace foo storage (initial 1k)');

End;

-- Create the index

Create index indx_m_word on my_word(docs) indextype is ctxsys.context

parameters('storage mystore');

3.6 Wordlist preference

The wordlist preference of Oracle full-text search configures fuzzy matching and stemming; it also supports substring and prefix indexing. Oracle has only one wordlist type, basic_wordlist (original: Use the wordlist preference to enable the query options such as stemming, fuzzy matching for your language. You can also use the wordlist preference to enable substring and prefix indexing, which improves performance for wildcard queries with CONTAINS and CATSEARCH.)

3.6.1 Example:

Create table my_word (id number, docs varchar2(1000));

Insert into my_word values (1, 'Specify the stemmer used for word stemming in Text queries');

Insert into my_word values (2, 'Specify which fuzzy matching routines are used for the

column');

Insert into my_word values (3, 'Fuzzy matching is currently supported for English');

Insert into my_word values (4, 'Specify a default lower limit of fuzzy score. Specify a

number between 0 and 80');

Insert into my_word values (5, 'Specify TRUE for Oracle Text to create a substring index

matched.');

commit;

/

-- Create the wordlist preference

Begin

ctx_ddl.drop_preference('mywordlist');

ctx_ddl.create_preference('mywordlist', 'basic_wordlist');

ctx_ddl.set_attribute('mywordlist','fuzzy_match','english'); -- fuzzy matching, English

ctx_ddl.set_attribute('mywordlist','fuzzy_score','0'); -- minimum fuzzy score

ctx_ddl.set_attribute('mywordlist','fuzzy_numresults','5000');

ctx_ddl.set_attribute('mywordlist','substring_index','true'); -- left-truncated searches such as %to and %to%

ctx_ddl.set_attribute('mywordlist','stemmer','english'); -- stemming

ctx_ddl.set_attribute('mywordlist', 'prefix_index', 'true'); -- right-truncated searches such as to%

End;

Create index indx_m_word on my_word(docs) indextype is ctxsys.context

parameters('wordlist mywordlist');

-- Examples

Select docs from my_word where contains(docs,'$match')>0 ; -- stem query

Select docs from my_word where contains(docs,'MA%')>0; -- wildcard query
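The wordlist above also configures fuzzy matching, which the two queries do not exercise. A sketch of the fuzzy operator against the same my_word table; the misspelled term matchin is chosen only for illustration.

```sql
-- "?" is the fuzzy operator: it matches words spelled similarly to the term
Select docs from my_word where contains(docs, '?matchin') > 0;

-- Function form: fuzzy(term, score, numresults, weight)
Select docs from my_word where contains(docs, 'fuzzy(matchin, 60, 100, weight)') > 0;
```

The function form overrides the fuzzy_score and fuzzy_numresults defaults from the wordlist for a single query.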

3.6.2 Example from the documentation

create table quick( quick_id number primary key, text varchar(80) );

--- insert a row with 10 expansions for 'tire%'

insert into quick ( quick_id, text )

values ( 1, 'tire tirea tireb tirec tired tiree tiref tireg tireh tirei tirej');

commit;

/

begin

Ctx_Ddl.Create_Preference('wildcard_pref', 'BASIC_WORDLIST');

ctx_ddl.set_attribute('wildcard_pref', 'wildcard_maxterms', 100) ;

end;

/

create index wildcard_idx on quick(text) indextype is ctxsys.context

parameters ('Wordlist wildcard_pref') ;

select quick_id from quick where contains ( text, 'tire%' ) > 0;

drop index wildcard_idx ;

begin

Ctx_Ddl.Drop_Preference('wildcard_pref');

Ctx_Ddl.Create_Preference('wildcard_pref', 'BASIC_WORDLIST');

ctx_ddl.set_attribute('wildcard_pref', 'wildcard_maxterms', 5) ; -- limit the number of wildcard expansions; a query that exceeds it raises an error

end;

/

create index wildcard_idx on quick(text) indextype is ctxsys.context

parameters ('Wordlist wildcard_pref') ;

select quick_id from quick where contains ( text, 'tire%' ) > 0;

3.6.3 Reference scripts

-- Create the wordlist

begin

ctx_ddl.create_preference('mywordlist', 'BASIC_WORDLIST');

ctx_ddl.set_attribute('mywordlist','PREFIX_INDEX','TRUE'); -- set a wordlist attribute

end;

-- Drop the wordlist

begin

ctx_ddl.drop_preference('mywordlist');

end;

3.7 Stoplist preference

A stoplist lets you exclude common words such as is, a and this, since indexing them is of little use. By default the system uses the stoplist that corresponds to the database language (original: Stoplists identify the words in your language that are not to be indexed. In English, you can also identify stopthemes that are not to be indexed. By default, the system indexes text using the system-supplied stoplist that corresponds to your database language.). The languages for which Oracle Text supplies stoplists include English, French, German, Spanish, Chinese, Dutch, and Danish.

The types are basic_stoplist, empty_stoplist, default_stoplist and multi_stoplist.

3.7.1 Basic_stoplist

Creates a user-defined stoplist. The documentation says very little about stoplists, only a few lines.

Example:

Create table my_stop (id number, docs varchar2(1000));

Insert into my_stop values (1, 'Stoplists identify the words in