
SVD and LSI Tutorial (5): LSI Keyword Research and Co-Occurrence Theory


Dr. E. Garcia

Mi Islita.com

Email | Last Update: 01/07/07


Introduction

In Parts 1 and 2 of this tutorial we covered the Singular Value Decomposition (SVD) algorithm. In Parts 3 and 4 we explained through examples how SVD is used in Latent Semantic Indexing (LSI). We mentioned how the U, S and V matrices and their truncated counterparts adopt a meaning not found in plain SVD implementations.

First, we demonstrated that the rows of V (or columns of VT) hold document vector coordinates. Thus, any two documents can be compared by computing cosine similarities. This information can be used to group documents (clustering), to classify documents by topic (topic analysis), or to construct collections of similar documents (directories).

Second, we have shown that the diagonal of S represents dimensions. These dimensions are used to embed vectors representing documents, queries, and terms.

Third, we indicated that the rows of U hold term vector coordinates. Thus, any two terms can be compared by computing cosine similarities. With this information one should be able to conduct keyword research studies. Such studies could include the construction of a thesaurus, or the generation of lists of candidate terms to be used in web documents or in a keyword bidding program.
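
As a rough illustration of these three points, here is a minimal sketch (my own, not code from the original tutorial) that builds the term-document matrix for the three-document example used in Parts 3 and 4, truncates the SVD to two dimensions, and compares terms and documents via cosine similarities. It assumes Python with NumPy; the variable names (A, Uk, Sk, Vk) are mine.

```python
import numpy as np

terms = ["a", "arrived", "damaged", "delivery", "fire", "gold",
         "in", "of", "shipment", "silver", "truck"]

# Term-document matrix A (raw term counts, Term Count Model); columns are d1, d2, d3
A = np.array([
    [1, 1, 1],   # a
    [0, 1, 1],   # arrived
    [1, 0, 0],   # damaged
    [0, 1, 0],   # delivery
    [1, 0, 0],   # fire
    [1, 0, 1],   # gold
    [1, 1, 1],   # in
    [1, 1, 1],   # of
    [1, 0, 1],   # shipment
    [0, 2, 0],   # silver
    [0, 1, 1],   # truck
], dtype=float)

# Full SVD, then keep k = 2 dimensions (the Rank 2 approximation)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk = U[:, :k]          # rows of Uk hold term vector coordinates
Sk = np.diag(s[:k])    # the diagonal of S holds the dimensions (singular values)
Vk = Vt[:k, :].T       # rows of Vk hold document vector coordinates

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Term-term comparisons
g, sh, si = terms.index("gold"), terms.index("shipment"), terms.index("silver")
print(cosine(Uk[g], Uk[sh]))   # high: gold and shipment co-occur in d1 and d3
print(cosine(Uk[g], Uk[si]))   # lower: gold and silver never co-occur directly

# Document-document comparison using rows of Vk
print(cosine(Vk[1], Vk[2]))    # d2 vs. d3
```

Note that sign conventions differ between SVD implementations (see the BlueBit footnote at the end of this article), so individual coordinates may come out flipped, but the cosine similarities are unaffected.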

In this article we want to address the last point; i.e., how readers could use U or Uk for keyword research. Since such studies are intimately linked to word usage and co-occurrence, we want to explain the role of keyword co-occurrence in the term-term LSI matrix. In particular, we want to explain how co-occurrence affects LSI scores. In the process we want to debunk another SEO myth: the claim made by some SEO "experts" that in order to make documents "LSI-friendly", "-ready" or "-compliant" these must be stuffed with synonyms or related terms.

We assume that readers have assimilated the previous tutorials. If you haven't done so, please STOP AND READ THEM, since we will be reusing concepts and examples already discussed.

You might also find the following fast tracks useful:

These are designed to serve as quick references for readers of this series.

Revisiting the Uk Matrix

As mentioned before, rows of Uk hold term vector coordinates. Thus, keyword research can be conducted with LSI by computing term-term cosine similarities.

Luckily, in the example used in Part 4 we worked with three documents and ended up with a two-dimensional space, so we can visualize the vectors. These are shown in Figure 1. For more than three dimensions a visual representation is not possible. We have included the query vector (gold silver truck, in red) to simplify the discussion.

Figure 1. Revisiting the Uk Matrix.

* See the important footnote at the end of this article.

Note how some terms end up grouped in the reduced space. Some vectors end up completely superimposed and are not shown.

Now that we have grouped terms, these can be used for several purposes. For instance, they can be used in new documents, in ads, or to formulate new queries. Terms close to the query can also be used to expand or refine the query.

Note that none of these terms are synonyms. We have selected this example to debunk another SEO myth.

Another SEO Myth Debunked: Synonym Stuffing

Let's revisit Figure 1 and the original documents:

  • d1: Shipment of gold damaged in a fire.
  • d2: Delivery of silver arrived in a silver truck.
  • d3: Shipment of gold arrived in a truck.

When we look at Figure 1, the first SEO misconception that gets debunked is that LSI groups terms because they happen to maintain a synonymity association. Clearly this is not the case.

One could argue that gold is more related to silver than to shipment. After all, both can be used as adjectives (both are colors) or as nouns (both are metals). Why, then, do gold and shipment form a two-term cluster in this example? The term-document matrix (A) reveals why: they co-occur in d1 and d3, but not in d2.

Also note that the vectors associated with silver and delivery are superimposed. A shows these terms co-occurring in d2 and being mutually dependent; i.e., one occurs whenever the other occurs.

We can also identify a by-product, or direct consequence, of using a primitive weighting scheme like the Term Count Model: term repetition affects the length of vectors. In d2, silver occurs twice and delivery once. This explains why the lengths of these term vectors are in a 2:1 ratio.

But wait: there is more.

Damaged and fire end up clustered since they co-occur once in d1, but not in d2 and d3. Arrived and truck are clustered and co-occur once in d2 and d3, but not in d1. The stopwords a, in and of co-occur once in d1, d2 and d3 and are also clustered by the LSI algorithm. Certainly, these stopwords are not synonyms. In all these cases we are dealing with what is known as first-order co-occurrence.

The case of second-order co-occurrence, that is, two terms not co-occurring with each other while each co-occurring with a third term, is also clear. For instance, gold and silver do not co-occur in any of the three documents. However, each co-occurs with truck, as follows:

  1. in d3, gold and truck co-occur, but silver doesn't.
  2. in d2, silver and truck co-occur, but gold doesn't.

It is the presence of these first- and second-order co-occurrence paths that is at the heart of LSI and makes the technique work --not the fact that terms happen to maintain a synonymity or relatedness association. So, where does this "synonym myth" come from?
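
These first- and second-order paths can be counted directly from the term-document matrix. The following rough sketch (my own illustration, not from the original tutorial) reuses the A matrix and terms list from the earlier snippet:

```python
import numpy as np

B = (A > 0).astype(int)         # binary occurrence matrix (reusing A and terms from above)

# First-order co-occurrence: number of documents in which terms i and j appear together
C1 = B @ B.T
np.fill_diagonal(C1, 0)

g, si, tr = terms.index("gold"), terms.index("silver"), terms.index("truck")
print(C1[g, si])                # 0 -> gold and silver never co-occur directly
print(C1[g, tr], C1[si, tr])    # 1 1 -> each co-occurs with truck (in d3 and d2, respectively)

# Second-order co-occurrence: no direct co-occurrence, but a shared co-occurring neighbor exists
adj = (C1 > 0).astype(int)
two_step = adj @ adj            # number of two-step co-occurrence paths between terms
print(C1[g, si] == 0 and two_step[g, si] > 0)   # True -> gold and silver are second-order related
```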

In the early LSI papers the role of first- and higher-order co-occurrence patterns was mentioned, but not fully addressed. These papers overemphasized the role of LSI as a synonym discovery technique.

It so happens that synonyms are a special class of tokens that do not tend to occur together, but do tend to co-occur in similar contexts (with similar neighboring terms), which is precisely a higher-order co-occurrence phenomenon called second-order co-occurrence. The reverse is not necessarily true; not all terms with a second-order co-occurrence relationship are synonyms. Think of this in terms of the following analogy:

Dogs are four-legged animals, but not all four-legged animals are dogs.

It appears that search marketers looked at only one side of the issue and then arrived at a fallacious conclusion. This might explain why many of them tend to misquote outdated papers and even suggest that, in order to make documents "LSI-friendly", these should be stuffed with synonyms and related terms.

There is no such thing as "LSI-Friendly" documents

Heavy use of synonyms and related terms in copy has nothing to do with LSI.

In this DigitalPoint thread I explained that the use of synonyms and related terms is a common-sense practice one should follow to improve copy style, not something one should do because of LSI.

Some SEOs are giving the wrong advice by saying that one should use synonyms and related terms under the pretense, or mistaken thesis, that this will make a document "LSI-friendly". In fact, when one thinks it through, there is no such thing as making documents "LSI-friendly". This is another SEO myth.

The great thing about phenomena taking place at a global level, like co-occurrence and IDF (inverse document frequency), is that the chances for end users to manipulate them are close to nada, zero, zip, nothing.

In LSI, co-occurrence (especially second-order co-occurrence) is responsible for the LSI scores assigned to terms, not the nature of the terms or whether they happen to be synonyms or related terms. In the early LSI papers this was not fully addressed and emphasis was given to synonyms. Why?

Because the documents selected to conduct those experiments happened to contain synonyms and related terms. It was thought that somehow synonymity associations were responsible for the clustering phenomenon. The fact is that this was a direct result of the co-occurrence patterns present in the LSI matrix.

Two studies (2, 3) have explained the role of co-occurrence patterns in the LSI matrix, though they differ a bit in some of their findings. It seems that SEOs are still quoting the first LSI papers from the late eighties and early nineties, and in the process some have stretched that old research in order to better market whatever they sell.

When LSI is applied to a term-document matrix representing a collection of documents in the zillions, the co-occurrence phenomenon that affects the LSI scores becomes a global effect, occurring between documents across the collection.

Thus, the only way end users (e.g. SEOs) could influence the LSI scores is if they could access and control the content of all the documents in the matrix, or launch a coordinated spam attack on the entire collection. The latter would be the case of a spammer trying to make an LSI-based search engine index billions of documents (to name a quantity) that he or she has created.

If an end user or researcher wants to understand and manipulate the effect of co-occurrence in a single document, he or she would need to deconstruct that document, build a term-passage matrix for it, apply LSI to that matrix, and then experiment by manipulating single terms. Whatever the results, these will only be valid for the universe represented by the matrix, that is, for that and only that document.

If such a document is then submitted to an LSI-based search engine, that local effect simply vanishes and global co-occurrence "takes over", spreading throughout the collection and forming the corresponding connectivity paths that eventually force a redistribution of term weights.

Consequently, SEOs that sell this idea of making documents "LSI-friendly", "LSI-ready" or "LSI-compliant" (like the firms sending emails that read "is your site LSI optimized?" or "we can make your documents LSI-valid", or those promoting the notion of "LSI and Link Popularity") end up exposed for what they are and for how much they actually know about search engines. The sad thing is that these illusion sellers find their way, via search engine conferences (SES), blogs and forums, into deceiving the industry with such blogonomies. In the process they give a black eye to the rest of the ethical SEOs/SEMs before the IR community, reinforcing the widespread perception that search marketers are a bunch of spammers or unscrupulous salespeople. By the way, here are Two More LSI Blogonomies.

In the next sections we discuss this in more detail. In particular, we want to explain why some terms gain or lose weight, and how the first- and second-order co-occurrence paths present in the term-term LSI matrix spread throughout a collection.

Why does d2 score higher than d3?

Revisiting the original documents:

  • d2: Delivery of silver arrived in a silver truck.
  • d3: Shipment of gold arrived in a truck.

The query consists of three terms: gold silver truck.

Note that d2 and d3 each match two of these terms and miss one query term: d2 misses gold and d3 misses silver. Evidently,

  1. the term missed by d3 (silver) is repeated twice in d2, while the term missed by d2 (gold) occurs only once in d3.
  2. d2 mentions delivery, which co-occurs with silver and truck. Its vector also partially overlaps the silver vector. Note that the vectors for delivery, silver and truck are all close to the query vector. In the case of d3, this document mentions shipment, but this term co-occurs with gold and not with silver or truck. This explains why its vector is far away from the query vector.

This suggests that terms co-occurring with similar neighboring terms are responsible for the observed LSI scores. Whether these happen to be synonyms or not is not the determining factor.
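This ranking can be reproduced numerically by folding the query into the reduced space, as done in Part 4. Here is a minimal sketch of mine (it reuses terms, Uk, Sk and Vk from the first snippet; the variable names are hypothetical):

```python
import numpy as np

# Query vector for "gold silver truck" in the original term space
q = np.zeros(len(terms))
for w in ("gold", "silver", "truck"):
    q[terms.index(w)] = 1.0

# Fold the query into the reduced space: qk = q^T * Uk * inverse(Sk)
qk = q @ Uk @ np.linalg.inv(Sk)

# Cosine similarity between the folded query and each document vector (rows of Vk)
for i, dk in enumerate(Vk, start=1):
    sim = qk @ dk / (np.linalg.norm(qk) * np.linalg.norm(dk))
    print(f"sim(q, d{i}) = {sim:.4f}")
# Expected rank order: d2 > d3 > d1
```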

Why does d3 score higher than d1?

A similar reasoning can be used to compare d3 and d1. Revisiting the original documents:

  • d1: Shipment of gold damaged in a fire.
  • d3: Shipment of gold arrived in a truck.

We can see that d1 mentions damaged and fire. These terms co-occur with gold, but never with silver or truck. Note that their vectors are superimposed and far away from the query vector.

In the case of d3, this document mentions arrived and truck. Arrived co-occurs with silver, which is not explicitly present in the document but is part of the query. The document also mentions truck, which definitely is in the query. Arrived and truck also co-occur, and their vectors are closer to the query vector. It is then not surprising to find d3 scoring higher than d1.

Let's look now at delivery and shipment. It can be argued that delivery is more related to shipment than to silver. However, delivery and shipment do not co-occur, and their vectors end up on opposite sides of the query vector. Again, co-occurrence, and not the nature of the terms, is the determining factor.

A Quantitative Interpretation using Co-Occurrence

So far we have used co-occurrence arguments to provide a qualitative explanation for the observed LSI scores. Let's now reinforce our main arguments with a quantitative description of the problem.

In Figure 2 we have recomputed A as a Rank 2 Approximation.

Figure 2. Truncated Matrix for the Rank 2 Approximation.


Note that LSI has readjusted the term weights of matrix A, which are now either raised or lowered in the truncated matrix Ak. Let us underscore that this redistribution is not based on the nature of the terms, or on whether these happen to be synonyms or related terms, but on the type of co-occurrence between them.

To illustrate, let's take a new look at d2 and d3 using Figure 2.

  • d2: Delivery of silver arrived in a silver truck.
  • d3: Shipment of gold arrived in a truck.

The word silver does not appear in d3, but because d3 contains arrived and truck, and these co-occur with silver, silver's new weight in d3 is 0.3960. This is an example of second-order co-occurrence. By contrast, the value of 1 for arrived and for truck, each of which appeared once in d3, has been replaced by 0.5974, reflecting the fact that these terms co-occur with a word not present in d3. This represents a loss of contextuality.
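
To check these numbers yourself, the truncated matrix of Figure 2 can be rebuilt as Ak = Uk Sk VkT. Below is a sketch of mine reusing the matrices from the first snippet; the values in the comments are those quoted above, so expect small rounding differences:

```python
import numpy as np

Ak = Uk @ Sk @ Vk.T     # truncated (Rank 2) term-document matrix, as in Figure 2

si, ar, tr = terms.index("silver"), terms.index("arrived"), terms.index("truck")
d3 = 2                  # column index of document d3
print(round(Ak[si, d3], 4))   # ~0.3960: silver gains weight in d3 via second-order co-occurrence
print(round(Ak[ar, d3], 4))   # ~0.5974: arrived drops below its original count of 1
print(round(Ak[tr, d3], 4))   # ~0.5974: truck drops below its original count of 1
```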

A similar reasoning can be used with d1 and d3.

  • d1: Shipment of gold damaged in a fire.
  • d3: Shipment of gold arrived in a truck.

The words arrived and truck do not appear in d1, but they co-occur with the stopwords a, of and in in all three documents, so their weights in d1 are 0.3003.

This redistribution of term weights (addition and subtraction) occurring in the truncated LSI matrix is better understood with a side-by-side comparison of terms, as illustrated in Figure 3.

Figure 3. Redistribution of Weights.


Inspect this figure thoroughly. Any small change in a term or terms in any given document will provoke a redistribution of term weights across the entire collection. There is no way for end users to predict that redistribution in other documents of the collection. Since end users don't have access to or control over other documents of the collection, and don't know when or how someone (or how many people) across the entire collection (or the Web) will make changes to a given document, it would be impossible to predict the final redistribution of weights caused by the LSI algorithm, and the subsequent ranking, at any given point in time. And we still don't know the specific implementation of LSI used by search engines like Google or Yahoo!; e.g., how many dimensions were used to truncate the original matrix, which term weight scoring scheme was used to populate the initial term-document matrix, and so forth. Certainly no current search engine uses raw frequencies to assign weights to terms and then uses these to rank documents.

From the common sense side, this is why we say that there is no such thing as "LSI-friendly" or "LSI-compliant" documents. How could you predict the redistribution of term weights in the SVD matrix? Exactly.

Therefore, SEO firms claiming that they can make "LSI-friendly" documents, or that are selling "LSI Tools", "LSI Videos", "LSI link popularity" and other forms of "LSI-based" services, are deceiving the public and prospective clients. Stay away from their businesses and from whatever they claim in any SEO book, forum or blog, or in any search engine conference & expo or "advanced" marketing seminar.

Most of these folks don't even know how to apply SVD to a simple matrix and are just out to sell something or to promote their image as SEO "experts". Each time I meet an IR researcher or colleague from the academic world and we discuss SVD, they simply laugh out loud (LOL) at how these search engine marketers interpret or "explain" LSI and other IR algorithms.

Rant aside, in Figure 3 we have computed row and column totals and a grand total. These deviate from the expected totals. How should we interpret such deviations?

Well, the original term-document data was described by a matrix of rank 3 and embedded in a 3-dimensional space. When we applied the SVD algorithm we removed one dimension and obtained a Rank 2 Approximation, so the truncated data was embedded in a 2-dimensional space. We assumed that the dimension removed was noisy, so any fluctuation (increment or decrement) occurring in this dimension was taken for noise. For all practical purposes, the difference 22 - 21.9611 = 0.0389 can be taken as the net change caused by the SVD algorithm after removing the noise.
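
A quick way to check this net change (a sketch of mine, reusing the A and Ak matrices from the earlier snippets; the totals in the comments are those reported in Figure 3):

```python
print(A.sum())                        # 22.0: grand total of the original term counts
print(round(Ak.sum(), 4))             # ~21.9611: grand total of the truncated matrix
print(round(A.sum() - Ak.sum(), 4))   # ~0.0389: net change after removing one dimension
```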

Note that only four terms gain weight. These are:

  • arrived: 2.0305 - 2 = + 0.0305
  • gold: 2.0119 - 2 = + 0.0119
  • shipment: 2.0119 - 2 = + 0.0119
  • truck: 2.0305 - 2 = + 0.0305

for an accumulated gain of 0.0848 weight units. All these terms are mentioned in d3, yet the rank order was d2 > d3 > d1, for the reasons previously mentioned. This makes sense since ranks were assigned by comparing query-document cosine similarities, not net weight changes. Surprise: net weight gains did not necessarily make d3 more relevant than d2! Say "adios" to term manipulation efforts.
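
The per-term gains listed above can also be read off the row totals of Ak. A short sketch of mine, again reusing A, Ak and terms (the values in the comments are the gains listed above):

```python
row_change = Ak.sum(axis=1) - A.sum(axis=1)    # per-term weight change across the collection
for w in ("arrived", "gold", "shipment", "truck"):
    print(w, round(row_change[terms.index(w)], 4))   # ~ +0.0305, +0.0119, +0.0119, +0.0305
```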

Beyond Plain Co-Occurrence: Contextual Co-Occurrence

In this exercise we have shown that what accounts for the redistribution of term weights in the truncated LSI matrix is a co-occurrence phenomenon, not the nature of the terms. In particular, we have limited the discussion to co-occurrence of the first and second kind. The jury is still out as to whether higher orders (third, fourth, fifth, and so forth) play a significant role. At least two studies provide contradictory results (2, 3).

Let us stress that these studies, the above example and the results discussed herein were obtained under controlled conditions, free from ads and spam. Applying these results to Web collections is far more difficult. This is because Web documents, especially long documents, tend to discuss different topics and are full of vested interests and alliances of all kinds. Such documents can be full of ads, headlines, news feeds, etc.

Thus, the mere fact that two terms happen to be found in the same document is not evidence of similarity or relatedness, and a simplistic co-occurrence approach is not recommended. This is why the extraction of terms from commercial documents by means of "LSI-based" tools is a questionable practice.

It should be pointed out that the mere fact that two terms happen to be synonyms, or happen to co-occur in a document, is not evidence of contextuality. One must look at terms co-occurring within similar neighboring terms. This is an ongoing research area we are looking into.

Conclusion

This tutorial series presents introductory material on Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI). We have shown how SVD is used in LSI. In the process, several SEO myths have been debunked and a few fast-track tutorials have been provided.

During our journey toward a better understanding of LSI we searched for clues on what makes LSI work. We have shown that co-occurrence seems to be at the heart of LSI, especially when terms co-occur in the same context (with similar neighboring terms). At the time of writing, the role of higher order co-occurrences is still unclear. We are currently looking into a geometrical framework for understanding the redistribution of term weights.

* BlueBit Important Upgrade

Note 1: After this tutorial was written, BlueBit upgraded its SVD calculator and now gives the transposed matrix VT. We became aware of this today, 10/21/06. This BlueBit upgrade doesn't change the calculations, anyway. Just remember that if you are using VT and want to go back to V, simply switch rows for columns.

Note 2: BlueBit also now uses a different subroutine and a different sign convention, which flips the coordinates of the figures given above. Absolutely none of these changes affect the final calculations and main findings of the example given in this tutorial. Why?

Tutorial Review

  1. Rework Figure 3, but this time without truncating the term-document matrix. Explain any deviation in the observed totals.
  2. Why do some terms receive negative weights in the example given above?
  3. Also in the example given above, explain why some terms end up weighing more than in the original matrix.

References
