natural language processing blog: information retrieval

Due to a small off-the-radar project I'm working on right now, I've been building my own inverted indices. (Yes, I'm vaguely aware of discussions in DB/Web-search land about whether you should store your inverted indices in a database or hand-roll your own. This is tangential to the point of this post.)

For those of you who don't remember your IR 101, here's the deal with inverted indices. We're a search engine and want to be able to quickly find pages that contain query terms. One way of storing our set of documents (e.g., the web) is to store a list of documents, each of which is a list of the words appearing in that document. If there are N documents of length L, then answering a query is O(N*L), since we have to look over each document to see whether it contains the word we care about. The alternative is to store an inverted index, where we have a list of words and, for each word, the list of documents it appears in. Answering a query here is something like O(1) if we hash the words, O(log |V|) if we binary search (V = vocabulary), etc. Why it's called an inverted index is beyond me: it's really just like the index you find at the back of a textbook. And the computational difference is like trying to find mentions of "Germany" in a textbook by reading every page and looking for "Germany", versus going to the index in the back of the book.
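To make the contrast concrete, here's a minimal sketch of both layouts (my own toy example, nothing to do with the actual project); the documents, words, and query are all made up.

```python
from collections import defaultdict

# Forward index: doc id -> list of words in that document (toy data).
docs = {
    0: ["germany", "exports", "cars"],
    1: ["france", "wine", "exports"],
    2: ["germany", "beer"],
}

# Inverted index: word -> list of doc ids containing it.
index = defaultdict(list)
for doc_id, words in docs.items():
    for w in set(words):
        index[w].append(doc_id)

# Forward-index query: O(N*L), scan every document.
hits_forward = [d for d, words in docs.items() if "germany" in words]

# Inverted-index query: one hash lookup, then read off the posting list.
hits_inverted = index["germany"]

print(sorted(hits_forward), sorted(hits_inverted))  # [0, 2] [0, 2]
```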
Now, let's say we have an inverted index for, say, the web. It's pretty big (and, in all honesty, probably distributed across multiple storage devices or multiple databases or whatever). But regardless, a linear scan of the index would give you something like: here's word 1 and here are the documents it appears in; here's word 2 and its documents; ...; here's word |V| and its documents.

Suppose that, outside of the index, we have a classification task over the documents on the web. That is, for any document, we can (efficiently -- say O(1) or O(log N)) get the "label" of this document: either +1, -1, or ? (where ? means unknown, i.e., unlabeled).

My argument is that this is a very plausible setup for a very large-scale problem.
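As a stand-in for that setup, here's a tiny sketch: `get_label` is a hypothetical oracle and the toy labels are made up, but the point is that the learner only gets to touch the word-to-posting-list index plus this cheap per-document label lookup.

```python
from typing import Optional

# Hypothetical label oracle: cheap (say O(1)) lookup of a document's label.
# Returns +1, -1, or None for unlabeled documents; the toy labels are made up.
TOY_LABELS = {0: +1, 1: -1}          # doc 2 is unlabeled

def get_label(doc_id: int) -> Optional[int]:
    return TOY_LABELS.get(doc_id)
```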
Now, if we're trying to solve this problem, doing a "common" optimization like stochastic (sub)gradient descent is just not going to work, because it would require us to iterate over documents rather than over words (where I'm assuming words == features, for now...). That would be ridiculously expensive.

The alternative is to do some sort of coordinate ascent algorithm, where each update touches a single feature -- which, under this layout, means reading a single posting list (a rough sketch of what that might look like is at the end of this post). These actually used to be quite popular in maxent land, and, in particular, Joshua Goodman had a coordinate ascent algorithm for maxent models that apparently worked quite well. (In searching for that paper, I just came across a 2009 paper on roughly the same topic that I hadn't seen before.) Some other algorithms have a coordinate ascent feel, for instance the LASSO (and relatives, including the Dantzig selector + LASSO = DASSO), but they wouldn't really scale well in this problem because they require a full pass over the entire index to make a single update. Other approaches, such as boosting, would fare very poorly in this setting.

This observation first led me to wonder whether we can do something LASSO- or boosting-like in this setting. But then it made me wonder whether this is a special case, or whether there are other cases in the "real world" where your data is naturally laid out as features × data points rather than data points × features. Sadly, I cannot think of any. But perhaps that's not because there aren't any.

(Note that I also didn't really talk about how to do semi-supervised learning in this setting... this is also quite unclear to me right now!)
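Here is the promised rough sketch, under a pile of assumptions: it reuses the toy `index` and `get_label` from the sketches above, treats features as binary, and, for brevity, does coordinate descent on a ridge-penalized squared loss rather than coordinate ascent on a maxent likelihood (so it is not Goodman's algorithm, just the same access pattern). The one property I care about is that a single update reads a single word's posting list, plus a cached per-document score.

```python
from collections import defaultdict

# Reuses `index` (word -> doc ids) and `get_label` from the sketches above.
weights = defaultdict(float)   # word -> weight
score = defaultdict(float)     # cached score per doc: sum of weights of its words
lam = 0.1                      # ridge penalty (arbitrary)

def update_feature(word):
    """One coordinate update: only this word's posting list is touched."""
    labeled = [d for d in index[word] if get_label(d) is not None]
    if not labeled:
        return
    # Exact coordinate minimizer of sum_d (score_d - y_d)^2 + lam * w^2,
    # using the fact that the feature is 1 on every doc in the posting list.
    residual = sum(get_label(d) - (score[d] - weights[word]) for d in labeled)
    new_w = residual / (len(labeled) + lam)
    delta = new_w - weights[word]
    weights[word] = new_w
    for d in index[word]:      # keep the cached scores consistent
        score[d] += delta

for _ in range(20):            # a few sweeps over the vocabulary
    for word in index:
        update_feature(word)

print({w: round(v, 3) for w, v in weights.items()})
```

The cached `score` is the one piece of extra state: changing one weight only invalidates the scores of the documents on that word's posting list, so each coordinate step stays local to the single index entry it reads.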