natural language processing blog: information retrieval

Due to a small off-the-radar project I'm working on right now, I've been building my own inverted indices. (Yes, I'm vaguely aware of discussions in DB/Web-search land about whether you should store your inverted indices in a database or hand-roll your own. This is tangential to the point of this post.)

For those of you who don't remember your IR 101, here's the deal with inverted indices. We're a search engine and want to be able to quickly find pages that contain query terms. One way of storing our set of documents (e.g., the web) is to store a list of documents, each of which is a list of the words appearing in that document. If there are N documents of length L, then answering a query is O(N*L), since we have to look over each document to see whether it contains the word we care about. The alternative is to store an inverted index, where we have a list of words and, for each word, the list of documents it appears in. Answering a query here is something like O(1) if we hash the words, O(log |V|) if we binary search (V = vocabulary), etc. Why it's called an inverted index is beyond me: it's really just like the index you find at the back of a textbook. And the computational difference is like trying to find mentions of "Germany" in a textbook by reading every page and looking for "Germany", versus going to the index in the back of the book.
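To make the contrast concrete, here's a minimal sketch of both layouts (my own toy example, nothing to do with the actual project); the documents, words, and query are all made up.

```python
from collections import defaultdict

# Forward index: doc id -> list of words in that document (toy data).
docs = {
    0: ["germany", "exports", "cars"],
    1: ["france", "wine", "exports"],
    2: ["germany", "beer"],
}

# Inverted index: word -> list of doc ids containing it.
index = defaultdict(list)
for doc_id, words in docs.items():
    for w in set(words):
        index[w].append(doc_id)

# Forward-index query: O(N*L), scan every document.
hits_forward = [d for d, words in docs.items() if "germany" in words]

# Inverted-index query: one hash lookup, then read off the posting list.
hits_inverted = index["germany"]

print(sorted(hits_forward), sorted(hits_inverted))  # [0, 2] [0, 2]
```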
Now, let's say we have an inverted index for, say, the web. It's pretty big (and, in all honesty, probably distributed across multiple storage devices or multiple databases or whatever). But regardless, a linear scan of the index would give you something like: here's word 1 and here are the documents it appears in; here's word 2 and its documents; ...; here's word |V| and its documents.

Suppose that, outside of the index, we have a classification task over the documents on the web. That is, for any document, we can (efficiently -- say O(1) or O(log N)) get the "label" of this document: either +1, -1, or ? (where ? means unknown, i.e., unlabeled).

My argument is that this is a very plausible setup for a very large-scale problem.
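As a stand-in for that setup, here's a tiny sketch: `get_label` is a hypothetical oracle and the toy labels are made up, but the point is that the learner only gets to touch the word-to-posting-list index plus this cheap per-document label lookup.

```python
from typing import Optional

# Hypothetical label oracle: cheap (say O(1)) lookup of a document's label.
# Returns +1, -1, or None for unlabeled documents; the toy labels are made up.
TOY_LABELS = {0: +1, 1: -1}          # doc 2 is unlabeled

def get_label(doc_id: int) -> Optional[int]:
    return TOY_LABELS.get(doc_id)
```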
Now, if we're trying to solve this problem, doing a "common" optimization like stochastic (sub)gradient descent is just not going to work, because it would require us to iterate over documents rather than over words (where I'm assuming words == features, for now...). That would be ridiculously expensive.

The alternative is to do some sort of coordinate ascent algorithm, where each update touches a single feature -- which, under this layout, means reading a single posting list (a rough sketch of what that might look like is at the end of this post). These actually used to be quite popular in maxent land, and, in particular, Joshua Goodman had a coordinate ascent algorithm for maxent models that apparently worked quite well. (In searching for that paper, I just came across a 2009 paper on roughly the same topic that I hadn't seen before.) Some other algorithms have a coordinate ascent feel, for instance the LASSO (and relatives, including the Dantzig selector + LASSO = DASSO), but they wouldn't really scale well in this problem because they require a full pass over the entire index to make a single update. Other approaches, such as boosting, would fare very poorly in this setting.

This observation first led me to wonder whether we can do something LASSO- or boosting-like in this setting. But then it made me wonder whether this is a special case, or whether there are other cases in the "real world" where your data is naturally laid out as features × data points rather than data points × features. Sadly, I cannot think of any. But perhaps that's not because there aren't any.

(Note that I also didn't really talk about how to do semi-supervised learning in this setting... this is also quite unclear to me right now!)
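Here is the promised rough sketch, under a pile of assumptions: it reuses the toy `index` and `get_label` from the sketches above, treats features as binary, and, for brevity, does coordinate descent on a ridge-penalized squared loss rather than coordinate ascent on a maxent likelihood (so it is not Goodman's algorithm, just the same access pattern). The one property I care about is that a single update reads a single word's posting list, plus a cached per-document score.

```python
from collections import defaultdict

# Reuses `index` (word -> doc ids) and `get_label` from the sketches above.
weights = defaultdict(float)   # word -> weight
score = defaultdict(float)     # cached score per doc: sum of weights of its words
lam = 0.1                      # ridge penalty (arbitrary)

def update_feature(word):
    """One coordinate update: only this word's posting list is touched."""
    labeled = [d for d in index[word] if get_label(d) is not None]
    if not labeled:
        return
    # Exact coordinate minimizer of sum_d (score_d - y_d)^2 + lam * w^2,
    # using the fact that the feature is 1 on every doc in the posting list.
    residual = sum(get_label(d) - (score[d] - weights[word]) for d in labeled)
    new_w = residual / (len(labeled) + lam)
    delta = new_w - weights[word]
    weights[word] = new_w
    for d in index[word]:      # keep the cached scores consistent
        score[d] += delta

for _ in range(20):            # a few sweeps over the vocabulary
    for word in index:
        update_feature(word)

print({w: round(v, 3) for w, v in weights.items()})
```

The cached `score` is the one piece of extra state: changing one weight only invalidates the scores of the documents on that word's posting list, so each coordinate step stays local to the single index entry it reads.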