natural language processing blog: information retrieval
Due to a small off-the-radar project I'm working on right now, I've been building my own inverted indices. (Yes, I'm vaguely aware of discussions in DB/Web search land about whether you should store your inverted indices in a database or whether you should hand-roll your own. This is tangential to the point of this post.)

For those of you who don't remember your IR 101, here's the deal with inverted indices. We're a search engine and want to be able to quickly find pages that contain query terms. One way of storing our set of documents (e.g., the web) is to store a list of documents, each of which is a list of the words appearing in that document. If there are N documents of length L, then answering a query is O(N*L), since we have to scan every document to see if it contains the word we care about. The alternative is to store an inverted index, where we have a list of words