Apache Lucene: Part Two - Indexing Concepts
The previous article, Apache Lucene: The First Part, gave a brief introduction to what Lucene is and a high-level overview of how it works. This article is a deep dive into the concepts behind indexing, mainly the inverted index and parsing & tokenization, so that the code is easy and intuitive to follow when we finally get to it.
Parsing & Tokenization
Whenever a document needs to be indexed, it first needs to be processed. The main reason for this is to make indexing and searching more efficient. We'll discuss some common preprocessing techniques that are widely used in indexing.
One of the most common preprocessing steps is removing stopwords from the document's content.
What are stopwords? Stopwords are common words that show up in virtually any document, and eliminating them has no effect on the main idea of that document. Examples are articles and prepositions (a, an, in, at, on, of, the, etc.). These words are excluded so that they don't occupy extra space in the index and don't waste time and resources during search.
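As a minimal plain-Java sketch of this step (the class name and the tiny stopword set here are purely illustrative; Lucene performs this through its analysis chain, which we will meet in the next post):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordFilter {

    // A tiny illustrative stopword set; real stopword lists are much larger.
    private static final Set<String> STOPWORDS =
            Set.of("a", "an", "in", "at", "on", "of", "the");

    public static List<String> removeStopwords(List<String> tokens) {
        return tokens.stream()
                .filter(token -> !STOPWORDS.contains(token.toLowerCase()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("history", "repeats", "itself", "in", "the", "end");
        System.out.println(removeStopwords(tokens));
        // prints: [history, repeats, itself, end]
    }
}
```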
Another common method in preprocessing is Stemming and Lemmatization.
Stemming and lemmatization are text normalization processes. The main idea behind both techniques is to reduce words to their base form; stemming usually does this by heuristically stripping suffixes, while lemmatization maps a word to its dictionary form. For example, history, historical, and historically share the base word history. The same word can appear in different forms for grammatical reasons, but from a search engine's point of view they all mean the same thing, so it is more efficient to reduce them to a single term than to keep three different words representing the same meaning.
A very popular algorithm for performing this step is Porter's algorithm. There are several other algorithms besides Porter's; there's a great paper on this, which you can read for more information on the different algorithms.
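To get a feel for what stemming does, here is a toy suffix stripper in plain Java. To be clear, this is not Porter's algorithm (Lucene ships a PorterStemFilter for the real thing); it only illustrates the idea. Notice that the resulting stem, histor, is not a dictionary word, and that's fine, as long as all variants map to the same term:

```java
import java.util.List;

public class ToyStemmer {

    // Suffixes are checked longest-first; the length guard avoids
    // stripping a suffix from a word that is barely longer than it.
    private static final List<String> SUFFIXES =
            List.of("ically", "ical", "ing", "ed", "y", "s");

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        for (String word : List.of("history", "historical", "historically")) {
            System.out.println(word + " -> " + stem(word));
        }
        // history -> histor
        // historical -> histor
        // historically -> histor
    }
}
```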
Tokenization
Tokenization is the process of splitting the content into tokens, which are the most basic units of the content.
A very simple way of tokenizing is splitting on whitespace.
For example, if the content is history repeat itself many time, then whitespace splitting will result in the tokens [history, repeat, itself, many, time].
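In Java, such a tokenizer can be as simple as a split on a whitespace regex (a minimal sketch, not how Lucene's tokenizers are actually implemented):

```java
import java.util.Arrays;
import java.util.List;

public class SimpleWhitespaceTokenizer {

    // Splits on runs of whitespace; real tokenizers also handle
    // punctuation, hyphenation, apostrophes, and so on.
    public static List<String> tokenize(String content) {
        return Arrays.asList(content.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("history repeat itself many time"));
        // prints: [history, repeat, itself, many, time]
    }
}
```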
There are various ways of tokenizing beyond plain whitespace splitting, such as Treebank word tokenization and punctuation-based tokenization. To go through them in detail, you can refer to this article: Tokenization in NLP: Types, Challenges, Examples, Tools
Inverted Index
Once all the preprocessing is done and tokens are available, the tokens are used to create the index. A very traditional and simple way of creating an index would be a table where rows represent documents, columns represent tokens, and each cell indicates whether that token is present in a given document.
There is one big drawback to this approach: the matrix will be extremely sparse, since there will be a large number of tokens and few will be common between any given pair of documents, so you will end up using a lot of memory. One way to optimize this is to maintain a table of lists, where each row is a document and its list contains only the unique IDs of the tokens that appear in it.
While this saves space, it still has the issue that whenever a search is performed, each document's list has to be individually checked to see if it contains all the tokens of the query.
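A small sketch of this document-to-tokens layout and its search cost (all document and token IDs below are made up for illustration):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class ForwardIndexScan {

    public static void main(String[] args) {
        // Each document ID maps to the set of token IDs it contains.
        Map<Integer, Set<Integer>> forwardIndex = new TreeMap<>(Map.of(
                1, Set.of(1, 7, 35),
                2, Set.of(2, 7),
                3, Set.of(1, 35, 40)));

        Set<Integer> queryTokens = Set.of(1, 35);

        // The drawback: every document's list must be checked on every query.
        for (Map.Entry<Integer, Set<Integer>> entry : forwardIndex.entrySet()) {
            if (entry.getValue().containsAll(queryTokens)) {
                System.out.println("doc " + entry.getKey() + " matches");
            }
        }
        // prints: doc 1 matches
        //         doc 3 matches
    }
}
```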
A better way is to flip the table around: the rows contain tokens, and each token's list contains the documents in which it appears. This is the inverted index.
The major advantage of this index is that at search time you can take the list for each token of the query, perform a Boolean AND (an intersection) of those lists, and return the common documents.
So if the query comprises the tokens tk1, tk7, and tk35, all you have to do is retrieve the lists for these three tokens and find the documents common to all of them.
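Here is a minimal sketch of that lookup, with hypothetical postings lists for tk1, tk7, and tk35; the set intersection is the Boolean AND described above:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexAnd {

    public static void main(String[] args) {
        // Each token maps to the sorted set of documents containing it
        // (its postings list).
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        index.put("tk1", new TreeSet<>(List.of(1, 3, 5, 9)));
        index.put("tk7", new TreeSet<>(List.of(1, 2, 5)));
        index.put("tk35", new TreeSet<>(List.of(1, 5, 8)));

        // Boolean AND: intersect the postings lists of the query tokens.
        TreeSet<Integer> result = new TreeSet<>(index.get("tk1"));
        result.retainAll(index.get("tk7"));
        result.retainAll(index.get("tk35"));

        System.out.println(result);
        // prints: [1, 5]
    }
}
```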
This is a very simple illustration of how an inverted index makes searching easier and more efficient than a standard index.
To read more on inverted indexes, you can refer to the following link: https://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html
Now that we have covered the basics and background of indexing (what an inverted index is, what stemming is, text preprocessing, and so on), we can move on to actually understanding and using Lucene in the next post.
Hope you had fun reading this, and if I happen to have made a mistake somewhere, please let me know.