Apache Lucene: The First Part
If you have already read this article, you can check Part 2 here:
- Apache Lucene: Part Two, Indexing Concepts
Hi, I'm back after quite some time. My last article was on Next.js, and I had planned a couple more posts on the same topic, building out a project, but I couldn't go ahead with it; I got swamped at work.
So this time around, I decided to write an article on something related to work.. it's easier to do the research since it helps at work as well 😬.
And that’s how I ended up at Apache Lucene.
This is the first of a series of articles that I plan to write on Lucene: how to build applications using it, its inner workings, and so on.
***I am learning about Lucene myself while writing this series, so I might miss some details or make some mistakes. Whenever I find one, I will definitely update the post or release a new one talking about it.***
What is Apache Lucene?
Apache Lucene is an open-source full-text search engine library; that is, you can develop your own search application using it. And the best part: it's free 😎 and it comes with an Apache License, which means you don't need to worry about royalties when using it.
Apache Lucene is written in Java, which makes it straightforward to build a well-behaved application. Yes, Java requires you to write a lot of code, but the end result is an overall well-built application that is fast and stable (provided you take care of things like exception handling).
If you want to use Lucene with something other than Java (Python, JavaScript/Node.js, C++), there are ports available as well: PyLucene, Lucene++, node-lucene, etc.
Elasticsearch, the popular search engine, is built on top of Lucene, so you can imagine just how diverse Lucene's uses are.
How Does Apache Lucene Work?
At a very high level, Lucene goes through the data that you want to be able to search and builds an index out of it. Then it takes a query, searches the index built earlier, and returns the results.
That description is so high level that it makes Lucene sound like any other generic search engine, so here's a slightly more detailed breakdown.
Say you had a collection of 1,000 research papers. Lucene will go through each paper: it will read it, parse and tokenize it, form an inverted index out of it, and store that index. When a query comes in, it will parse and tokenize the query the same way it parsed the research papers, search for the parsed query in its index, and return the matching papers.
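To make that cycle concrete, here is a minimal sketch using Lucene's Java API. It assumes Lucene 8 or 9 with the lucene-core and lucene-queryparser dependencies on the classpath; the field name body, the class name, and the sample texts are all made up for this example.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneHelloWorld {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory(); // in-memory index, fine for a demo

        // Indexing phase: each "paper" is parsed and tokenized by the analyzer
        // and added to the inverted index.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            String[] papers = {
                "A study of inverted indexes in search engines",
                "Tokenization strategies for full-text search",
                "Ranking research papers with boolean queries"
            };
            for (String text : papers) {
                Document doc = new Document();
                doc.add(new TextField("body", text, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // Search phase: the query is run through the SAME analyzer as the
        // documents, then matched against the index.
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("inverted index");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```

Don't worry about the individual classes yet; the point is the shape of the flow: analyze and index the documents, then analyze the query the same way and search.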
In that description, there are two terms that need a little more explanation.
Inverted Index
What is an inverted index? Well, an index is just a data store that helps accelerate searching, and there are many ways to create one; an inverted index is one of them. In an inverted index, every document is broken down into tokens, and for each unique token a list is formed; this list contains the number / ID of every document that contains the token.
Whenever a query is encountered, it is processed the same way a document is processed and broken down into tokens. Then, for each token, you fetch its list of documents, perform a boolean AND across the lists, and get the documents common to all of them.
That is the simplest kind of search you can perform using an inverted index.
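Here is a toy version of that idea in plain Java, just to make the data structure tangible. The class name, the whitespace tokenizer, and the sample documents are all invented for illustration; real inverted indexes (including Lucene's) are far more sophisticated.

```java
import java.util.*;

public class ToyInvertedIndex {
    // token -> sorted set of IDs of the documents containing it (the "postings list")
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        // Naive tokenization: lowercase, then split on whitespace.
        for (String token : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Boolean AND: intersect the postings lists of every query token.
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : query.toLowerCase().split("\\s+")) {
            Set<Integer> docs = postings.getOrDefault(token, new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        if (result == null) return Collections.emptySet();
        return result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(1, "the window is open");
        index.addDocument(2, "open source search");
        index.addDocument(3, "the search window");
        // Only document 3 contains both "search" and "window", so this prints [3].
        System.out.println(index.search("search window"));
    }
}
```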
Parse and Tokenize a Document
What does it mean to parse and tokenize a document?
Whenever a document is read, it is broken down into tokens. While this is the approach we use, doing it directly without preprocessing the documents' text is inefficient. Some examples:
1. All documents have very common words such as the, a, an, is, etc., called stopwords. These are words that are not unique/restricted to particular documents and have no significant impact on locating the best results. They end up increasing the size of the index without actually providing a big boost in performance or accuracy. Thus one preprocessing step is to remove these stopwords.
2. A lot of words are variants of one another; for example, the same document could have the word window as well as windows. Saving them separately won't provide any boost in the efficiency or accuracy of the results, hence we can collapse the two into the one word window. This is known as stemming.
There are many more ways of preprocessing a document. The main goal is to index the most important and distinctive characteristics of the documents, to make searching the index faster and easier.
Once the parsing is done, we can then perform the tokenization. Tokenization is splitting the document into its basic tokens to form the inverted index. One way of tokenizing is whitespace-based splitting of the document, where each individual word becomes a token.
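In Lucene, parsing and tokenization are handled by analyzers. Here is a small sketch of what its built-in EnglishAnalyzer produces for a sentence; it assumes the lucene-analysis-common dependency (named lucene-analyzers-common in older Lucene versions) is on the classpath.

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    public static void main(String[] args) throws IOException {
        // EnglishAnalyzer tokenizes, lowercases, removes English stopwords
        // and applies stemming, all in one pipeline.
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream stream = analyzer.tokenStream("body", "The windows of the house are open")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString()); // one processed token per line
            }
            stream.end();
        }
    }
}
```

With the default pipeline this should print something like window, hous and open: the stopwords are gone, windows collapsed into window, and house was stemmed to hous (stems don't have to be real words, they only have to be consistent between documents and queries).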
It makes logical sense to parse and tokenize the query the same way the documents were, so that searching can yield the best results.
😵😵😵😵.. I hope it's not too much of an information overload. In any case, I believe I should stop here and discuss these details in a separate article of their own.
So Far
So till now, we know what Apache Lucene is and have a high-level overview of how it works. Lucene, and search in general, is a very big topic, and just one article can never be enough. In the next one, I will deep dive into indexing: what an inverted index really is, what parsing and tokenization are, the different methods to do them, their advantages and disadvantages, etc.
Hope you had fun reading it, and if I happen to have made a mistake somewhere, please let me know.