The inverted index is widely used as an IR (Information Retrieval) technique. It serves query patterns where documents must be retrieved or classified based on the presence (or absence) of certain words. For example, in a document repository, we may need to find all documents that contain the words “field”, “ball”, “players”, “bat” and “team” but not “tennis”. This might help us classify the returned documents’ content as related to the game of cricket.
It is particularly useful when the data set or document repository being processed is very large.
Naive approach for Document Query Patterns
A naive way to look up all documents in a given repository that contain (or do not contain) certain words or phrases is to scan through every document, checking for the presence or absence of the query terms, and only then produce the final result.
The disadvantages of this approach are as follows:
- The amount of data to be scanned grows linearly with the size of the repository, which can keep growing over time. In other words, the search complexity is O(n), where n is the size of the repository. This is acceptable while the repository holds a few thousand documents of about a thousand words each, but becomes a problem once both numbers grow beyond a million.
- A query with a different query string over the same data requires scanning the entire content of the repository all over again.
What does an inverted index provide?
An inverted index is similar to the word index found at the back of a book: to look up a particular word, we refer to the index, find the word, and get the list of page numbers where it appears, which leads us to the relevant pages. The book index is built only once, and any number of queries with different query strings can then be answered from it.
Similarly, an inverted index is a technique where each document in a repository is pre-processed (scanned) once, and a word/phrase index is built from it to serve any future query request with a different query string. The repository thus need not be scanned repeatedly to service different queries, which is a far more performant way of looking up documents.
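As a minimal sketch of this idea in Python (the documents below are made up for illustration; the article specifies none), the repository is scanned once to build the index, after which any query is a cheap set operation:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Scan each document once, mapping every word to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical repository (not from the article).
docs = {
    1: "the players hit the ball with the bat on the field",
    2: "the tennis players formed a team",
    3: "the team took the field with bat and ball",
}
index = build_inverted_index(docs)

# Any number of queries can now reuse the index without rescanning the documents.
cricket_docs = (index["ball"] & index["team"]) - index["tennis"]
print(sorted(cricket_docs))  # → [3]
```

Different query strings (say, `index["bat"] & index["field"]`) run against the same index; the documents themselves are never read again.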
Using bit vector (Term-document incidence matrix)
A good way of constructing an inverted index is to represent all distinct terms and all documents in the repository using what is called a term-document incidence matrix.
The idea is simple. It is an m × n matrix, where m (the number of rows) is the number of distinct terms (words) and n (the number of columns) is the total number of documents. The following is a sample representation:

         Doc 1  Doc 2  Doc 3
Term 1     1      0      1
Term 2     0      0      1
Term 3     1      1      1
Term 4     1      0      1
Term 5     1      1      0
Term 6     1      0      0
Term 7     0      1      0
So, in the above term-document incidence matrix, there are a total of 7 distinct terms spread across 3 documents in the repository. A value of 1 or 0 in a cell indicates whether the corresponding term is present or absent in the corresponding document. Note that the matrix captures neither the frequency of a term in a document nor the positions at which it occurs.
Now, imagine each row as a bit vector of size 3 with each bit position being representative of each of the documents in the repository. So, the following are the bit vectors for each of the terms:
Term 1 → 101
Term 2 → 001
Term 3 → 111
Term 4 → 101
Term 5 → 110
Term 6 → 100
Term 7 → 010
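These rows can be produced in a single pass over the repository. A small Python sketch, assuming a hypothetical three-document repository (the article's matrix names no actual words) and treating the leftmost bit as Document 1:

```python
def build_bit_vectors(docs, n_docs):
    """One pass over the repository: set bit (n_docs - doc_id) of a term's
    vector when the term occurs in that document, so the leftmost bit of an
    n_docs-bit vector corresponds to Document 1."""
    vectors = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            vectors[term] = vectors.get(term, 0) | (1 << (n_docs - doc_id))
    return vectors

# Hypothetical documents; any words would do.
docs = {1: "ball bat field", 2: "team players", 3: "ball bat"}
vectors = build_bit_vectors(docs, 3)
print(format(vectors["ball"], "03b"))  # → "101": present in Documents 1 and 3
```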
Now, given a query string such as “Give me all documents which contain Term 1 and Term 5 but not Term 7”, we can represent the operation as:
Bit vector (Term 1) AND Bit vector (Term 5) AND NOT(Bit vector (Term 7))
= (101) AND (110) AND complement(010)
= (101) AND (110) AND (101)
= (100)
which tells us that there is exactly one such document: Document 1.
Likewise, all such queries can be represented using logical operations on the term bit vectors.
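Encoding the bit vectors listed above as Python integers (leftmost bit standing for Document 1), the worked query becomes a handful of bitwise operations:

```python
N_DOCS = 3
MASK = (1 << N_DOCS) - 1  # 0b111, limits NOT to the three document bits

term1, term5, term7 = 0b101, 0b110, 0b010  # vectors from the matrix above

# "contains Term 1 and Term 5 but not Term 7"
result = term1 & term5 & (~term7 & MASK)

# Decode the set bits back into document ids (bit i from the left = Document i+1).
matches = [i + 1 for i in range(N_DOCS) if result >> (N_DOCS - 1 - i) & 1]
print(matches)  # → [1]
```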
Hence, a one-time scan of all documents constructs the term-document incidence matrix, and every subsequent query reduces to a few fast bitwise operations over the term bit vectors, with no rescanning of the documents.
Problem with the above approach
The problem with this approach manifests in the space requirements of the term-document incidence matrix when the number of terms and documents grows manifold, from just a few documents and terms to a million of each.
Going by the above approach, if we assume that each cell of the matrix is represented by a single bit, then for a million documents and a million terms the matrix consists of 10^6 * 10^6 = 10^12 cells, i.e. 10^12 bits, which is roughly 125 GB of space. If we want to process all queries in-memory, with the term-document incidence matrix resident in memory at all times, a really powerful machine (125 GB of RAM) is required.
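The arithmetic behind the 125 GB figure can be checked directly:

```python
terms = 10**6
docs = 10**6
bits = terms * docs           # one bit per cell of the incidence matrix
gigabytes = bits / 8 / 10**9  # 8 bits per byte, 10^9 bytes per (decimal) GB
print(gigabytes)  # → 125.0
```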
Hence, we need a better way to optimize the space requirement for this approach.
Sparse Matrix Analysis and Efficient Storage
If we observe the term-document incidence matrix closely (for a large number of documents and terms), we see that the number of cells holding the value 1 is a very small percentage of the total number of cells in the matrix.
Proof: Assume the total number of documents under analysis is 10^6 and the total number of distinct terms spread across them is 10^6. On average, each document contains only about 1000 terms, since document size is limited. A rough estimate of the total number of 1s in the matrix is therefore:
Average no. of terms per document * Total no. of documents
= 1000 * 10^6
= 10^9
Taking this as a percentage of the total number of cells: (10^9 / 10^12) * 100 = 0.1% of the cells contain a 1, and hence almost 99.9% of the cells contain a 0, which proves that the matrix is highly sparse.
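The sparsity estimate works out as follows:

```python
total_cells = 10**6 * 10**6  # distinct terms * documents
ones = 1000 * 10**6          # avg terms per document * documents
density = ones / total_cells * 100
print(density)  # roughly 0.1 percent of cells hold a 1; the rest hold 0
```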
Looking at the above, we realize how much space the earlier approach wasted on storing 0s. To optimize this, we can instead store, for each term, the ids of only those documents in which the term is actually present, which is equivalent to storing information only about the cells containing 1s. This reduces the storage cost a great deal, since only 0.1% of the original data now needs to be stored.
Hence, the inverted index can be represented as a document list for each term. Continuing our example, the representation would be:
Term 1: 1 → 3
Term 2: 3
Term 3: 1 → 2 → 3
Term 4: 1 → 3
Term 5: 1 → 2
Term 6: 1
Term 7: 2
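In Python terms, this representation is simply a mapping from each term to its sorted doc id list; a sketch using the example data:

```python
# The inverted index from the example: each term maps to a sorted doc id list.
inverted_index = {
    "Term 1": [1, 3],
    "Term 2": [3],
    "Term 3": [1, 2, 3],
    "Term 4": [1, 3],
    "Term 5": [1, 2],
    "Term 6": [1],
    "Term 7": [2],
}

# A single-term lookup is now a dictionary access, not a repository scan.
print(inverted_index["Term 5"])  # → [1, 2]
```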
It can be debated whether to use a variable-sized array or a linked list to store these document lists. The two data structures address different kinds of scenarios, so it is up to the implementer to weigh the tradeoffs and select the appropriate one.
A variable-sized array stores the data contiguously, which optimizes sequential access: when a block of memory is read, consecutive doc ids are cached together by the processor for future access.
A disadvantage of a variable-sized array is that when a term is dynamically added to a document, the sorted nature of the doc id list (we will see later why it is kept sorted) may force a large number of array elements to be shifted to accommodate the new entry. That is where a linked list comes in handy.
Sometimes, even this reduced data may be too large to hold entirely in memory for a huge number of documents. It would then need to be stored on disk and loaded into memory on demand, and storing the ids contiguously means fewer disk seeks when loading all doc ids for a given term. The data structure should be selected with this in mind as well.
Performing queries on document lists
To perform queries like the earlier example, where we needed the documents containing Term 1 and Term 5 but not Term 7, we follow the merge technique from merge sort. Since the document lists are sorted, we can, just as in the merge phase of merge sort, walk through the doc ids of the two terms involved in the query and take the appropriate action at each step. Consider the example we just took:
In order to compute: (Term 1) AND (Term 5) AND (NOT(Term 7)), let us write down the document lists for each of these terms:
Term 1: 1 → 3
Term 5: 1 → 2
Term 7: 2
To combine Term 1 and Term 5, similar to the merge phase of the merge sort algorithm, we keep two pointers, one at the first element of each list. The elements are compared, and since this is an AND operation, if they are equal the doc id is written to the result list and both pointers are advanced; otherwise the pointer at the smaller element is advanced. Here is a small piece of pseudo code to illustrate:
pointer1 = operandList1 → first element
pointer2 = operandList2 → first element
while (pointer1 is not null AND pointer2 is not null)
    if (docId at pointer1 equals docId at pointer2)
        append docId at pointer1 to resultList
        advance pointer1 and pointer2
    else if (docId at pointer1 is less than docId at pointer2)
        advance pointer1
    else
        advance pointer2
Hence, (Term 1) AND (Term 5) results in (1)
A similar merge walk can then combine this result with the document list of Term 7, this time keeping only the doc ids absent from Term 7’s list (an AND NOT merge, which avoids materializing the complement), to obtain the final result.
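The merge walks described above can be written as runnable Python, assuming sorted doc id lists; applied to the example lists, the query yields Document 1:

```python
def intersect(a, b):
    """AND: merge-style walk over two sorted doc id lists."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # present in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def and_not(a, b):
    """AND NOT: keep doc ids present in a but absent from b."""
    i = j = 0
    result = []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            result.append(a[i])  # no matching id in b
            i += 1
        elif a[i] == b[j]:
            i += 1
            j += 1
        else:
            j += 1
    return result

term1, term5, term7 = [1, 3], [1, 2], [2]
print(and_not(intersect(term1, term5), term7))  # → [1]
```

Both walks advance each pointer at most once per element, so a query costs O(len(a) + len(b)) regardless of the repository size.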