Saturday, September 26, 2015

Inverted Index

Use Case

The inverted index is primarily used as an IR (Information Retrieval) technique. It suits query patterns where documents must be retrieved or classified based on the presence or absence of certain words. For example, in a document repository we may need to find all documents that contain the words “field”, “ball”, “players”, “bat” and “team” but not “tennis”; the returned documents can then be classified as likely related to the game of cricket.


It is particularly useful when the data set/document repository being processed is very large.


Naive approach for Document Query Patterns

A naive way to find all documents in a repository that contain (or do not contain) certain words or phrases is to scan through every document in the repository, checking for the presence or absence of the query terms, before producing the final result.


The disadvantages of this approach are as follows:


  1. The amount of data scanned grows linearly with the size of the repository, which keeps growing over time. In other words, the search complexity is O(n), where n is the size of the repository. This is acceptable while the repository holds a few thousand documents of about a thousand words each, but becomes a problem once both numbers go beyond a million.
  2. A query with a different query string on the same data requires scanning the entire content of the repository all over again.


What does inverted index provide?

An inverted index is similar to the word index found at the back of a book: to look up a particular word, we refer to the index, find the word, get the list of page numbers where it appears, and turn to the relevant pages. The book index is built only once, and any number of lookups with different words can then be made against it.


Similarly, an inverted index is built by pre-processing or scanning each document in the repository once and constructing a word/phrase index from it, which can then serve any future query with a different query string. The repository therefore need not be scanned repeatedly to service different queries, which makes document lookup far more efficient.


Implementation

Using bit vector (Term-document incidence matrix)

A simple way of constructing an inverted index is to represent all distinct terms and all documents in the repository using a term-document incidence matrix.
The idea is simple: it is an m x n matrix where m is the number of distinct terms (words), one per row, and n is the total number of documents, one per column. Following is a sample representation:


          Document 1   Document 2   Document 3
Term 1        1            0            1
Term 2        0            0            1
Term 3        1            1            1
Term 4        1            0            1
Term 5        1            1            0
Term 6        1            0            0
Term 7        0            1            0


So, in the above term-document incidence matrix there are 7 distinct terms spread across 3 documents in the repository. A value of 1 or 0 in a cell indicates whether the corresponding term is present in or absent from the corresponding document. Note that the matrix captures neither the frequency nor the position of a term's occurrences within a document.


Now, imagine each row as a bit vector of size 3 with each bit position being representative of each of the documents in the repository. So, the following are the bit vectors for each of the terms:


Term 1  →  101
Term 2  →  001
Term 3  →  111
Term 4  →  101
Term 5  →  110
Term 6  →  100
Term 7  →  010


Now, given a query such as “Give me all documents which contain Term 1 and Term 5 but not Term 7”, we can represent the operation as:
Bit vector (Term 1) AND Bit vector (Term 5) AND NOT(Bit vector (Term 7))


= (101) AND (110) AND complement(010)
= (101) AND (110) AND (101)
= (100)


which gives us the result that there is only one such document which is Document 1.


Likewise, all such queries can be represented using logical operations on the term bit vectors.
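To make this concrete, here is a minimal Python sketch (the dictionary term_vectors and the mask handling are illustrative, not from the original post) that stores each term's row as an integer bit vector and answers the query above with bitwise operations:

# Each term's row of the incidence matrix as a 3-bit vector
# (most significant bit = Document 1, least significant bit = Document 3).
term_vectors = {
    "Term 1": 0b101, "Term 2": 0b001, "Term 3": 0b111, "Term 4": 0b101,
    "Term 5": 0b110, "Term 6": 0b100, "Term 7": 0b010,
}

NUM_DOCS = 3
ALL_DOCS = (1 << NUM_DOCS) - 1   # 0b111; needed because Python integers are unbounded

# Term 1 AND Term 5 AND NOT Term 7
result = (term_vectors["Term 1"]
          & term_vectors["Term 5"]
          & (~term_vectors["Term 7"] & ALL_DOCS))

# Translate the result bits back into document numbers.
matching = [NUM_DOCS - bit for bit in range(NUM_DOCS) if result & (1 << bit)]
print(format(result, "03b"), matching)   # 100 [1]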


Hence, there is a one-time scan of all documents to construct the term-document incidence matrix, and all subsequent queries can be answered with fast bitwise operations on the term vectors, without rescanning the repository.


Problem with the above approach


The problem with this approach is the space required by the term-document incidence matrix once the number of terms and documents grows from a handful to, say, a million each.


Going by the above-mentioned approach, if we assume that each cell in the term-document incidence matrix is represented by just a single bit, then for a million documents and a million terms the matrix consists of 10^6 * 10^6 = 10^12 cells, i.e. 10^12 bits, which is roughly 125 GB. If we want to process all queries in-memory and keep the term-document incidence matrix in memory all the time, a really powerful machine (about 125 GB of RAM) is required.


Hence, we need a better way to optimize the space requirement for this approach.


Sparse Matrix Analysis and Efficient Storage

If we observe the term-document incidence matrix closely (for a large number of documents and terms), we would see that the number of cells having the value 1 is a very small percentage of the overall number of cells in the matrix.


Proof: Let us assume that the total number of documents under analysis is 10^6 and the total number of distinct terms spread across these documents is also 10^6. On average, each document contains only about 1000 terms, since document size is limited. Hence, a rough estimate of the total number of 1's in the matrix is given by:

Average no. of terms in a document * Total no. of documents = 1000 * 10^6 = 10^9

Hence, the total number of 1's in the term-document matrix is 10^9. Taking this as a percentage of the total number of cells in the matrix, (10^9 / 10^12) * 100 = 0.1% of the cells contain 1's, and hence almost 99.9% of the cells contain 0, which shows that the matrix is highly sparse.
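The same back-of-the-envelope arithmetic, written as a tiny Python sketch (the figures are just the illustrative estimates used above):

docs = terms = 10**6
dense_bits = docs * terms                 # one bit per cell of the incidence matrix
print(dense_bits / 8 / 10**9)             # 125.0 -> roughly 125 GB for the dense matrix

avg_terms_per_doc = 1000
ones = avg_terms_per_doc * docs           # estimated number of cells containing 1
print(100 * ones / dense_bits)            # 0.1 -> only 0.1% of the cells are non-zero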


Looking at the above, we can see how much space the earlier approach wastes storing 0's. To optimize this, we can instead store, for each term, only the ids of the documents in which the term is actually present, which is equivalent to storing information only about the cells containing 1's. This reduces storage costs dramatically, since only about 0.1% of the original data now needs to be stored.


Hence, the inverted index can be represented as a document list (postings list) for each term. Continuing our example, the representation would be:


Term 1: 1 → 3
Term 2: 3
Term 3: 1 → 2 → 3
Term 4: 1 → 3
Term 5: 1 → 2
Term 6: 1
Term 7: 2
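As a rough illustration, here is a minimal Python sketch (the function name, the naive whitespace tokenization and the sample documents are all made up for this example) that builds such term-to-document lists:

from collections import defaultdict

def build_inverted_index(documents):
    # documents is assumed to be a dict of {doc_id: text}.
    # Each term maps to a sorted list of ids of the documents containing it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():   # naive tokenization, for illustration only
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "field ball players bat team", 2: "tennis players team", 3: "field ball bat"}
index = build_inverted_index(docs)
print(index["ball"])   # [1, 3]
print(index["team"])   # [1, 2]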


One can debate whether to use a variable-sized array or a linked list to store each document list. The two data structures suit different scenarios, and it is up to the implementer to weigh the trade-offs and pick the appropriate one.


A variable-sized array keeps the data together, so sequential access is fast. When a block of data is read from memory, contiguous doc ids end up cached together in the processor for future accesses.


A disadvantage of a variable-sized array is that if a term is added to a document dynamically, then because the doc id list is kept sorted (we will see later why), a large number of array elements might need to be shifted to accommodate the new entry. That is where a linked list comes in handy.


For a huge number of documents, even this reduced data might not fit in memory, in which case it has to be stored on disk and loaded whenever required. Storing the ids contiguously then means fewer disk seeks when loading all the doc ids for a given term, and the choice of data structure should take this into account as well.



Performing queries on document lists


To perform queries like the earlier example, where we needed the documents containing Term 1 and Term 5 but not Term 7, we use a technique similar to the merge phase of merge sort. Since the document lists are sorted, we can walk through the doc ids of the two terms involved in the query in step and take the appropriate action at each comparison. Consider the same example:


In order to compute: (Term 1) AND (Term 5) AND (NOT(Term 7)), let us write down the document lists for each of these terms:


Term 1: 1 → 3
Term 5: 1 → 2
Term 7: 2


To combine Term 1 and Term 5, as in the merge phase of merge sort, we keep two pointers, one at the start of each list. The elements are compared; since this is an AND operation, if they are equal the doc id is written to the result list and both pointers advance, otherwise only the pointer at the smaller element advances. Here is a small Python sketch of the same:


def intersect(postings1, postings2):
    # AND of two sorted doc id lists, done in a single merge pass.
    result = []
    i, j = 0, 0
    while i < len(postings1) and j < len(postings2):
        if postings1[i] == postings2[j]:
            result.append(postings1[i])   # present in both lists
            i += 1
            j += 1
        elif postings1[i] < postings2[j]:
            i += 1                        # advance the pointer at the smaller doc id
        else:
            j += 1
    return result

print(intersect([1, 3], [1, 2]))          # [1]


Hence, (Term 1) AND (Term 5) results in (1)


A similar merge can then handle the AND NOT part: walk the intermediate result and Term 7's list together, keeping only the doc ids that do not appear in Term 7's list, which yields the final result.
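A minimal sketch of that AND NOT merge (again with illustrative names), reusing the intersect function defined above:

def and_not(postings, excluded):
    # Doc ids in the sorted list `postings` that are absent from the sorted list `excluded`.
    result = []
    i, j = 0, 0
    while i < len(postings):
        if j < len(excluded) and excluded[j] < postings[i]:
            j += 1                         # advance in the excluded list
        elif j < len(excluded) and excluded[j] == postings[i]:
            i += 1                         # doc id is excluded, skip it
            j += 1
        else:
            result.append(postings[i])     # keep doc ids not present in `excluded`
            i += 1
    return result

print(and_not(intersect([1, 3], [1, 2]), [2]))   # [1]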

Friday, April 17, 2015

TF-IDF (Term Frequency-Inverse Document frequency)

TF-IDF is a fundamental metric used extensively in classifying and ranking documents.
Let us define its two components:

Term frequency captures how often a word occurs in a document relative to the other words in that document. It is defined as follows:

TF_ij = f_ij / max_k f_kj


where TF_ij is the term frequency of the i-th word in the j-th document, f_ij is the frequency of that word in the document, and max_k f_kj is the frequency of the word that occurs the maximum number of times in that document.
Hence the term frequency of a word in a particular document can be at most 1.

Inverse document frequency, on the other hand, reflects how widely the word occurs across the whole collection (of documents we want to classify into different categories). If there are a total of N documents and the i-th word is present in n_i of them, its IDF can be expressed as:

IDF_i = log2(N / n_i)

The terms with the highest TF * IDF scores are considered to best characterize/classify a document.

Interpretation: a word characterizes a document well when it is relatively rare across the collection (so it distinguishes this document from the others) and yet occurs frequently within the document compared to other words.
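A minimal Python sketch of these two formulas (the tokenized toy documents and the function name are made up for illustration):

import math
from collections import Counter

def tf_idf(docs):
    # docs is a list of token lists; returns one {term: TF*IDF} dict per document.
    n_docs = len(docs)
    doc_freq = Counter()                  # n_i: number of documents containing term i
    for tokens in docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        f_max = max(counts.values())      # frequency of the most frequent term in this document
        scores.append({term: (f / f_max) * math.log2(n_docs / doc_freq[term])
                       for term, f in counts.items()})
    return scores

docs = [["ball", "bat", "team", "ball"], ["tennis", "team"], ["ball", "field"]]
print(tf_idf(docs)[0])   # "bat" and "ball" score higher than the widely shared "team"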

External Sort - Disk Sort

Hadoop uses external sort/disk sort extensively in its shuffle and sort phase.

Following are the basic steps for a typical external sorting/disk sorting algorithm (a minimal sketch follows the list):

  1. Read data until there is no space left.
  2. Sort it.
  3. Write it to a temporary file.
  4. Repeat from step 1 until there is no data left to read.
  5. Count the number of temporary files, say N.
  6. Determine how many of those files you can read at one time, say M.
  7. If N > M, then design your merging phase so that the last phase will merge M files.
  8. Merge sets of M files into new temporary files until you reach the last merge.
  9. Merge the final set of M files (or all N files if N < M), writing to the final destination.
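Here is a minimal Python sketch of these steps (run size, file handling and names are simplified; a single merge pass is assumed, i.e. the number of runs stays within the limit of files that can be read at once):

import heapq
import tempfile

def write_run(sorted_chunk):
    # Steps 2-3: write one sorted run to a temporary file, one number per line.
    f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    f.write("\n".join(str(n) for n in sorted_chunk) + "\n")
    f.close()
    return f.name

def external_sort(numbers, run_size=1000):
    # Steps 1-4: read until the in-memory chunk is full, sort it, spill it to disk.
    runs, chunk = [], []
    for n in numbers:
        chunk.append(n)
        if len(chunk) == run_size:
            runs.append(write_run(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(write_run(sorted(chunk)))

    # Steps 5-9 (single pass): stream every run back and merge them lazily.
    streams = [(int(line) for line in open(path)) for path in runs]
    return list(heapq.merge(*streams))

print(external_sort([5, 3, 9, 1, 8, 2, 7], run_size=3))   # [1, 2, 3, 5, 7, 8, 9]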

Tuesday, April 14, 2015

Parquet File Format


Background
The Parquet file format was developed jointly by Twitter and Cloudera. Its nested column representation is inspired by the record shredding and assembly technique described in Google's Dremel paper, which uses Protocol Buffers (ProtoBuf) to define nested schemas.

Motivation


There was a need for an efficient file format to store and transfer data for large-scale data processing. The conventional row-based format used by traditional RDBMS-style databases is not well suited to analytical workloads: it stores all the values of a row together, with subsequent rows stored sequentially. Analytics, in general, needs only specific columns at a time rather than complete rows. A columnar representation, where the values of a column are stored together, avoids a lot of disk seeks when retrieving data column-wise. Parquet is one such columnar file format.

Design


Why columnar?

In most data analysis scenarios, only a few columns need to be fetched at a time to apply analytical functions and make sense of data patterns. With a row-based storage format, loading specific columns can be done in two ways:

  1. Loading complete rows into memory from disk and then filtering out the required columns from the loaded data. This can be expensive, because a lot of redundant data has to be loaded, which means more disk I/O and pressure on the limited amount of memory available.
  2. Seeking specifically to the blocks where the desired column data lives for each row. Since disk seeks are very expensive in terms of performance, this is highly undesirable.

Hence, if the data pertaining to a column is stored contiguously (together) on disk, querying specific columns for analytics becomes far more efficient in terms of both disk I/O and memory utilization.
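For a concrete feel of column-wise retrieval, here is a small sketch using the pyarrow library (assuming pyarrow is installed; the file name and column names are made up):

import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to a Parquet file.
table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["IN", "US", "IN"],
    "clicks":  [10, 25, 7],
})
pq.write_table(table, "events.parquet")

# Read back only the columns needed for the analysis;
# the remaining column chunks are never fetched.
subset = pq.read_table("events.parquet", columns=["country", "clicks"])
print(subset.to_pydict())   # {'country': ['IN', 'US', 'IN'], 'clicks': [10, 25, 7]}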

Another side benefit of storing columns together instead of rows is the homogeneity of the data stored together. A column has a fixed data type (string, integer, boolean, etc.) as opposed to the heterogeneous values (a mix of data types) in a row. It is therefore easier to apply encoding schemes to columnar data, and they can even be column specific, such as delta encoding for integers and prefix/dictionary encoding for strings. Also, because of this homogeneity there is far more redundancy and duplication among the values of a given column, which allows better compression than for data stored in row format.
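As a toy illustration of a column-specific encoding, here is a minimal delta encoding sketch in Python (the function names and sample values are made up for this example):

def delta_encode(values):
    # Store the first value, then only the difference from the previous value.
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out, running = [], 0
    for d in deltas:
        running += d
        out.append(running)
    return out

timestamps = [1428900000, 1428900003, 1428900010, 1428900011]
encoded = delta_encode(timestamps)
print(encoded)                               # [1428900000, 3, 7, 1] -- small deltas compress well
print(delta_decode(encoded) == timestamps)   # True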

Parquet therefore adheres to the above columnar design to achieve efficient disk I/O performance in conjunction with healthy encoding and compression of data.

Addressing unstructured data


With more and more data flowing in from multiple systems on the Web, there has been a marked increase in the amount of unstructured data compared to the regular transactional data stored in RDBMS-style databases. The challenge is to add some structure to this data so that it can be queried. Google's Dremel paper explains how such data can be represented in a nested fashion, for example:

User
Name : <User_Name>
Place
        Country: <Country_Name>
        State: <State_Name>
        City: <City_Name>


Now to enable querying of this data efficiently and to store homogeneous data together (data for the same column together), we need a file format definition which could address these factors in the best possible way.

How Parquet stores nested data


There is a notion of repetition and definition levels, as described in Google's Dremel paper with respect to Protocol Buffers.

Consider a nested structure in which a Repository contains repeated Pages, each Page contains repeated Links, and the Repository also has a required Name. It can be represented by the following schema:

Repository {
    repeated Page {
        repeated Link
    }
    required Name
}

Representing the data according to the above structure:

Repository {
    Page {
        Link: Link1
        Link: Link2
    }
    Page { }
    Page {
        Link: Link3
    }
    Name: Name1
}

Thus, it may be observed that the leaf nodes can be mapped to actual columns i.e. Repository.Page.Link and Repository.Name in this case.

The definition level records the lowest (deepest) level in the hierarchy that is actually defined (non-null) for a value.
The repetition level records the level in the hierarchy at which the most recent repetition occurred (0 meaning a new record).
Using these two attributes and the structure defined above, the data can be represented as follows:


Repetition Level (R)   Definition Level (D)   Value
         0                      2             Link 1
         2                      2             Link 2
         1                      1             NULL
         1                      2             Link 3
         0                      0             Name 1

In practice, Parquet optimizes this further by not actually storing the NULL value; the definition level alone is enough to indicate that the value is missing.

Thus, the values for the column Repository.Page.Link are all stored together, in keeping with columnar storage. This scheme is also economical in representing the column's key, since the full nested path does not have to be repeated for every value.

Using the above table and the original schema for the data i.e.

Repository {
    repeated Page {
        repeated Link
    }
    required Name
}

we can generate the original data values as follows:

  1. Repetition Level 0 and definition level 2 i.e. first occurrence of Page and first occurrence of Link with a defined value in the end. Hence,
    Repository {
        Page {
            Link: Link 1
  2. Repetition Level 2 and definition level 2 i.e. repeated occurrence of Link with a value defined in the end. Hence,
    Repository {
        Page {
            Link: Link 1
            Link: Link 2
  3. Repetition level 1 and definition level 1, i.e. a repeated occurrence of Page; but since a definition level of 2 is needed for a Link value to be present and here it is only 1, the value is NULL. So this is an empty occurrence of Page.
    Repository {
        Page {
            Link: Link 1
            Link: Link 2
        }
        Page { }
  4. Repetition Level 1 and definition level 2 i.e repeated occurrence of Page with a defined value in the end.
    Repository {
        Page {
            Link: Link 1
            Link: Link 2
        }
        Page { }
        Page {
            Link: Link 3
  5. Repetition level 0 and definition level 0, i.e. no repeated occurrence; move to the next attribute at the root. The next attribute is Name, and since it is a “required” field, a definition level is not needed. The value is assigned to Name.
    Repository {
        Page {
            Link: Link 1
            Link: Link 2
        }
        Page { }
        Page {
            Link: Link 3
        }
        Name: Name 1
    }
  6. There are no further values in the table, hence the reconstruction of the data is complete.

Note that such a table can be built independently for each of the 2 columns, i.e. Repository.Page.Link and Repository.Name, and each column can then be reconstructed on its own, as the sketch below illustrates.
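A minimal Python sketch of this assembly for the Repository.Page.Link column (single record only; the function name and tuple layout are made up for this illustration):

def assemble_pages(column):
    # Rebuild the nested Pages/Links of one record from (R, D, value) triples.
    # R (repetition level): 0 starts the record, 1 starts a new Page,
    #                       2 appends another Link to the current Page.
    # D (definition level): 2 means a Link value is present,
    #                       1 means the Page exists but holds no Link.
    pages = []
    for r, d, value in column:
        if r <= 1:             # level 0 or 1: a new Page begins
            pages.append([])
        if d == 2:             # the Link itself is defined
            pages[-1].append(value)
    return pages

link_column = [(0, 2, "Link 1"), (2, 2, "Link 2"), (1, 1, None), (1, 2, "Link 3")]
print(assemble_pages(link_column))   # [['Link 1', 'Link 2'], [], ['Link 3']]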

Projection of the complete data set on just Repository.Name is:

Repository {
    Name: Name1
}

Corresponding repetition level and definition level table is

Repetition Level   Definition Level   Value
       0                  0           Name 1

While reconstructing, the above can be used to create the projection back again:

Since Name is a required field (neither repeated nor optional), the repetition level is ignored, and using the definition level at the root, the reconstructed projection is:

Repository {
    Name: Name1
}

Similarly, the projection of Repository.Page.Link can be represented and reconstructed. Thus the columns can be stored independently of each other, which allows the data for each column to be stored contiguously, reaping the benefits of columnar storage.

Since the repetition level and the definition level are bounded above by the depth of the schema, they can be encoded very compactly in bits: 1 bit can encode levels up to 1, 2 bits up to level 3, 3 bits up to level 7, and so on.
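The required width is simply the ceiling of log2 of the number of possible levels, as this small sketch (illustrative only) shows:

import math

def bits_for_levels(max_level):
    # Bits needed to encode every level in the range 0..max_level.
    return max(1, math.ceil(math.log2(max_level + 1)))

print(bits_for_levels(1), bits_for_levels(3), bits_for_levels(7))   # 1 2 3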

This kind of representation also comes in handy when only the keys need to be retrieved, without the values, which can be far bigger. By looking at just the definition and repetition levels, all the non-null keys can be obtained.