Taha Azzaoui

Writing an Indexer

2018/08/01

Introduction

Now that we’ve crawled the web and obtained a set of documents, we probably don’t want to have to scan through them every time we run a query.

To avoid ripping our hair out every time we run a search, let’s extract some metadata to make queries faster.

Inverted Index

An inverted index is exactly what it sounds like: you take a regular old index, which maps keys to values, and do the opposite, mapping values to keys.

Basically, rather than mapping documents to the words they contain, we instead map words to the documents that contain them. This lets us find every document containing a given term with a single lookup, rather than scanning the whole collection.

Quick example

Consider the following set of documents.

  1. NASA found water on Mars
  2. Flowing water found on Mars
  3. Underground lake of water found on Mars

Plain old index:

  1 → NASA, found, water, on, Mars
  2 → Flowing, water, found, on, Mars
  3 → Underground, lake, of, water, found, on, Mars

Inverted index:

  NASA → 1
  Flowing → 2
  Underground → 3
  lake → 3
  of → 3
  found → 1, 2, 3
  water → 1, 2, 3
  on → 1, 2, 3
  Mars → 1, 2, 3

Now consider searching for the word "water". Using the plain old index, we would have to iterate over each document and then over the terms it contains, looking for a match (ew).

Using the inverted index, we simply look up the list associated with "water" and immediately see that it appears in documents 1, 2, and 3.
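
To make this concrete, here’s a quick sketch in Python (my own illustration, not the post’s code) that builds the inverted index for the documents above:

    from collections import defaultdict

    docs = {
        1: "NASA found water on Mars",
        2: "Flowing water found on Mars",
        3: "Underground lake of water found on Mars",
    }

    # Map each term to the set of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    print(sorted(index["water"]))   # [1, 2, 3]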

This is a classic space vs. time tradeoff. The inverted index takes more space (assuming there are more unique terms than documents), but lookup is pretty fast.

Building the Inverted Index

Before building the inverted index, we’ll have to do some cleaning up to make sure we’re not storing any garbage in our index.

Preprocessing

This part’s pretty boring, but here’s the gist (there’s a code sketch after the list):

  1. Strip out any unnecessary information. This includes any JavaScript, CSS, comments, etc.
  2. Construct a list of the unique tokens on the page.
  3. Convert every token to lowercase.
  4. Stem each token to reduce it to its root form. This lets us treat words sharing a stem as the same term: "walking", "walked", and "walks" all reduce to "walk".
  5. Remove any stop words from the list. These are words that carry little information about the content, such as "the", "is", "a", and "at".
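
Here’s a rough sketch of those steps in Python. Using BeautifulSoup to strip markup and NLTK’s PorterStemmer for stemming are my assumptions (any equivalent libraries would do), and the stop word list is truncated for illustration:

    import re
    from bs4 import BeautifulSoup, Comment   # assumed: pip install beautifulsoup4
    from nltk.stem import PorterStemmer      # assumed: pip install nltk

    # A tiny stop word list for illustration; a real list is much longer.
    STOP_WORDS = {"the", "is", "a", "at", "of", "on", "and"}
    stemmer = PorterStemmer()

    def preprocess(html):
        # 1. Strip scripts, styles, and comments, keeping only visible text.
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()
        for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
            comment.extract()
        text = soup.get_text(separator=" ")

        # 2-3. Tokenize and lowercase.
        tokens = re.findall(r"[a-z0-9]+", text.lower())

        # 4. Reduce each token to its stem.
        tokens = [stemmer.stem(t) for t in tokens]

        # 5. Drop stop words.
        return [t for t in tokens if t not in STOP_WORDS]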

Additionally, we should store the number of occurrences of a given token within each document. This will be of use in determining the tf-idf weight of a term during the ranking process (more on this later).
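
Counting occurrences is a one-liner with Python’s collections.Counter:

    from collections import Counter

    # tokens would come from preprocess() above.
    tokens = ["water", "found", "water", "flow"]
    counts = Counter(tokens)   # token -> occurrences in this document
    print(counts["water"])     # 2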

The resulting data structure for each token will be a list of pairs, where each pair represents a document in which the token appears: the first element is the number of times the token occurs in the document, and the second is the base16 encoding of the document's URL. To keep it simple, we will use Python's pickle library to write each list to a file named with the base16 encoding of the token.
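
A minimal version of that write path might look like the following. The directory layout and function name are made up for the sketch, so the real code likely differs:

    import pickle
    from collections import Counter
    from pathlib import Path

    INDEX_DIR = Path("index")   # hypothetical layout: one postings file per token

    def add_document(url, tokens):
        # For each token, append a (count, base16-encoded URL) pair to the
        # postings list stored under the token's base16 encoding.
        INDEX_DIR.mkdir(exist_ok=True)
        hex_url = url.encode("utf-8").hex()
        for token, count in Counter(tokens).items():
            path = INDEX_DIR / token.encode("utf-8").hex()
            postings = pickle.loads(path.read_bytes()) if path.exists() else []
            postings.append((count, hex_url))
            path.write_bytes(pickle.dumps(postings))

    add_document("http://example.com", ["water", "found", "water"])

Encoding both the token and the URL in base16 keeps the filenames safe no matter what characters appear in the original text.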

Code

The indexer's source code can be found here. Next, we’ll tackle ranking the results. This step is less straightforward than crawling and indexing, but I hear it’s the bread and butter of a multibillion-dollar company, so I guess it’s important.