It's tough to imagine the internet without search engines. We've grown accustomed to querying the massive set of data that is the web and receiving pages of relevant results in a fraction of a second. As users we expect our search engines to be fast, relevant, and easy to use. As such, search engines present a set of interesting (and mostly solved) challenges, from algorithmic design to resource management. Anna Patterson of Stanford does an excellent job describing a subset of them here.
My goal is to get an idea of how search engines generally work. I'll be documenting the interesting bits along the way.
These references cover some of the basics of information retrieval (IR) and are pretty useful for getting started:
- The Anatomy of a Large-Scale Hypertextual Web Search Engine [Sergey Brin & Lawrence Page]
- Introduction to Information Retrieval [Manning et al.]
- Particularly the chapter on Web Search Basics
The Components: A Quick Refresher
There are essentially four main phases in building a search engine. I'll describe each briefly below; we'll go into much more detail later on.
- Crawling
- This is the "data collection" phase. The crawling algorithm is fairly straightforward. We start off with a set of "seed" URLs to visit. As we visit each page, we extract the hyperlinks it contains and add them to the set of URLs we've yet to visit. We continue this procedure, visiting the remaining links until we've reached our desired depth.
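The procedure above is essentially a breadth-first traversal of the link graph. Here's a minimal sketch: the `get_links` callback stands in for "fetch the page and extract its hyperlinks" (a real crawler would download and parse the HTML), and the toy `web` dict is an invented stand-in for the actual web.

```python
from collections import deque

def crawl(seed_urls, get_links, max_depth):
    """Breadth-first crawl: visit seeds, then follow extracted links up to max_depth."""
    visited = set()
    frontier = deque((url, 0) for url in seed_urls)
    order = []
    while frontier:
        url, depth = frontier.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        order.append(url)
        # Queue this page's outgoing links at the next depth level
        for link in get_links(url):
            if link not in visited:
                frontier.append((link, depth + 1))
    return order

# Toy link graph standing in for the web
web = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": [],
    "d.com": ["e.com"],
}
print(crawl(["a.com"], lambda u: web.get(u, []), max_depth=1))
# ['a.com', 'b.com', 'c.com'] — d.com sits at depth 2, past our limit
```

A production crawler layers a lot on top of this (politeness delays, robots.txt, deduplication, a persistent frontier), but the core loop really is this simple.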
- Indexing
- This is where we store the data we've previously crawled in a way that makes it quick to retrieve. Lookup speed is of paramount concern, which means we'll need to choose our data structures wisely. We'll be implementing an inverted index for our indexer.
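An inverted index maps each term to the list of documents that contain it, so a query only touches the documents that could possibly match. A minimal sketch, with a naive whitespace tokenizer and an AND-style query (both simplifications; real indexers do stemming, stop-word handling, and store positions and frequencies):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """AND query: intersect the postings lists of every query term."""
    postings = [set(index.get(t, ())) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs are lazy",
}
index = build_inverted_index(docs)
print(search(index, "quick brown"))  # [1, 3]
print(search(index, "the lazy"))     # [2]
```

Intersecting sorted postings lists is exactly why lookups stay fast: the work scales with the length of the postings lists, not with the size of the whole collection.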
- Ranking
- This is arguably the most important part of the whole process: it's the secret sauce, what differentiates Google from Bing. Now that we can access the data we've retrieved, the question becomes how do we display it? Which result do we show first? How do we predict what the user wants to see? We'll be using a modified version of PageRank (Page as in Larry Page) as our ranking strategy.
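The intuition behind PageRank is that a page is important if important pages link to it. A minimal power-iteration sketch over a toy graph (the three-page graph is invented for illustration; our actual ranking will be a modified version of this):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively distribute each page's rank across its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with uniform rank
    for _ in range(iterations):
        # Every page gets a base share from the random-jump term
        new = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new[target] += share
            else:
                # Dangling page: spread its rank evenly across all pages
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "c" — it is linked by both "a" and "b"
```

The damping factor models a surfer who occasionally jumps to a random page instead of following links; it also guarantees the iteration converges.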
- Query Serving
- This is mostly a distributed systems problem: ensuring redundancy, balancing load properly, minimizing response times, partitioning files, and so on. However, since we're not writing the next big commercial search engine, I suspect this will be the area we focus on the least.
The codebase for this project can be found here. We'll be implementing the logic from scratch, but I'll rely on Python and its libraries to take care of the annoying stuff (e.g., parsing HTML).