The Internet as a Directed Graph
Before we dive into writing the crawler, we should develop a model for viewing the web. We can think of the internet as a directed graph where nodes correspond to web pages and an edge from node X to node Y corresponds to a hyperlink on page X linking to page Y. In this approach, one immediately notices certain structural patterns embedded into the state of the internet. For one, this graph is quite dense in some areas and quite sparse in others. Sites that dominate the internet (e.g. Google, Facebook, Twitter, Netflix, etc) will tend to have significantly more edges directed into them as compared to smaller sites such as this one.
The Crawling Algorithm
The crawler starts by obtaining a list of the most popular websites on the internet. For this, I used Amazon's list of the top 1 million sites found here. The crawler starts by dividing the list evenly amongst the threads to be spawned, at which point each thread loads its portion of the list into its own queue. Once the queue has been populated with these initial "seed" URLs, the crawler follows this simple recipe:
- Pop a URL out of the queue
- Send a request to the URL & retrieve its associated webpage
- Save the webpage to disk for future indexing
- Parse the webpage and extract the links it contains
- For each URL found, check if it has been seen before
- If not, add it to the queue
- Goto 1
I’ll be writing this in Python mainly because I’m lazy and don’t feel like parsing HTML in C++. Also, the crawler is I/O bound anyways since we’ll be spending most of the time waiting on network requests. We’ll obviously spawn a bunch of threads to avoid sitting idle.
Aside from dissecting links, the crawler should somehow keep track of its previously visited URLs to avoid parsing duplicate webpages. We’ll be using a Bloom filter for this.
Quick Detour: Why a bloom filter and not a plain old hash table?
Bloom filters let us test for set membership in constant time (like hash sets), however they take up much less space. This is useful since we’ll be crawling a ton of links.
Here’s a stackoverflow post that goes into more detail: https://stackoverflow.com/questions/4282375/what-is-the-advantage-to-using-bloom-filters
I set the crawler loose on a single node with 6 Cores, 8GB of RAM and 320GB of SSD. With this setup, I managed to retrieve north of 1.5 million web pages in about 48 hours (I’m sure my ISP is happy). Now that we have our data, we can begin writing the indexer.