Monday, March 5, 2012


Google has shut down Code Search and the Social Graph API. Replicating those services outside of Google is hard; they presuppose a scalable, well-oiled crawling/indexing/serving machine. Only Google has been able to release interesting n-gram data (from books too).

Along the same lines, the HTTP Archive tracks 55K URLs (with a goal of 1M by the end of 2012). W3Techs also only looks at the Alexa top 1 million. Presumably this is because crawling is hard. If Foursquare wanted to know how many pages have a link to a URL, would they want to crawl the whole web?

"Aleph" would be an infrastructure company that offers a few services:

  1. Large (billion and up) archive of sites on the internet. This would include not just raw HTML, but the full set of resources necessary to recreate that page (think crawling with headless WebKit and capturing all HTTP requests).
  2. Ability to do quick (semi-interactive) analyses over the crawl data that has been pre-processed (think regular expressions, or Dremel-like queries for attributes).
  3. Ability to run arbitrary MapReduces over the data (see ZeroVM for a way to do this safely).
  4. Ability to import new datasets (whether web-like or not).
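
To make services 2 and 3 a bit more concrete, here is a minimal single-process sketch of what a user-supplied job might look like, assuming the crawl is exposed as (URL, HTML) records. The names `run_mapreduce`, `map_links`, `reduce_count`, `grep_pages`, and the toy `CRAWL` corpus are all hypothetical; a real service would shard the data and sandbox the user code rather than run it in memory. The MapReduce answers the inbound-link question from earlier, and `grep_pages` stands in for the semi-interactive regular expression search.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in one page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def map_links(page_url, html):
    """Map step: emit (target_url, page_url) once per distinct outbound link."""
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links against the page URL and drop duplicates.
    targets = {urljoin(page_url, href) for href in parser.hrefs}
    for target in targets:
        yield target, page_url


def reduce_count(target, sources):
    """Reduce step: count the distinct pages that link to target."""
    return target, len(set(sources))


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory stand-in for a MapReduce framework (an assumption)."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in groups.items())


def grep_pages(records, pattern):
    """Return the URLs of pages whose raw HTML matches the regular expression."""
    regex = re.compile(pattern)
    return [url for url, html in records if regex.search(html)]


# Hypothetical three-page corpus standing in for the crawl archive.
CRAWL = [
    ("http://a.example/",
     '<a href="http://c.example/venue">v</a> <a href="/about">about</a>'),
    ("http://b.example/",
     '<a href="http://c.example/venue">v</a>'
     '<a href="http://c.example/venue">again</a>'),
    ("http://c.example/venue", "<p>no links</p>"),
]

inbound = run_mapreduce(CRAWL, map_links, reduce_count)
matches = grep_pages(CRAWL, r"no\s+links")
```

Counting distinct linking pages, rather than raw links, keeps one page that repeats a link from inflating the total.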

For the web index to be useful for some applications, it would need to have a PageRank-like attribute per page to expose importance (and/or a number of visits/traffic).
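
As a rough illustration of what a "PageRank-like attribute" could mean, the sketch below runs a plain power iteration over a link graph. It is the textbook formulation under assumed defaults (damping of 0.85, a fixed number of iterations), not a description of how any search engine actually computes importance; the `pagerank` function and the toy graph are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of linked pages."""
    # Include pages that only ever appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share up front.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, ())
            if targets:
                # Split this page's rank evenly across its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "a" and "b" link to "c", and "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this toy graph "c" ends up with the highest score because two pages link to it, and "b" with the lowest because nothing links to it.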

In theory, the Internet Archive has a lot of this data. However, they make their data available "at no cost to researchers, historians, and scholars," so their willingness to license it for other uses is unclear.

As for the name, the Aleph was a MacGuffin-like device from Mona Lisa Overdrive that stored a large copy of the Internet.

Update on 5/7/2012: Blekko appears to offer basic grepping functionality.

Update on 2/24/2013: CommonCrawl is a non-profit with 6 billion pages indexed and available for MapReduce.