Monday, March 12, 2012

Pretty-print inline diffs in emails

Rietveld (and presumably other code review tools) inlines diffs in the emails it sends. Even tools that don't generally include a URL that points at the review. A Gmail gadget could be written that inlines the diffs (if they aren't already inline) and pretty-prints them, with syntax highlighting.
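The core transformation such a gadget would do can be sketched in a few lines: classify each line of a unified diff and wrap it in HTML with a CSS class, which a stylesheet could then color (full syntax highlighting would layer a library like Pygments on top). Everything here is illustrative, not an actual gadget API:

```python
import html

def diff_to_html(diff_text):
    """Wrap each unified-diff line in a span keyed by its type."""
    classes = {"+": "add", "-": "del", "@": "hunk"}
    out = []
    for line in diff_text.splitlines():
        # The first character of a diff line tells us its role;
        # anything else is unchanged context.
        cls = classes.get(line[:1], "ctx")
        out.append('<span class="%s">%s</span>' % (cls, html.escape(line)))
    return "\n".join(out)

print(diff_to_html("@@ -1 +1 @@\n-old line\n+new line"))
```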

GitHubDiff does something like this for GitHub emails.

Wednesday, March 7, 2012


TodoMVC is a project that offers the same Todo application implemented, using MV* concepts, in most of today's popular JavaScript MV* frameworks.

The solutions look and feel the same and share a common, simple feature set, making it easy to compare the syntax and structure of the different frameworks and pick the one you're most comfortable with.

Monday, March 5, 2012


Google has shut down Code Search and the Social Graph API. Replicating those services outside of Google is hard; they presuppose a scalable, well-oiled crawling/indexing/serving machine. Similarly, only Google has been able to release interesting n-gram data (from books, too).

Along the same lines, the HTTP Archive tracks 55K URLs (with a goal of 1M by the end of 2012), and W3Techs only looks at the Alexa top 1 million. Presumably this is because crawling is hard. If Foursquare wanted to know how many pages link to a given URL, would they really want to crawl the whole web themselves?

"Aleph" would be an infrastructure company that offers a few services:

  1. Large (billion and up) archive of sites on the internet. This would include not just raw HTML, but the full set of resources necessary to recreate that page (think crawling with headless WebKit and capturing all HTTP requests).
  2. Ability to do quick (semi-interactive) analyses over the crawl data that has been pre-processed (think regular expressions, or Dremel-like queries for attributes).
  3. Ability to run arbitrary MapReduces over the data (see ZeroVM for a way to do this safely).
  4. Ability to import new datasets (whether web-like or not).
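Service (3) could answer exactly the Foursquare question above. As a minimal sketch (the record format is an invention; a real system would read WARC files from the crawl), the map phase emits a count for each page whose body links to the site, and the reduce phase sums per domain:

```python
import re
from collections import Counter

def map_page(record):
    """Map phase: emit (domain, 1) for pages linking to foursquare.com."""
    if re.search(r'href="https?://foursquare\.com', record["body"]):
        yield record["domain"], 1

def reduce_counts(pairs):
    """Reduce phase: sum the per-domain counts."""
    counts = Counter()
    for domain, n in pairs:
        counts[domain] += n
    return dict(counts)

# A toy two-page "crawl" standing in for billions of archived pages.
crawl = [
    {"domain": "a.com", "body": '<a href="https://foursquare.com/x">x</a>'},
    {"domain": "b.com", "body": "<p>no links here</p>"},
]
pairs = (pair for record in crawl for pair in map_page(record))
print(reduce_counts(pairs))  # → {'a.com': 1}
```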

For the web index to be useful for some applications, it would need to have a PageRank-like attribute per page to expose importance (and/or a number of visits/traffic).
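The importance attribute could be computed the classic way, with a PageRank power iteration over the link graph extracted from the crawl. A toy sketch (damping factor and iteration count are conventional choices, not anything Aleph-specific):

```python
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # Every page gets a baseline share, plus contributions from in-links.
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / len(pages)
        rank = new
    return rank

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
print(max(ranks, key=ranks.get))  # → a
```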

In theory the Internet Archive already has much of this data. They make it available "at no cost to researchers, historians, and scholars," but their willingness to license it for commercial use is unclear.

As for the name, the Aleph was a MacGuffin-like device in Mona Lisa Overdrive that stored a large copy of the Internet.

Update on 5/7/2012: Blekko appears to offer basic grepping functionality.

Update on 2/24/2013: CommonCrawl is a non-profit with 6 billion pages indexed and available for MapReduce.