Analyzing 0.5TB (one billion vertices) on Amazon’s EMR

Today I’m analyzing the properties of a 0.5TB dataset (a graph with a billion vertices) using Pig/Hadoop on Amazon’s Elastic MapReduce (EMR) service. I configured a cluster with the following nodes:

  • 1 MASTER node: c1.medium
  • 9 CORE nodes: c1.xlarge (High-CPU Instance)
  • 10 SPOT nodes: c1.xlarge (High-CPU Instance), bid price $0.20

This cluster processed the 2.5-billion-record data file (0.5TB) in about 15 (!) minutes. Very impressive!
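For context, the kind of Pig script behind a job like this looks roughly as follows. This is only a sketch under assumptions of my own: the S3 path, the tab-separated (src, dst) edge schema, and the goal of counting distinct vertices are all hypothetical, not the actual job I ran.

```
-- Hypothetical sketch: count distinct vertices in a tab-separated edge list on S3.
edges    = LOAD 's3://my-bucket/graph/edges' USING PigStorage('\t')
           AS (src:long, dst:long);

-- Project both endpoints into a single column of vertex ids.
srcs     = FOREACH edges GENERATE src AS vertex;
dsts     = FOREACH edges GENERATE dst AS vertex;
all_v    = UNION srcs, dsts;

-- Deduplicate and count; DISTINCT and the ALL-group both run as MapReduce jobs.
vertices = DISTINCT all_v;
grouped  = GROUP vertices ALL;
counted  = FOREACH grouped GENERATE COUNT(vertices) AS num_vertices;

DUMP counted;
```

On a cluster like the one above, Hadoop splits the input file across the CORE and SPOT task slots, so the LOAD/DISTINCT/COUNT pipeline parallelizes across all 19 worker instances.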