Cloud and heterogeneous computing solutions exist today for the emerging big data problems in biology
Nature Reviews Genetics, 2011•nature.com
Figure 1 | Applying a MapReduce approach in the cloud to solve embarrassingly parallelizable problems. To traverse a 1 petabyte (PB) data set, Trelles et al. mistakenly assume that the 1 PB data set needs to be traversed by every node. The ideal MapReduce application (depicted in the upper panel) instead distributes 1 terabyte (TB) to each of the 1,000 nodes for concurrent processing (the ‘map’ step in MapReduce). Furthermore, although Trelles et al. cite a paper that they claim indicates a 15 MB/s link between storage and nodes (ref. 6), the bandwidth quoted appears to be for a single input/output stream only. As shown in the lower panel, best practice is to launch multiple ‘mappers’ per node to saturate the available network bandwidth (ref. 7), which has previously been benchmarked at ~50 MB/s (ref. 8). This figure is threefold higher than the 15 MB/s claimed and is consistent with the 90+ MB/s virtual machine (VM)-to-VM bandwidth reported (ref. 6). Each node can process 1 TB at 50 MB/s at $0.34/h; therefore, the back-of-the-envelope calculations of Trelles et al. should be updated to state that 1,000 nodes could traverse 1 PB of data in ~350 minutes (not 750 days) at a cost of ~US $2,040 (not $6,000,000).
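The arithmetic behind the caption's revised estimate can be reproduced with a short sketch. It is an illustration only, not the authors' calculation: the function name `traversal_estimate` is hypothetical, decimal units (1 TB = 1,000,000 MB) are assumed, and billing is assumed to be rounded up to whole node-hours, which is one plausible way to arrive at the ~$2,040 figure.

```python
import math

# Assumption: decimal storage units (1 TB = 1,000,000 MB).
MB_PER_TB = 1_000_000


def traversal_estimate(total_tb, nodes, mb_per_s, usd_per_node_hour):
    """Estimate time and cost when the data set is split evenly across nodes.

    Hypothetical helper for illustration. Each node reads only its own
    share (the 'map' step), so wall-clock time depends on the per-node share.
    """
    tb_per_node = total_tb / nodes
    seconds = tb_per_node * MB_PER_TB / mb_per_s
    minutes = seconds / 60
    # Assumption: each node is billed for whole hours, rounded up.
    billed_hours = math.ceil(seconds / 3600)
    cost = nodes * billed_hours * usd_per_node_hour
    return minutes, billed_hours, cost


# Values taken from the caption: 1 PB (1,000 TB), 1,000 nodes,
# 50 MB/s per node, $0.34 per node-hour.
minutes, billed_hours, cost = traversal_estimate(
    total_tb=1000, nodes=1000, mb_per_s=50, usd_per_node_hour=0.34
)
print(f"{minutes:.0f} minutes, {billed_hours} billed hours per node, ${cost:,.0f}")
```

Under these assumptions each node needs about 333 minutes to read its 1 TB share, which is the same order as the ~350 minutes quoted, and six billed hours on 1,000 nodes comes to about $2,040.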