Addressing the challenges of executing a massive computational cluster in the cloud
2018 18th IEEE/ACM International Symposium on Cluster, Cloud and …, 2018•ieeexplore.ieee.org
A major limitation for time-to-science can be the lack of available computing resources. Depending on resource capacity, executing an application suite with hundreds of thousands of jobs can take weeks when resources are in high demand. We describe how we dynamically provision a large-scale high-performance computing cluster of more than one million cores using Amazon Web Services (AWS). We discuss the trade-offs, challenges, and solutions associated with creating such a large-scale cluster from commercial cloud resources. We use this cluster to study a parameter sweep workflow composed of message-passing parallel topic modeling jobs on multiple datasets. At peak, we achieve a simultaneous core count of 1,119,196 vCPUs across nearly 50,000 instances, and execute almost half a million jobs within two hours using AWS Spot Instances in a single AWS region. Our solutions to these challenges and trade-offs apply broadly to the lifecycle management of similar clusters on other commercial clouds.