Abstract
As High Performance Computing and Big Data analytics become more commonplace, we see researchers applying these tools in new areas. Indeed, in the past few years, we've seen the use of HPC in areas as diverse as archeology, public policy, and digital humanities. So it comes as no surprise that many life science researchers are now approaching us to use large-scale computation and data analytics on their sensitive data sets, such as de-identified patient or genomics data, for the purposes of scientific inquiry. At UC Berkeley, this has become a pressing issue: existing faculty need a place to do research on sensitive data, and we know of at least one instance where the lack of such a place affected the campus's ability to recruit a new faculty member. We had a clear imperative for action!
An informal survey showed us that most other institutions built a new dedicated system to support their sensitive data research, including identified and HIPAA data. This paper is a case study of how we met this need in a different way. Working with our institution's cybersecurity team, we used a methodology for applying our campus cybersecurity framework to convert our traditional production HPC cluster, which serves over 2000 users across 100 research groups, and our virtual machine service offering so that both also support this type of research. Our efforts show that this approach is not only possible but also a practical alternative to building a new environment.
The field of information security focuses on defense-in-depth and as yet offers no turnkey solutions that would prevent security incidents and data breaches. As a result, most university and research lab information security groups focus on preparing to detect a breach after the fact and on limiting its scope of impact. Together, these realities mean that to do secure computing on a high performance computing (HPC) cluster or on virtual machines in the cloud, one must implement technical security controls and write a host of process and audit documentation, work that is both labor-intensive and ongoing.
The paper describes our work at UC Berkeley to take an existing HPC cluster, with a base level of data security controls and procedures in place, and reconfigure it to meet more stringent university and federal requirements, while maintaining the same computing experience and functionality for users who are not computing over sensitive data. In other words, this is a study in configuring a hybrid HPC system for computing over non-sensitive and sensitive data alike, and in developing the policies and procedures needed to meet our information security requirements. It describes in detail the technical work, as well as the educational and partner-building work, that we did at our institution to make this effort a success.