RISECamp was held at UC Berkeley on September 7th and 8th. This post looks behind the scenes at the technical infrastructure used to provide a cloud-hosted cluster for each attendee with ready-to-use Jupyter notebooks requiring only a web browser to access.
Background and Requirements
RISECamp is the latest in a series of workshops held by RISELab (and its predecessor, AMPLab) showcasing the latest research from the lab. The sessions consist of talks on the latest research systems produced by the lab followed by tutorials and exercises for attendees to get hands-on practical experience using our latest technologies.
In the past, attendees used their own laptops for the hands-on exercises, with each user setting up a local development environment and manually installing the software from USB flash drives that we provided. This required attendees to have a laptop capable of running macOS or Linux and to have substantial proficiency with Unix programming environments. For the typical attendee this was time-consuming and a distraction from the actual content of the exercises. For some attendees, it made the tutorials impossible to complete (for example, some got stuck trying to run a Linux virtual machine on a locked-down corporate laptop running Windows, without the superuser access needed to enable virtualization support).
To address these issues and allow attendees to focus on the actual tutorials instead of system administration, we opted for a cloud-hosted Jupyter notebook based approach. The notebook environment only requires that attendees have a functioning web browser, and provides a convenient way to interleave the instructional content of the tutorial with editable code and dynamically generated graphics and plots.
The system also needed to be easy to develop and manage, both technically and organizationally. Tutorial authors should be able to integrate their exercises easily and test them locally on their laptops, and any last-minute bugs should be easy to fix and deploy in the production environment.
Architecture
Our implementation strategy consisted of three main components:
- Tutorials created as Jupyter notebooks, with instructional text, executable code, plots, and graphics, all requiring only a web browser to access.
- A Docker container with the tutorial notebooks and requisite software pre-installed.
- An orchestration application handling cluster provisioning and management, authentication, and client multiplexing.
The Container
As defined by Docker:
A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings. Available for both Linux and Windows based apps, containerized software will always run the same, regardless of the environment. Containers isolate software from its surroundings, for example differences between development and staging environments and help reduce conflicts between teams running different software on the same infrastructure.
The use of a Docker container greatly simplified our development and integration of the exercises by providing a consistent and reproducible environment regardless of the specifics of the developer’s environment, while also providing confidence that the behavior would be identical in the production environment. This saved substantial development time by avoiding the issues that arise from differences in operating systems and development environments.
The source materials used for building the container are available on GitHub at ucbrise/risecamp.
The Orchestrator
The orchestrator is the last piece of infrastructure and ties everything together. Its responsibilities are to:
- Provision computing clusters on Amazon Elastic Compute Cloud (EC2)
- Manage user accounts and logins
- Route connections from a user’s browser to their corresponding compute cluster
- Deploy patches and updates
To accomplish these objectives, we created a small web application written in Node.js running on a master node referred to as the “mux”. The mux launches and manages a fleet of compute clusters, each of which is referred to as a “ship”, short for “container ship”.
Cluster boot-up tends to be one of the most failure-prone parts of the process, so we established an out-of-band control channel on a private subnet for ships to report their state transitions: pending, booting, available, running, or terminated. In particular, by having ships report to the mux when they enter the “available” state, we ensure that we only ever assign a healthy cluster to a user.
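The state reporting described above can be sketched as a small state tracker. This is only an illustration in Python, not the mux's actual Node.js code: the five state names come from the post, but the `Fleet` class, the ship identifiers, and the table of allowed transitions are assumptions made for the example.

```python
from enum import Enum


class ShipState(Enum):
    PENDING = "pending"
    BOOTING = "booting"
    AVAILABLE = "available"
    RUNNING = "running"
    TERMINATED = "terminated"


# Assumed transition table: the post names the states but does not spell out
# which transitions are legal, so this ordering is a guess for illustration.
ALLOWED = {
    ShipState.PENDING: {ShipState.BOOTING, ShipState.TERMINATED},
    ShipState.BOOTING: {ShipState.AVAILABLE, ShipState.TERMINATED},
    ShipState.AVAILABLE: {ShipState.RUNNING, ShipState.TERMINATED},
    ShipState.RUNNING: {ShipState.TERMINATED},
    ShipState.TERMINATED: set(),
}


class Fleet:
    """Hypothetical record of the last state each ship reported."""

    def __init__(self):
        self.states = {}

    def launch(self, ship_id):
        # A newly launched ship starts out as pending.
        self.states[ship_id] = ShipState.PENDING

    def report(self, ship_id, new_state):
        # Called when a ship reports a transition over the control channel.
        current = self.states[ship_id]
        if new_state not in ALLOWED[current]:
            raise ValueError(
                f"{ship_id}: {current.value} -> {new_state.value} is not allowed"
            )
        self.states[ship_id] = new_state

    def available_ships(self):
        # Only ships that explicitly reported "available" may be handed out.
        return sorted(
            ship_id
            for ship_id, state in self.states.items()
            if state is ShipState.AVAILABLE
        )


fleet = Fleet()
fleet.launch("ship-1")
fleet.launch("ship-2")
fleet.report("ship-1", ShipState.BOOTING)
fleet.report("ship-1", ShipState.AVAILABLE)
fleet.report("ship-2", ShipState.BOOTING)  # still booting, so never assigned
print(fleet.available_ships())  # ['ship-1']
```

A ship that stalls during boot never reports "available", so it is never offered to a user.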
When enough ships have booted and become available, the mux becomes ready to accept user logins over HTTP. As users log in through the web interface, each is allocated to a particular ship, and the user’s connection is forwarded to that ship via an HTTP reverse proxy.
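As a rough sketch of the allocation step, the bookkeeping behind the proxy might look like the following. It is written in Python for illustration and is not the mux's real code: the `Mux` class, the `min_ready` threshold, the first-come allocation order, and the example addresses are all assumptions, and the reverse proxy itself is left out.

```python
class Mux:
    """Hypothetical login bookkeeping: one ship per user."""

    def __init__(self, min_ready):
        self.min_ready = min_ready  # assumed threshold for "enough ships"
        self.free_ships = []        # addresses of available, unassigned ships
        self.assignments = {}       # username -> ship address

    def ship_available(self, address):
        # Record a ship that has reported itself as available.
        self.free_ships.append(address)

    def accepting_logins(self):
        # Ready once enough ships have come up, counting those already in use.
        return len(self.free_ships) + len(self.assignments) >= self.min_ready

    def login(self, username):
        # A returning user keeps the ship they were given the first time.
        if username in self.assignments:
            return self.assignments[username]
        if not self.accepting_logins():
            raise RuntimeError("not enough ships are available yet")
        if not self.free_ships:
            raise RuntimeError("no free ship left")
        address = self.free_ships.pop(0)
        self.assignments[username] = address
        return address

    def upstream_for(self, username):
        # The reverse proxy would forward this user's requests here.
        return self.assignments[username]


mux = Mux(min_ready=2)
mux.ship_available("10.0.0.11:8888")  # made-up private addresses
mux.ship_available("10.0.0.12:8888")
print(mux.login("alice"))         # 10.0.0.11:8888
print(mux.login("bob"))           # 10.0.0.12:8888
print(mux.upstream_for("alice"))  # 10.0.0.11:8888
```

Keeping the username-to-ship mapping stable means a user who reloads the page lands back on the same cluster.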
Furthermore, the container-based implementation allowed us to easily redeploy the image running on each cluster when bugs were found and fixed. The orchestrator simply instructed each ship to pull the latest Docker image and restart the container.
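A minimal sketch of such a redeploy, in Python for illustration, is shown below. The post does not say how the mux delivers commands to a ship, so `run_on_ship` is a hypothetical callback, and the image and container names are placeholders. The sketch also assumes that "restart" means recreating the container, since an existing container keeps running the image it was created from.

```python
def redeploy_commands(image, container_name):
    """Docker CLI commands that replace a container with one from a fresh image."""
    return [
        ["docker", "pull", image],
        # Remove the old container; it would otherwise keep using the old image.
        ["docker", "rm", "--force", container_name],
        ["docker", "run", "--detach", "--name", container_name, image],
    ]


def redeploy_fleet(ship_ids, run_on_ship, image, container_name):
    """Run the redeploy on every ship and return the ships where a step failed.

    run_on_ship(ship_id, command) is a hypothetical transport that executes
    the command on the given ship and returns its exit code.
    """
    failed = []
    for ship_id in ship_ids:
        for command in redeploy_commands(image, container_name):
            if run_on_ship(ship_id, command) != 0:
                failed.append(ship_id)
                break  # skip the remaining steps on this ship
    return failed


def fake_runner(ship_id, command):
    # Stand-in transport for the example: pretend the pull fails on ship-2.
    if ship_id == "ship-2" and command[1] == "pull":
        return 1
    return 0


failed = redeploy_fleet(
    ["ship-1", "ship-2"], fake_runner, "example/tutorial:latest", "tutorial"
)
print(failed)  # ['ship-2']
```

Collecting the failed ships, rather than stopping at the first error, lets the operator retry only the clusters that need it.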
Future Work
We plan to continue this approach in our future workshops, with the following changes to further improve both the developer and attendee experience:
- Create a separate overlay filesystem to persist each user’s state. The current implementation uses the same filesystem for both the instructional content and the user’s saved files. As a result, any files created or edited by the user are lost when the container is redeployed. We took care to only perform redeploys in between tutorial sessions, but it would be a better user experience to persist user state while still allowing us to update the content.
- Use a separate Docker container per project. We spent a lot of time integrating all of the tutorials into a single Docker container with a consistent set of compatible dependencies. This process was error-prone and required much revision, yet it still resulted in a bug in production due to a subtle interplay between the initialization scripts of different projects. (In particular, a nonzero exit code from an initialization script, which normally would have halted the container’s boot process, was not caught.)
- Test scaling to full production capacity! We had taken care to have our Amazon EC2 service limits increased, and we had done testing at a scale of about 25% of our target capacity. However, we didn’t perform a test at full target capacity, and ran into an unexpected scaling limit with the EBS volumes. Luckily, our AWS Services Technical Contact happened to be in attendance at RISECamp, and he was able to help us expedite and escalate the service limit increase support ticket. (Thanks Kevin!)
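The initialization-script bug mentioned in the second item comes down to an exit code that nobody checked. The post does not describe how the scripts were chained, so the following Python sketch only illustrates the fail-fast behavior one would want: `run_init_scripts` and `InitScriptError` are hypothetical names, and the two commands are stand-ins for real per-project scripts.

```python
import subprocess
import sys


class InitScriptError(RuntimeError):
    """Raised when an initialization step exits with a nonzero code."""

    def __init__(self, command, returncode):
        super().__init__(f"init step {command!r} exited with code {returncode}")
        self.command = command
        self.returncode = returncode


def run_init_scripts(commands):
    """Run each command in order and stop at the first nonzero exit code.

    Returns the number of commands that completed successfully.
    """
    completed = 0
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            # Halt the boot here instead of silently moving on to later steps.
            raise InitScriptError(command, result.returncode)
        completed += 1
    return completed


# Stand-in scripts: one that succeeds and one that exits with code 3.
ok = [sys.executable, "-c", "pass"]
bad = [sys.executable, "-c", "import sys; sys.exit(3)"]

try:
    run_init_scripts([ok, bad, ok])
except InitScriptError as error:
    print(f"boot halted: {error}")
```

With one container per project, each project's scripts would run in isolation, so a failure like this could not be masked by another project's startup logic.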
Conclusion
Overall, this approach was a massive success and led to a RISECamp experience with minimal setup and system administration overhead for attendees. We had a remarkably small number of technical problems, and were able to preemptively deploy hotfixes in most cases where we did encounter issues.