Serverless Scientific Computing

Eric Jonas blog, Projects, Systems

For many scientific and engineering users, cloud infrastructure remains challenging to use. While many of their use cases are embarrassingly parallel, the challenges involved in provisioning and using stateful cloud services keep them trapped on their laptops or large shared workstations.

Before getting started, a new cloud user confronts a bewildering number of choices. First, what instance type do they need ? How do they make the compute/memory tradeoff? How large do they want their cluster to be? Can they take advantage of dynamic market-based instances (spot instances) that can disappear at any time? What if they have 1000 small jobs, each of which takes a few minutes — what’s the most cost-effective way of allocating servers? What host operating system and toolchain should they use? And finally, they need to remember to turn off all of this when their job completes.

For many of these users, existing infrastructure is just too hard. Working with our colleagues at the Berkeley Center for Computational Imaging, we started developing PyWren to see if we could address their use cases. PyWren is designed to support a simple map and reduce-like interface, from python, on top of Amazon Web Service’s new Lambda service. While Lambda was originally designed for microservice-like workloads, and has some limitations, it’s a great fit for many of our computational compute tasks: Large numbers of short ( < 5min) compute jobs that we want to execute massively in parallel.

PyWren started out as an experiment in trying to scale Lambda and run real computational work, and has since grown into a real research project on serverless computing.We’ve been able to get up to 40 peak TFLOPS using PyWren on AWS Lambda and over 60 GB/sec read/write performance to S3. Recently, we wrote up a technical report detailing the API and a performance evaluation of our prototype:

Occupy the Cloud: Distributed Computing for the 99%

Eric Jonas, Shivaram Venkataraman, Ion Stoica, Benjamin Recht

Distributed computing remains inaccessible to a large number of users, in spite of many open source platforms and extensive commercial offerings. While distributed computation frameworks have moved beyond a simple map-reduce model, many users are still left to struggle with complex cluster management and configuration tools, even for running simple embarrassingly parallel jobs. We argue that stateless functions represent a viable platform for these users, eliminating cluster management overhead, fulfilling the promise of elasticity. Furthermore, using our prototype implementation, PyWren, we show that this model is general enough to implement a number of distributed computing models, such as BSP, efficiently. Extrapolating from recent trends in network bandwidth and the advent of disaggregated storage, we suggest that stateless functions are a natural fit for data processing in future computing environments.

Today we’re releasing a new version of PyWren, v0.1, with a new project website,  and experimental support for both reduce and stand-alone mode, both of which we alluded to in our paper. To try out PyWren, you can follow the getting started guide. We have exciting visions for making this platform more stateless, with lower latency, and increasing the types of workloads we can handle, so stay tuned!

–Eric Jonas