This year, Spark Summit East was held in Boston on February 7-9. With over 1,500 attendees, this was the largest Spark Summit ever held outside the Bay Area. Apache Spark, developed in large part at AMPLab (the precursor of RISELab), is now the de facto standard for big data processing. As at previous Spark Summits, UC Berkeley had a very strong presence.
Ion Stoica gave a keynote on RISELab, describing the lab’s research focus on a long-standing grand challenge in computing: enabling machines to act autonomously and intelligently, rapidly and repeatedly taking appropriate actions based on information from the world around them. The presentation also covered early results from two recent projects, Drizzle and Opaque, each of which had its own presentation at the summit.
Shivaram Venkataraman gave a talk on Drizzle, a low-latency execution engine for Apache Spark targeting stream processing and iterative workloads. Drizzle removes one of the key performance bottlenecks in Apache Spark: task scheduling. To do so, Drizzle introduces group scheduling, where multiple batches (a group) of tasks are scheduled at once. This amortizes the costs of task scheduling, serialization, and launch. As a result, Drizzle improves the latency of Spark Streaming by 10x, bringing it on par with specialized streaming frameworks such as Apache Flink.
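To make the amortization concrete, here is a toy cost model (illustrative numbers only, not Drizzle's actual scheduler): with classic per-batch scheduling, every micro-batch pays the full scheduling overhead, while group scheduling pays that overhead once per group.

```scala
// Toy cost model, not Drizzle's scheduler: scheduling a group of `groupSize`
// micro-batches at once amortizes the (hypothetical) per-schedule overhead
// across all batches in the group.
object GroupSchedulingSketch {
  val scheduleOverheadMs = 100.0 // hypothetical cost to schedule one batch of tasks
  val computeMs          = 20.0  // hypothetical per-batch processing time

  def avgBatchLatencyMs(groupSize: Int): Double =
    scheduleOverheadMs / groupSize + computeMs

  def main(args: Array[String]): Unit = {
    for (g <- Seq(1, 10, 100))
      println(f"group size $g%3d -> avg batch latency ${avgBatchLatencyMs(g)}%.1f ms")
  }
}
```

Under these made-up numbers, a group of 100 batches shrinks the per-batch scheduling cost by two orders of magnitude, which is the effect Drizzle exploits.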
The talk on Opaque, a new system for securely processing Spark SQL workloads, was given by Wenting Zheng. When it comes to security, the state of the art in big data processing frameworks such as Apache Spark and Apache Hadoop is to encrypt data at rest and in transit. Unfortunately, to process this data, these solutions store it unencrypted in memory, exposing it to an attacker who has compromised the operating system or hypervisor. Opaque defends against these attacks by leveraging Intel’s SGX trusted hardware to perform computation on encrypted data, thereby protecting against malicious OSes or hypervisors. Furthermore, Opaque provides an “oblivious” mode that protects even against access-pattern leakage, where an attacker can extract information simply by observing the access patterns of a query over the network or to memory, even when the traffic itself is encrypted.
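To see why the oblivious mode matters, here is a small self-contained sketch (not Opaque's implementation; the names and the `decrypt` stand-in are invented for illustration) of how a filter over encrypted rows can still leak which rows matched the predicate.

```scala
// Illustration of access-pattern leakage, not Opaque's code.
object AccessPatternSketch {
  case class EncryptedRow(id: Int, ciphertext: Array[Byte])

  // Stand-in for decryption inside the SGX enclave; an outside observer never
  // sees plaintexts, only which rows are read and written.
  def decrypt(c: Array[Byte]): Int = c.map(_.toInt).sum

  def main(args: Array[String]): Unit = {
    val rows = (0 until 8).map(i => EncryptedRow(i, Array(i.toByte)))

    // Non-oblivious filter: only matching rows flow to the output, so the
    // observed row ids reveal exactly which rows satisfied the predicate,
    // even though every value stays encrypted.
    val leaked = rows.filter(r => decrypt(r.ciphertext) > 4).map(_.id)
    println(s"observer learns the matching rows: $leaked")

    // Oblivious-style execution touches every row and produces output whose
    // shape is independent of the data (one flag per row, which Opaque would
    // keep encrypted), so the access pattern reveals nothing.
    val flags = rows.map(r => decrypt(r.ciphertext) > 4)
    println(s"observer only sees a full scan of ${flags.size} rows")
  }
}
```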
In addition, there were four other RISELab talks at Spark Summit, covering four different projects: Clipper, Tegra, Ernest, and ADAM.
Anand Iyer gave a talk on Tegra, a time-evolving graph system built on top of Apache Spark. Today, more and more services and applications (e.g., social networks) generate underlying graphs whose structure evolves over time. Mining these time-evolving graphs can be insightful from both research and business perspectives. Tegra addresses this challenge by extending existing graph-processing systems to support streaming computation.
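As a rough illustration of the setting (the representation below is invented for this sketch, not Tegra's API), a time-evolving graph can be thought of as a sequence of snapshots over which an analysis is run as the graph changes; a system like Tegra aims to support such analyses efficiently rather than treating every snapshot as an independent graph.

```scala
// Minimal sketch of analyzing a time-evolving graph as snapshots.
object EvolvingGraphSketch {
  type Edge = (Int, Int)

  // Vertex degree for one snapshot, treating edges as undirected.
  def degrees(edges: Seq[Edge]): Map[Int, Int] =
    edges.flatMap { case (u, v) => Seq(u, v) }
      .groupBy(identity)
      .map { case (v, xs) => v -> xs.size }

  def main(args: Array[String]): Unit = {
    val t0 = Seq((1, 2), (2, 3)) // initial snapshot
    val t1 = t0 :+ ((3, 4))      // the next snapshot adds one edge
    Seq(t0, t1).zipWithIndex.foreach { case (snap, t) =>
      println(s"snapshot t$t degrees: ${degrees(snap)}")
    }
  }
}
```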
Dan Crankshaw presented Clipper, a low-latency prediction serving system. This work fills a major gap in the ML pipeline: deploying and serving a model once it has been trained. To this end, Clipper introduces a modular architecture that simplifies model deployment across ML frameworks such as Spark’s MLlib, TensorFlow, and scikit-learn. By leveraging techniques including caching, batching, and adaptive model selection, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness, all without modifying the underlying machine learning frameworks.
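To give a flavor of adaptive model selection, the sketch below uses a toy epsilon-greedy policy (a stand-in for Clipper's bandit-based selection, not its actual code): queries are routed across several deployed models, and traffic shifts toward whichever model performs best on observed feedback.

```scala
import scala.util.Random

// Toy epsilon-greedy model selection; accuracies are simulated, whereas a
// serving system would observe them as per-query feedback.
object AdaptiveSelectionSketch {
  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    val trueAccuracy = Vector(0.70, 0.85, 0.60)      // hypothetical model qualities
    val hits  = Array.fill(trueAccuracy.size)(0.0)
    val tries = Array.fill(trueAccuracy.size)(1e-9)  // avoid division by zero
    val epsilon = 0.1

    for (_ <- 1 to 10000) {
      val m =
        if (rng.nextDouble() < epsilon) rng.nextInt(trueAccuracy.size) // explore
        else trueAccuracy.indices.maxBy(i => hits(i) / tries(i))       // exploit
      tries(m) += 1
      if (rng.nextDouble() < trueAccuracy(m)) hits(m) += 1             // simulated feedback
    }
    println(s"queries routed per model: ${tries.map(_.toInt).mkString(", ")}")
  }
}
```

Running this, nearly all traffic ends up at the most accurate model while the occasional exploration keeps the choice responsive if model quality drifts.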
Shivaram Venkataraman also presented Ernest, a tool that accurately predicts the running time of a job on a given cluster configuration. In particular, Ernest uses experimental design to reduce the overhead of building a performance model for a given job. Given this model, the developer can then configure an optimally sized cluster, either minimizing the running time of the job subject to cost constraints or minimizing the cost subject to a job deadline.
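The sketch below shows how such a model can drive cluster sizing. The model form (a serial term, a term that shrinks with more machines, a log term for tree-style aggregation, and a linear communication term) follows the Ernest paper; the coefficients, price, and deadline here are made up, since Ernest would fit the coefficients from a handful of small training runs.

```scala
// Ernest-style prediction driving cluster sizing; all numbers hypothetical.
object ErnestSketch {
  // Hypothetical fitted coefficients (Ernest learns these from training runs).
  val (t0, t1, t2, t3) = (10.0, 400.0, 2.0, 0.05)

  def predictedTimeSec(machines: Int): Double =
    t0 + t1 / machines + t2 * math.log(machines) + t3 * machines

  def main(args: Array[String]): Unit = {
    val costPerMachineSec = 0.001
    val candidates = Seq(2, 4, 8, 16, 32, 64, 128)
    // Pick the cheapest configuration that still meets a 60-second deadline.
    val feasible = candidates.filter(m => predictedTimeSec(m) <= 60.0)
    val best = feasible.minBy(m => predictedTimeSec(m) * m * costPerMachineSec)
    println(f"choose $best machines: ${predictedTimeSec(best)}%.1f s predicted")
  }
}
```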
Finally, Frank Nothaft presented ADAM, a library for genomic analysis built on top of Apache Spark that targets the detection and analysis of rare genomic events. This is a non-trivial challenge, as it requires sophisticated analysis across large cohorts comprising terabytes to petabytes of genomic data. Existing genomic analysis tools fall short because they were not designed to handle such scale. ADAM addresses this challenge by leveraging Apache Spark to seamlessly distribute genomic analyses across large clusters. In addition, ADAM provides a simple yet powerful API for writing parallel genomic analysis algorithms.
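As a flavor of that API, here is a minimal sketch (assuming the ADAM library is on the classpath; the input path and quality threshold are hypothetical) that loads aligned reads as a distributed dataset and filters them in parallel across the cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.bdgenomics.adam.rdd.ADAMContext._ // adds ADAM load methods to SparkContext

// Minimal ADAM usage sketch; the path below is a placeholder.
object AdamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("adam-sketch")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)

    // loadAlignments reads BAM/SAM (or ADAM's Parquet format) into a
    // distributed dataset of alignment records.
    val reads = sc.loadAlignments("hdfs:///genomics/sample.bam")

    // Keep confidently mapped reads (mapping quality above 30); this filter
    // runs in parallel across the cluster rather than on a single machine.
    val highQuality = reads.rdd.filter(r => r.getMapq != null && r.getMapq > 30)
    println(s"high-quality reads: ${highQuality.count()}")
    sc.stop()
  }
}
```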
Many of these projects are on their way to becoming part of Apache Spark’s core (e.g., Drizzle), libraries and packages on top of Apache Spark (e.g., ADAM, Ernest, Tegra), or standalone projects (e.g., Clipper). As RISELab takes off, we look forward to continuing to contribute to the Apache Spark ecosystem, as well as to building new systems that fulfill the promise of supporting a new generation of applications that continuously interact with their environment, intelligently and securely.