# Publications

## ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning

We introduce AdaHessian, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the Hessian. Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and ADAM. The main disadvantage of traditional second order methods is their heavier per-iteration computation and poor accuracy as compared to first order methods. To address these, we incorporate several novel approaches in AdaHessian, including: (i) a new variance reduction estimate of the Hessian diagonal with low computational overhead; (ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across different iterations; and (iii) a block diagonal averaging to more...

Conference/Journal:
Publication Date:
Download:
Author(s): Zhewei Yao, Amir Gholaminejad, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael Mahoney

## PyHessian: Neural networks through the lens of the Hessian

We present PYHESSIAN, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. PYHESSIAN enables fast computations of the top Hessian eigenvalues, the Hessian trace, and the full Hessian eigenvalue/spectral density, and it supports distributed-memory execution on cloud/supercomputer systems and is available as open source. This general framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. To illustrate this, we analyze the effect of residual connections and Batch Normalization layers on the trainability of neural networks. One recent claim, based on simpler first-order analysis, is that residual connections and Batch Normalization make the more...

Conference/Journal:
Publication Date:
Download:
Author(s): Zhewei Yao, Amir Gholaminejad, Kurt Keutzer, Michael Mahoney

## Hessian-based Analysis of Large Batch Training and Robustness to Adversaries

Large batch size training of Neural Networks has been shown to incur accuracy loss when trained with the current methods. The exact underlying reasons for this are still not completely understood. Here, we study large batch size training through the lens of the Hessian operator and robust optimization. In particular, we perform a Hessian based study to analyze exactly how the landscape of the loss function changes when training with large batch size. We compute the true Hes- sian spectrum, without approximation, by back-propagating the second derivative. Extensive experiments on multiple networks show that saddle-points are not the cause for generalization gap of large batch size training, and the results consistently show that large batch converges to points with noticeably more...

Conference/Journal: NeurIPS
Publication Date:
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2020/10/2018_neurips_hessian.pdf
Author(s): Zhewei Yao, Amir Gholaminejad, Michael Mahoney

## Robust Class Parallelism – Error Resilient Parallel Inference with Low Communication Cost

Model parallelism is a standard paradigm to decouple a deep neural network (DNN) into sub-nets when the model is large. Recent advances in class parallelism significantly reduce the communication overhead of model parallelism to a single floating-point number per iteration. However, traditional fault-tolerance schemes, when applied to class parallelism, require storing the entire model on the hard disk. Thus, these schemes are not suitable for soft and frequent system noise such as stragglers(temporarily slow worker machines). In this paper, we propose an erasure-coding based redundant computing technique called robust class parallelism to improve the error resilience of model parallelism. We show that by introducing slight overhead in the computation at each machine, we can obtain robustness to soft system noise more...

Conference/Journal: 54th Asilomar Conference on Signals, Systems and Computers (Asilomar 2020)
Publication Date: 11/01/2020
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2020/08/Robust_sensAI_Asilomar_2020.pdf
Author(s): Yaoqing Yang, Jichan Chung, Guanhua Wang, Vipul Gupta, Adarsh Karnati, Kenan Jiang, Ion Stoica, Joseph Gonzalez, Kannan Ramchandran

## An Off-Chip Attack on Hardware Enclaves via the Memory Bus

This paper shows how an attacker can break the confidentiality of a hardware enclave with Membuster, an off-chip attack based on snooping the memory bus. An attacker with physical access can observe an unencrypted address bus and extract fine-grained memory access patterns of the victim. Membuster is qualitatively different from prior on-chip attacks to enclaves and is more difficult to thwart. We highlight several challenges for Membuster. First, DRAM requests are only visible on the memory bus at last-level cache misses. Second, the attack needs to incur minimal interference or overhead to the victim to prevent the detection of the attack. Lastly, the attacker needs to reverse-engineer the translation between virtual, physical, and DRAM addresses to perform a robust attack. We introduce more...

Conference/Journal: 29th USENIX Security Symposium (USENIX Security 20)
Publication Date: 08/12/2020
Download:
Author(s): Dayeol Lee, Dongha Jung, Ian Fang, Chia-Che Tsai, Raluca Ada Popa

## FirePerf: FPGA-Accelerated Full-System Hardware/Software Performance Profiling and Co-Design

Conference/Journal: Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2020)
Publication Date: 03/16/2020
Download:
Author(s): Sagar Karandikar, Albert Ou, Alon Amid, Howard Mao, Randy Katz, Borivoje Nikolic, Krste Asanovic

## sensAI: Fast ConvNets Serving on Live Data via Class Parallelism

Convolutional Neural Networks (ConvNets) enable computers to excel on vision learning tasks such as image classification, object detection. Recently, faster inference on live data is becoming more and more important. From a system perspective, it means faster inference on each single, incoming data item (e.g. 1 image). Two main-stream distributed model serving methods – data parallelism and model parallelism – are not desirable here, because we cannot further split a single input data piece via data parallelism and model parallelism introduces huge communication overhead. To achieve low-latency, live data inference, we propose sensAI, a novel and generic approach that decouples a CNN model into disconnected subnets, each is responsible for predicting certain class(es). We call this new model distribution class more...

Conference/Journal: Workshop on MLOps Systems in MLSys 2020
Publication Date: 03/04/2020
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2020/01/sensAI_2_pager.pdf
Author(s): Guanhua Wang, Zhuang Liu, Siyuan Zhuang, Brandon Hsieh, Joseph Gonzalez, Ion Stoica

## Blink: Fast and Generic Collectives for Distributed ML

Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever increasing hardware heterogeneity. To address this, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for faster data transfers. Evaluations show that compared to the state-of-the-art (NCCL), Blink can achieve up to 8x faster model synchronization, and reduce end-to-end training time for image classification tasks by up to 40%.

Conference/Journal: Third Conference on Machine Learning and Systems (MLSys 2020)
Publication Date: 03/02/2020
Download: https://proceedings.mlsys.org/paper/2020/file/43ec517d68b6edd3015b3edc9a11367b-Paper.pdf
Author(s): Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Jorgen Thelin, Nikhil Devanur, Ion Stoica

## Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization

We formalize the problem of trading-off DNN training time and memory requirements as the tensor rematerialization optimization problem, a generalization of prior checkpointing strategies. We introduce Checkmate, a system that solves for optimal rematerialization schedules in reasonable times (under an hour) using off-the-shelf MILP solvers or near-optimal schedules with an approximation algorithm, then uses these schedules to accelerate millions of training iterations. Our method scales to complex, realistic architectures and is hardware-aware through the use of accelerator-specific, profile-based cost models. In addition to reducing training cost, Checkmate enables real-world networks to be trained with up to 5.1x larger input sizes.

Conference/Journal: n Proceedings of 3rd Conference Machine Learning and Systems 2020 (MLSys 2020)
Publication Date: 03/02/2020
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2020/04/1910.02653.pdf
Author(s): Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholaminejad, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez

## AutoPhase: Juggling HLS Phase Orderings in Random Forests with Deep Reinforcement Learning

The performance of the code a compiler generates depends on the order in which it applies the optimization passes. Choosing a good order–often referred to as the {\em phase-ordering} problem–is an NP-hard problem. As a result, existing solutions rely on a variety of heuristics. In this paper, we evaluate a new technique to address the phase-ordering problem: deep reinforcement learning. To this end, we implement a framework that takes a program and finds a sequence of passes that optimize the performance of the generated circuit. Without loss of generality, we instantiate this framework in the context of an LLVM compiler and target high-level synthesis programs. We use random forests to quantify the correlation between the effectiveness of a given pass more...

Conference/Journal: Proceedings of Machine Learning and Systems 2020 (MLSys 2020)
Publication Date: 03/02/2020
Download:
Author(s): Ameer Haj Ali, Qijing Huang, William Moses, John Xiang, Krste Asanovic, John Wawrzynek, Ion Stoica

## NeuroVectorizer: end-to-end vectorization with deep reinforcement learning

Conference/Journal: CGO '20: 18th ACM/IEEE International Symposium on Code Generation and Optimization
Publication Date: 02/26/2020
Download:
Author(s): Ameer Haj Ali, Nesreen K. Ahmed, Ted Willke, Sagar Karandikar, Krste Asanovic, Ion Stoica

## Delphi: A Cryptographic Inference Service for Neural Networks

Many companies provide neural network prediction services to users for a wide range of applications. However, current prediction systems compromise one party’s privacy: either the user has to send sensitive inputs to the service provider for classification, or the service provider must store its proprietary neural networks on the user’s device. The former harms the personal privacy of the user, while the latter reveals the service provider’s proprietary model. We design, implement, and evaluate Delphi, a secure prediction system that allows two parties to execute neural network inference without revealing either party’s data. Delphi approaches the problem by simultaneously co-designing cryptography and machine learning. We first design a hybrid cryptographic protocol that improves upon the communication and computation costs over more...

Conference/Journal: USENIX Security 2020
Publication Date: 08/12/2020
Download:
Author(s): Pratyush Mishra, Ryan Lehmkuhl, Akshayaram Srinivasan, Wenting Zheng, Raluca Ada Popa

## IEEE Micro Top Picks 2018: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud

Conference/Journal: IEEE Micro, vol. 39, no. 3, pp. 56-65, (Micro Top Picks 2018 Issue)
Publication Date: 04/11/2019
Download:
Author(s): Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, Krste Asanovic

## Autoscaling Tiered Cloud Storage in Anna

In this paper, we describe how we extended a distributed key-value store called Anna into an autoscaling, multi-tier service for the cloud. In its extended form, Anna is designed to overcome the narrow cost-performance limitations typical of current cloud storage systems. We describe three key aspects of Anna’s new design: multi-master selective replication of hot keys, a vertical tiering of storage layers with different cost-performance tradeoffs, and horizontal elasticity of each tier to add and remove nodes in response to load dynamics. Anna’s policy engine uses these mechanisms to balance service-level objectives around cost, latency and fault tolerance. Experimental results explore the behavior of Anna’s mechanisms and policy, exhibiting orders of magnitude efficiency improvements over both commodity cloud KVS services and research systems

Conference/Journal: VLDB 19
Publication Date: 08/26/2019
Download:
Author(s): Chenggang Wu, Vikram Sreekanti, Joe Hellerstein

## WAVE: A Decentralized Authorization Framework with Transitive Delegation

Most deployed authorization systems rely on a central trusted service whose compromise can lead to the breach of millions of user accounts and permissions. We present WAVE, an authorization framework offering decentralized trust: no central services can modify or see permissions and any participant can delegate a portion of their permissions autonomously. To achieve this goal, WAVE adopts an expressive authorization model, enforces it cryptographically, protects permissions via a novel encryption protocol while enabling discovery of permissions, and stores them in an untrusted scalable storage solution. WAVE provides competitive performance to traditional authorization systems relying on central trust. It is an open-source artifact and has been used for two years for controlling 800 IoT devices.

Conference/Journal: 28th USENIX Security Symposium (Security 2019)
Publication Date: 08/14/2019
Download:
Author(s): Michael Andersen, Sam Kumar, Moustafa AbdelBaky, Gabe Fierro, Jack Kolb, Hyung-Sin Kim, David Culler, Raluca Ada Popa

## JEDI: Many-to-Many End-to-End Encryption and Key Delegation for IoT

As the Internet of Things (IoT) emerges over the next decade, developing secure communication for IoT devices is of paramount importance. Achieving end-to-end encryption for large-scale IoT systems, like smart buildings or smart cities, is challenging because multiple principals typically interact indirectly via intermediaries, meaning that the recipient of a message is not known in advance. This paper proposes JEDI (Joining Encryption and Delegation for IoT), a many-to-many end-to-end encryption protocol for IoT. JEDI encrypts and signs messages end-to-end, while conforming to the decoupled communication model typical of IoT systems. JEDI’s keys support expiry and fine-grained access to data, common in IoT. Furthermore, JEDI allows principals to delegate their keys, restricted in expiry or scope, to other principals, thereby granting more...

Conference/Journal: 28th USENIX Security Symposium
Publication Date: 08/14/2019
Download:
Author(s): Sam Kumar, Yuncong Hu, Michael Andersen, Raluca Ada Popa, David Culler

## Thread/OpenThread: A Compromise in Low-Power Wireless Multihop Network Architecture for the Internet of Things

Extending an Internet subnet by connecting resource-constrained nodes (e.g., embedded sensors and actuators) over multiple wireless hops is necessary to support the future Internet of Things (IoT). RPL, the IPv6 routing standard for low-power and lossy networks, tried to achieve this goal but has not seen wide adoption in practice. As an alternative, Thread is a recently standardized low-power network protocol for IoT, driven by the Thread group, an industry consortium led by Google/Nest. We provide a comparative analysis of the technical aspects of RPL and Thread based on their specifications, explaining why using Thread, as opposed to RPL, may make sense for the future Internet. Specifically, the fundamental differences between RPL and Thread are their respective scopes and multihop more...

Conference/Journal: IEEE Communications Magazine (Future Internet: Architectures and Protocols)
Publication Date: 07/18/2019
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2019/06/KIM_LAYOUT.pdf
Author(s): Hyung-Sin Kim, Sam Kumar, David Culler

## Accel: A Corrective Fusion Network for Efficient Semantic Segmentation on Video

We present Accel, a novel corrective fusion network that combines (1) optical flow-based feature warping with (2) lightweight, per-frame, temporal correction to achieve state-of-the-art accuracy and throughput on video semantic segmentation.

Conference/Journal: CVPR '19 (Oral)
Publication Date: 06/16/2019
Download:
Author(s): Samvit Jain, Xin Wang, Joseph Gonzalez

## IEEE Micro Top Picks: A Hardware Accelerator for Tracing Garbage Collection

A large number of workloads are written in garbage-collected languages. These applications spend up to 10-35% of their CPU cycles on GC, and these numbers increase further for pause-free concurrent collectors. As this amounts to a significant fraction of resources in scenarios ranging from data centers to mobile devices, reducing the cost of GC would improve the efficiency of a wide range of workloads. We propose to decrease these overheads by moving GC into a small hardware accelerator that is located close to the memory controller and performs GC more efficiently than a CPU. We first show a general design of such a GC accelerator and describe how it can be integrated into both stop-the-world and pause-free garbage collectors. We more...

Conference/Journal: IEEE Micro, May/June 2019, Top Picks from the Computer Architecture Conferences of 2018 (to appear)
Publication Date:
Download:
Author(s): Martin Maas, Krste Asanovic, John Kubiatowicz

## ALICE: Autonomous Link-based Cell Scheduling for TSCH

Although low-power lossy network (LLN), at its early stage, commonly used asynchronous link layer protocols for simple operation on resource-constrained nodes, development of embedded hardware and time synchronization technologies made Time-Slotted Channel Hopping (TSCH) viable in LLN (now part of IEEE 802.15.4e standard). TSCH has the potential to be a link layer solution for LLN due to its resilience to wireless interference (e.g., WiFi) and multipath fading. However, its slotted operation incurs non-trivial cell scheduling overhead: two nodes should wake up at a time-frequency cell together to exchange a packet. Efficient cell scheduling in dynamic multihop topology in wireless environments has been an open issue, preventing TSCH’s wide adoption in practice. This work introduces ALICE, a novel autonomous link-based cell more...

Conference/Journal: IPSN 2019: The 18th ACM/IEEE Conference on Information Processing in Sensor Networks
Publication Date: 04/01/2019
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2019/02/ipsn19-final96.pdf
Author(s): Seohyang Kim, Hyung-Sin Kim, Chongkwon Kim

## A Hardware Accelerator for Tracing Garbage Collection

A large number of workloads are written in garbage-collected languages. These applications spend up to 10-35% of their CPU cycles on GC, and these numbers increase further for pause-free concurrent collectors. As this amounts to a significant fraction of resources in scenarios ranging from data centers to mobile devices, reducing the cost of GC would improve the efficiency of a wide range of workloads. We propose to decrease these overheads by moving GC into a small hardware accelerator that is located close to the memory controller and performs GC more efficiently than a CPU. We first show a general design of such a GC accelerator and describe how it can be integrated into both stop-the-world and pause-free garbage collectors. We more...

Conference/Journal: 45th International Symposium on Computer Architecture (ISCA'18)
Publication Date: 06/04/2018
Download: https://people.eecs.berkeley.edu/~maas/papers/maas-isca18-hwgc.pdf
Author(s): Martin Maas, Krste Asanovic, John Kubiatowicz

## Scaling Video Analytics to Large Camera Deployments

We discuss the potential of spatio-temporal correlations — content correlations between geographically proximate cameras in wide-area enterprise camera deployments — to improve cost efficiency and inference accuracy in large-scale video analytics operations. Our template application is real-time person re-identification and tracking.

Conference/Journal: ACM Workshop on Mobile Computing Systems and Applications (HotMobile '19)
Publication Date: 02/27/2019
Download:
Author(s): Samvit Jain, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, Joseph Gonzalez

## Confluo: Distributed Monitoring and Diagnosis Stack for High-speed Networks

Confluo is an end-host stack that can be integrated with existing network management tools to enable monitoring and diagnosis of network-wide events using telemetry data distributed across end-hosts, even for high-speed networks. Confluo achieves these properties using a new data structure — Atomic MultiLog — that supports highly-concurrent read-write operations by exploiting two properties specific to telemetry data: (1) once processed by the stack, the data is neither updated nor deleted; and (2) each field in the data has a fixed pre-defined size. Our evaluation results show that, for packet sizes 128B or larger, Confluo executes thousands of triggers and tens of filters at line rate (for 10Gbps links) using a single core.

Conference/Journal: USENIX Symposium on Networked Systems Design and Implementation (NSDI'19)
Publication Date: 02/26/2019
Download:
Author(s): Anurag Khandelwal, Rachit Agarwal, Ion Stoica

## Shuffling, Fast and Slow: Scalable Analytics on Serverless Infrastructure

Serverless computing is poised to fulfill the long-held promise of transparent elasticity and millisecond-level pricing. To achieve this goal, service providers impose a fine-grained computational model where every function has a maximum duration, a fixed amount of memory and no persistent local storage. We observe that the fine-grained elasticity of serverless is key to achieve high utilization for general computations such as analytics workloads, but that resource limits make it challenging to implement such applications as they need to move large amounts of data between functions that don’t overlap in time. In this paper, we present Locus, a serverless analytics system that judiciously combines (1) cheap but slow storage with (2) fast but expensive storage, to achieve good performance while more...

Conference/Journal: USENIX Symposium on Networked Systems Design and Implementation (NSDI'19)
Publication Date: 02/26/2019
Download:
Author(s): Qifan Pu, Shivaram Venkataraman, Ion Stoica

## Interactive Checks for Coordination Avoidance

Strongly consistent distributed systems are easy to reason about but face fundamental limitations in availability and performance. Weakly consistent systems can be implemented with very high performance but place a burden on the application developer to reason about complex interleavings of execution. Invariant confluence provides a formal framework for understanding when we can get the best of both worlds. An invariant confluent object can be efficiently replicated with no coordination needed to preserve its invariants. However, actually determining whether or not an object is invariant confluent is challenging. In this paper, we establish conditions under which a commonly used sufficient condition for invariant confluence is both necessary and sufficient, and we use this condition to design (a) a general-purpose interactive more...

Conference/Journal: VLDB
Publication Date:
Download:
Author(s): Michael Whittaker, Joe Hellerstein

## Helen: Maliciously Secure Coopetitive Learning for Linear Models

Many organizations wish to collaboratively train machine learning models on their combined datasets for a common benefit (e.g., better medical research, or fraud detection). However, they often cannot share their plaintext datasets due to privacy concerns and/or business competition. In this paper, we design and build Helen, a system that allows multiple parties to train a linear model without revealing their data, a setting we call coopetitive learning. Compared to prior secure training systems, Helen protects against a much stronger adversary who is malicious and can compromise m − 1 out of m parties. Our evaluation shows that Helen can achieve up to five orders of magnitude of performance improvement when compared to training using an existing state-of-the-art secure multi-party more...

Conference/Journal: IEEE S&P 2019
Publication Date:
Download:
Author(s): Wenting Zheng, Raluca Ada Popa, Joseph Gonzalez, Ion Stoica

## Serverless Computing: One Step Forward, Two Steps Back

Serverless computing offers the potential to program the cloud in an autoscaling, pay-as-you go manner. In this paper we address critical gaps in first-generation serverless computing, which place its autoscaling potential at odds with dominant trends in modern computing: notably data-centric and distributed computing, but also open source and custom hardware. Put together, these gaps make current serverless offerings a bad fit for cloud innovation and particularly bad for data systems innovation. In addition to pinpointing some of the main shortfalls of current serverless ar- chitectures, we raise a set of challenges we believe must be met to unlock the radical potential that the cloud—with its exabytes of storage and millions of cores—should offer to innovative developers.

Conference/Journal: CIDR '19
Publication Date: 01/15/2019
Download:
Author(s): Joe Hellerstein, Jose M. Faleiro, Joseph Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, Chenggang Wu

## FPGA Accelerated INDEL Realignment in the Cloud

Abstract: The amount of data being generated in genomics is predicted to be between 2 and 40 exabytes per year for the next decade, making genomic analysis the new frontier and the new challenge for precision medicine. This paper explores targeted deployment of hardware accelerators in the cloud to improve the runtime and throughput of immensescale genomic data analyses. In particular, INDEL (INsertion/DELetion) realignment is a critical operation that enables diagnostic testings of cancer through error correction prior to variant calling. It is the slowest part of the somatic (cancer) genomic analysis pipeline, the alignment refinement pipeline, and represents roughly one-third of the execution time of time-sensitive diagnostics for acute cancer patients. To accelerate genomic analysis, this paper describes a more...

Conference/Journal: IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2019
Publication Date: 02/19/2019
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2019/03/2019-hpca-ir-preprint.pdf
Author(s): Lisa Wu, David Bruns-Smith, Frank Nothaft, Qijing Huang, Sagar Karandikar, Johnny Le, Andrew Lin, Howard Mao, Brendan Sweeney, Krste Asanovic, David Patterson, Anthony Joseph

## The Case for GPU Multitenancy: The OoO VLIW JIT Compiler for GPU Inference

Conference/Journal:
Publication Date: 01/31/2019
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2020/04/1901.10008.pdf
Author(s): Paras Jain, Simon Mo, Ajay Jain, Alexey Tumanov, Joseph Gonzalez, Ion Stoica

## Using Multitask Learning to Improve 12-Lead Electrocardiogram Classification

NIPS Machine Learning4Health 2018

Conference/Journal: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018
Publication Date: 12/04/2018
Download:
Author(s): J Weston Hughes, Taylor Sittler, Anthony Joseph, Jeffrey E Olgin, Joseph Gonzalez, Geoff Tison

## Dynamic Space-Time Scheduling for GPU Inference

Conference/Journal: NeurIPS 2018
Publication Date: 12/01/2018
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2020/04/102CameraReadySubmissionGPU_Virtualization-8.pdf
Author(s): Paras Jain, Simon Mo, Ajay Jain, Harikaran Subbaraj, Rehan Sohail Durrani, Alexey Tumanov, Joseph Gonzalez, Ion Stoica

## SkipNet: Learning Dynamic Routing in Convolutional Networks

We introduce SkipNet, a modified residual network, that uses a gating network to selectively skip convolutional blocks based on the activations of the previous layer.

Conference/Journal: European Conference on Computer Vision (ECCV) 2018
Publication Date: 11/26/2018
Download:
Author(s): Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, Joseph Gonzalez

## Mitigating the Latency-Accuracy Trade-off in Mobile Data Analytics Systems

An increasing amount of mobile analytics is performed on data that is procured in a real-time fashion to make real-time decisions. Such tasks include simple reporting on streams to sophisticated model building. However, the practicality of these analyses are impeded in several domains because they are faced with a fundamental trade-off between data collection latency and analysis accuracy. In this paper, we first study this trade-off in the context of a specific domain, Cellular Radio Access Networks (RAN). We find that the trade-off can be resolved using two broad, general techniques: intelligent data grouping and task formulations that leverage domain characteristics. Based on this, we present CellScope, a system that applies a domain specific formulation and application of Multi-task Learning more...

Conference/Journal: 24th Annual International Conference on Mobile Computing and Networking (MobiCom)
Publication Date: 10/29/2018
Download: https://dl.acm.org/citation.cfm?id=3241581
Author(s): Anand Padmanabha Iyer, Li Erran Li, Mosharaf Chowdhury, Ion Stoica

## MARVEL: Enabling Mobile Augmented Reality with Low Energy and Low Latency

This paper presents MARVEL, a mobile augmented reality (MAR) system which provides a notation display service with imperceptible latency (<100 ms) and low energy consumption on regular mobile devices. In contrast to conventional MAR systems, which recognize objects using image-based computations performed in the cloud, MARVEL mainly utilizes a mobile device’s local inertial sensors for recognizing and tracking multiple objects, while computing local optical flow and offloading images only when necessary. We propose a system architecture which uses local inertial tracking, local optical flow, and visual tracking in the cloud synergistically. On top of that, we investigate how to minimize the overhead for image computation and offloading. We have implemented and deployed a holistic prototype system in a commercial building more...

Conference/Journal: The 16th ACM Conferenceon Embedded Networked Sensor Systems (SenSys ’18)
Publication Date: 11/04/2018
Download: http://bets.cs.berkeley.edu/publications/2018sensys_marvel.pdf
Author(s): Kaifei Chen, Tong Li, Hyung-Sin Kim, David Culler, Randy Katz

The emergence of low-power 32-bit Systems-on-Chip (SoCs), which integrate a 32-bit MCU, radio, and flash, presents an opportunity to re-examine design points and trade-offs at all levels of the system architecture of networked sensors. To this end, we develop a post-SoC/32-bit design point called Hamilton, showing that using integrated components enables a ∼$7 core and shifts hardware modularity to design time. We study the interaction between hardware and embedded OSes, identifying that (1) post-SoC motes provide lower idle current (5.9 µA) than traditional 16-bit motes, (2) 32-bit MCUs are a major energy consumer (e.g., tick increases idle current >50 times), comparable to radios, and (3) thread-based concurrency is viable, requiring only 8.3 µs of context switch time. We design a more... Conference/Journal: The 16th ACM Conference on Embedded Networked Sensor Systems (SenSys ’18) Publication Date: 11/04/2018 Download: http://bets.cs.berkeley.edu/publications/2018sensys_hamilton.pdf Author(s): Hyung-Sin Kim, Michael Andersen, Kaifei Chen, Sam Kumar, William J. Zhao, Kevin Ma, David Culler ## Benchmarks for reinforcement learning in mixed-autonomy traffic We release new benchmarks in the use of deep reinforcement learning (RL) to create controllers for mixed-autonomy traffic, where connected and autonomous vehicles (CAVs) interact with human drivers and infrastructure. Benchmarks, such as Mujoco or the Arcade Learning Environment, have spurred new research by enabling researchers to effectively compare their results so that they can focus on algorithmic improvements and control techniques rather than system design. To promote similar advances in traffic control via RL, we propose four benchmarks, based on three new traffic scenarios, illustrating distinct reinforcement learning problems with applications to mixed-autonomy traffic. We provide an introduction to each control problem, an overview of their MDP structures, and preliminary performance results from commonly used RL algorithms. For the purpose more... Conference/Journal: 2nd Conference on Robot Learning Publication Date: 10/31/2018 Download: https://rise.cs.berkeley.edu/wp-content/uploads/2018/11/Benchmarks-for-reinforcement-learning-in-mixed-autonomy-traffic.pdf Author(s): Eugene Vinitsky, Aboudy Kriedieh, Luc Le Flem, Nishant Kheterpal, Kathy Jang, Cathy Wu, Richard Liaw, Eric Liang, Alexandre Bayen ## Debugging Distributed Systems with Why-Across-Time Provenance Systematically reasoning about the fine-grained causes of events in a real-world distributed system is challenging. Causality, from the distributed systems literature, can be used to compute the causal history of an arbitrary event in a distributed system, but the event’s causal history is an overapproximation of the true causes. Data provenance, from the database literature, precisely describes why a particular tuple appears in the output of a relational query, but data provenance is limited to the domain of static relational databases. In this paper, we present wat-provenance: a novel form of provenance that provides the benefits of causality and data provenance. Given an arbitrary state machine, watprovenance describes why the state machine produces a particular output when given a particular more... Conference/Journal: SOCC Publication Date: 10/11/2018 Download: Author(s): Michael Whittaker, Cristina Teodoropol, Peter Alvaro, Joe Hellerstein ## ASAP: Fast, Approximate Graph Pattern Mining at Scale While there has been a tremendous interest in processing data that has an underlying graph structure, existing distributed graph processing systems take several minutes or even hours to mine simple patterns on graphs. This paper presents ASAP, a fast, approximate computation engine for graph pattern mining. ASAP leverages state-of-the-art results in graph approximation theory, and extends it to general graph patterns in distributed settings. To enable the users to navigate the trade-off between the result accuracy and latency, we propose a novel approach to build the Error-Latency Profile (ELP) for a given computation. We have implemented ASAP on a general-purpose distributed dataflow platform, and evaluated it extensively on several graph patterns. Our experimental results show that ASAP outperforms existing exact more... Conference/Journal: USENIX Symposium on Operating Systems Design and Implementation (OSDI) Publication Date: 10/10/2018 Download: Author(s): Anand Padmanabha Iyer, Zaoxing Liu, Xin Jin, Shivaram Venkataraman, Vladimir Braverman, Ion Stoica ## Ray: A Distributed Framework for Emerging AI Applications The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray—a distributed system to address them. Ray implements a dynamic task graph computation model that supports both the task-parallel and the actor programming models. To meet the performance requirements of AI applications, we propose an architecture that logically centralizes the system’s control state using a sharded storage system and a novel bottom-up distributed scheduler. In our experiments, we demonstrate sub-millisecond remote task latencies and linear throughput scaling beyond 1.8 million tasks per second. We empirically validate that Ray speeds more... Conference/Journal: USENIX Symposium on Operating Systems Design and Implementation (OSDI) Publication Date: 10/10/2018 Download: Author(s): Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael Jordan, Ion Stoica ## FairFuzz: A Targeted Mutation Strategy for Increasing Greybox Fuzz Testing Coverage In recent years, fuzz testing has proven itself to be one of the most effective techniques for finding correctness bugs and security vulnerabilities in practice. One particular fuzz testing tool, American Fuzzy Lop (AFL), has become popular thanks to its ease-of-use and bug-finding power. However, AFL remains limited in the bugs it can find since it simply does not cover large regions of code. If it does not cover parts of the code, it will not find bugs there. We propose a two-pronged approach to increase the coverage achieved by AFL. First, the approach automatically identifies branches exercised by few AFL-produced inputs (rare branches), which often guard code that is empirically hard to cover by naively mutating inputs. The second more... Conference/Journal: 33rd ACM/IEEE International Conference on Automated Software Engineering Publication Date: 09/03/2018 Download: Author(s): Caroline Lemieux, Koushik Sen ## Context: The Missing Piece in the Machine Learning Lifecycle Machine learning models have become ubiquitous in modern applications. The ML Lifecycle describes a three-phase process used by data scientists and data engineers to develop, train, and serve models. Unfortunately, context around the data, code, people, and systems involved in these pipelines is not captured today. In this paper, we first discuss common pitfalls that missing context creates. Some examples where context is missing include tracking the relationships between code and data and capturing experimental processes over time. We then discuss techniques to address these challenges and briefly mention future work around designing and implementing systems in this space. Conference/Journal: Workshop on Common Model Infrastructure, KDD 2018 Publication Date: 08/19/2018 Download: https://rise.cs.berkeley.edu/wp-content/uploads/2019/02/Flor_CMI_18_CameraReady.pdf Author(s): Rolando Garcia, Vikram Sreekanti, Neeraja Yadwadkar, Dan Crankshaw, Joseph Gonzalez, Joe Hellerstein ## e-mission: An Open-Source, Smartphone Platform for Collecting Human Travel Data GPS-equipped smartphones provide new methods to collect data about travel behavior, including travel survey apps that incorporate automated location sensing. Previous approaches to this have involved proprietary or one-off tools that are inconsistent and difficult to evaluate. In contrast, e-mission is an open-source, extensible software platform that consists of (a) an app for survey participants to install on their Android or iOS smartphones and (b) cloud-hosted software for managing the collected data. e-mission collects continuous location data, user-initiated annotations, and responses to contextual, platform initiated survey questions. New studies can be set up using the existing University of California, Berkeley, infrastructure with no additional coding, or the platform can be extended for more complex projects. This paper reviews the requirements more... Conference/Journal: Transportation Research Record: Journal of the Transportation Research Board Publication Date: 08/19/2018 Download: https://rise.cs.berkeley.edu/wp-content/uploads/2018/10/emission_mobilityscope.pdf Author(s): K. Shankari, Mohamed Amine Bouzaghrane, Samuel M. Maurer, Paul Waddell, David Culler, Randy Katz ## 3Sigma: distribution-based cluster scheduling for runtime uncertainty The 3Sigma cluster scheduling system uses job runtime histories in a new way. Knowing how long each job will execute enables a scheduler to more effectively pack jobs with diverse time concerns (e.g., deadline vs. the-sooner-the-better) and placement preferences on heterogeneous cluster resources. But, existing schedulers use single-point estimates (e.g., mean or median of a relevant subset of historical runtimes), and we show that they are fragile in the face of real-world estimate error profiles. In particular, analysis of job traces from three different large-scale cluster environments shows that, while the runtimes of many jobs can be predicted well, even state-of-the-art predictors have wide error profiles with 8–23% of predictions off by a factor of two or more. Instead of more... Conference/Journal: ACM European Conference on Computer Systems (EuroSys'2018) Publication Date: 04/23/2018 Download: http://www.pdl.cmu.edu/PDL-FTP/CloudComputing/eurosys18-final49.pdf Author(s): Jun Woo Park, Alexey Tumanov, Angela Jiang, Michael A. Kozuch, Gregory R. Ganger ## Parametrized Hierarchical Procedures for Neural Programming Parametrized Hierarchical Procedures (PHP) represent a program as a hierarchy of procedures that call each other, each implemented by a neural network. We develop an algorithm for training PHPs from a set of supervisor demonstrations, only few of which are annotated with the internal call structure. Conference/Journal: International Conferences on Learning Representations (ICLR) 2018 Publication Date: 04/30/2018 Download: Author(s): Roy Fox, Eui Chul (Richard) Shin, Sanjay Krishnan, Kenneth Goldberg, Dawn song, Ion Stoica ## Towards Specification-Directed Program Repair Several recent papers have developed neural network program synthesizers by using supervised learning over large sets of randomly generated programs and specifications. In this paper, we investigate the feasibility of this approach for program repair: given a specification and a candidate program assumed similar to a correct program for the specification, synthesize a program which meets the specification. Working in the Karel domain with a dataset of synthetically generated candidates, we develop models that can make effective use of the extra information in candidate programs, achieving 40% error reduction compared to a baseline program synthesis model that only receives the specification and not a candidate program. Conference/Journal: International Conferences on Learning Representations (ICLR) 2018, workshop track Publication Date: 04/30/2018 Download: Author(s): Eui Chul (Richard) Shin, Illia Polosukhin, Dawn song ## Tree-to-tree Neural Networks for Program Translation Program translation is an important tool to migrate legacy code in one language into an ecosystem built in a different language. In this work, we are the first to consider employing deep neural networks toward tackling this problem. We observe that program translation is a modular procedure, in which a sub-tree of the source tree is translated into the corresponding target sub-tree at each step. To capture this intuition, we design a tree-to-tree neural network as an encoder-decoder architecture to translate a source tree into a target one. Meanwhile, we develop an attention mechanism for the tree-to-tree model, so that when the decoder expands one non-terminal in the target tree, the attention mechanism locates the corresponding sub-tree in the source more... Conference/Journal: International Conferences on Learning Representations (ICLR) 2018, workshop track Publication Date: 04/30/2018 Download: Author(s): Xinyun Chen, Chang Liu, Dawn song ## Towards Synthesizing Complex Programs From Input-Output Examples In recent years, deep learning techniques have been developed to improve the performance of program synthesis from input-output examples. Albeit its significant progress, the programs that can be synthesized by state-of-the-art approaches are still simple in terms of their complexity. In this work, we move a significant step forward along this direction by proposing a new class of challenging tasks in the domain of program synthesis from input-output examples: learning a context-free parser from pairs of input programs and their parse trees. We show that this class of tasks are much more challenging than previously studied tasks, and the test accuracy of existing approaches is almost 0%. We tackle the challenges by developing three novel techniques inspired by three novel more... Conference/Journal: International Conferences on Learning Representations (ICLR) 2018 Publication Date: 04/30/2018 Download: Author(s): Xinyun Chen, Chang Liu, Dawn song ## Do Not Lose Bandwidth: Adaptive Transmission Power and Multihop Topology Control We show that a multihop wireless network can achieve better bandwidth and routing stability when transmission power and routing topology are jointly and adaptively controlled. Our experiments show that the predominant ‘fixed and uniform’ transmission power strategy with ‘link quality and hop distance’-based routing topology construction loses significant bandwidth due to hidden terminal and load imbalance problems. We design an adaptive and distributed control mechanism for transmission power and routing topology, PC-RPL, within the standard RPL routing protocol. We implement PC-RPL on real embedded devices and evaluate its performance on a 49-node multihop testbed. PC-RPL reduces total end-to-end packet losses ~7-fold without increasing hop distance compared to RPL with the highest transmission power at heavy load, resulting in 17% improvement in aggregate bandwidth and 64% for the worst-case node. Conference/Journal: 13th International Conference on Distributed Computing in Sensor Systems (DCOSS 2017) Publication Date: 06/07/2017 Download: http://bets.cs.berkeley.edu/publications/2017dcoss_pcrpl.pdf Author(s): Hyung-Sin Kim, Jeongyeup Paek, David Culler, Saewoong Bahk ## Challenging the IPv6 Routing Protocol for Low-Power and Lossy Networks (RPL): A Survey RPL is the IPv6 routing protocol for low-power and lossy networks, standardized by IETF in 2012 as RFC6550. Specifically, RPL is designed to be a simple and inter-operable networking protocol for resource-constrained devices in industrial, home, and urban environments, intended to support the vision of the Internet of Things with thousands of devices interconnected through multihop mesh networks. More than four-years have passed since the standardization of RPL, and we believe that it is time to examine and understand its current state. In this paper, we review the history of research efforts in RPL; what aspects have been (and have not been) investigated and evaluated, how they have been studied, what was (and was not) implemented, and what remains for more... Conference/Journal: IEEE Communications Surveys & Tutorials ( Volume: 19, Issue: 4, Fourthquarter 2017 ) Publication Date: 09/13/2017 Download: http://bets.cs.berkeley.edu/publications/2017comst_rplsurvey.pdf Author(s): Hyung-Sin Kim, Jeonggil Ko, David Culler, Jeongyeup Paek ## IDK Cascades: Fast Deep Learning by Learning not to Overthink We introduce the "I Don't Know"(IDK) prediction cascades framework, a general framework to systematically compose a set of pre-trained models to accelerate inference without a loss in prediction accuracy. Conference/Journal: Conference on Uncertainty in Artificial Intelligence (UAI) 2018 Publication Date: 07/18/2018 Download: Author(s): Xin Wang, Yujia Luo, Dan Crankshaw, Alexey Tumanov, Fisher Yu, Joseph Gonzalez ## PerfFuzz: Automatically Generating Pathological Inputs Performance problems in software can arise unexpectedly when programs are provided with inputs that exhibit worst-case behavior. A large body of work has focused on diagnosing such problems via statistical profiling techniques. But how does one find these inputs in the first place? We present PerfFuzz, a method to automatically generate inputs that exercise pathological behavior across program locations, without any domain knowledge. PerfFuzz generates inputs via feedback-directed mutational fuzzing. Unlike previous approaches that attempt to maximize only a scalar characteristic such as the total execution path length, PerfFuzz uses multi-dimensional feedback and independently maximizes execution counts for all program locations. This enables PerfFuzz to (1) find a variety of inputs that exercise distinct hot spots in a program and more... Conference/Journal: 27th International Symposium on Software Testing and Analysis Publication Date: 07/12/2018 Download: Author(s): Caroline Lemieux, Rohan Padhye, Koushik Sen, Dawn song ## Tributary: spot-dancing for elastic services with latency SLOs The Tributary elastic control system embraces the uncertain nature of transient cloud resources, such as AWS spot instances, to manage elastic services with latency SLOs more robustly and more cost-effectively. Such resources are available at lower cost, but with the proviso that they can be preempted en masse, making them risky to rely upon for business-critical services. Tributary creates models of preemption likelihood and exploits the partial independence among different resource offerings, selecting collections of resource allocations that satisfy SLO requirements and adjusting them over time, as client workloads change. Although Tributary’s collections are often larger than required in the absence of preemptions, they are cheaper because of both lower spot costs and partial refunds for preempted resources. At the more... Conference/Journal: 2018 USENIX Annual Technical Conference Publication Date: 07/11/2018 Download: https://www.usenix.org/system/files/conference/atc18/atc18-harlap.pdf Author(s): Aaron Harlap, Andrew Chung, Alexey Tumanov, Gregory R. Ganger, Phillip B. Gibbons ## Towards Fast and Scalable Graph Pattern Mining While there has been a tremendous interest in processing graph-structured data, existing distributed graph processing systems take several minutes or even hours to mine simple patterns on graphs. In this paper, we try to answer the question of whether it is possible to build a graph pattern mining engine that is both fast and scalable. Leveraging the observation that in several pattern mining tasks, providing an approximate answer is good enough, we propose the use of approximation for graph pattern mining. However, we find that existing approximation techniques do not work for this purpose. Based on this, we present a new approach for approximate graph pattern mining that leverages recent advancements in graph approximation theory. Our preliminary evaluations show encouraging more... Conference/Journal: 10th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '18) Publication Date: 07/09/2018 Download: Author(s): Anand Padmanabha Iyer, Zaoxing Liu, Xin Jin, Shivaram Venkataraman, Vladimir Braverman, Ion Stoica ## Monarch: Gaining Command on Geo-Distributed Graph Analytics A number of existing and emerging application scenarios generate graph-structured data in a geo-distributed fashion. Although there is a lot of interest in distributed graph processing systems, none of them support graphs that are geo-distributed. Geo-distributed analytics, on the other hand, has not focused on iterative workloads such as distributed graph processing. In this paper, we look at the problem of efficient geo-distributed graph analytics. We find that optimizing the iterative processing style of graph-parallel systems is the key to achieving this goal rather than extending existing geo-distributed techniques to graph processing. Based on this, we discuss our proposal on building Monarch, the first system to our knowledge that focuses on geo-distributed graph processing. Our preliminary evaluation of Monarch shows more... Conference/Journal: 10th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '18) Publication Date: 07/09/2018 Download: Author(s): Anand Padmanabha Iyer, Aurojit Panda, Mosharaf Chowdhury, Aditya Akella, Scott Shenker, Ion Stoica ## Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions Neural networks rely on convolutions to aggregate spatial information. However, spatial convolutions are expensive in terms of model size and computation, both of which grow quadratically with respect to kernel size. In this paper, we present a parameter-free, FLOP-free “shift” operation as an alternative to spatial convolutions. We fuse shifts and point-wise convolutions to construct end-to-end trainable shift-based modules, with a hyperparameter characterizing the tradeoff between accuracy and efficiency. To demonstrate the operation’s efficacy, we replace ResNet’s 3×3 convolutions with shift-based modules for improved CIFAR10 and CIFAR100 accuracy using 60% fewer parameters; we additionally demonstrate the operation’s resilience to parameter reduction on ImageNet, outperforming ResNet family members. We finally show the shift operation’s applicability across domains, achieving strong performance with more... Conference/Journal: Conference on Computer Vision and Pattern Recognition (CVPR) Publication Date: 06/18/2018 Download: https://arxiv.org/pdf/1711.08141.pdf Author(s): Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, Kurt Keutzer ## Bridging the GAP: Towards Approximate Graph Analytics While there has been a tremendous interest in processing data that has an underlying graph structure, existing distributed graph processing systems take several minutes or even hours to execute popular graph algorithms. However, in several cases, providing an approximate answer is good enough. Approximate analytics is seeing considerable attention in big data due to its ability to produce timely results by trading accuracy, but they do not support graph analytics. In this paper, we bridge this gap and take a first attempt at realizing approximate graph analytics. We discuss how traditional approximate analytics techniques do not carry over to the graph usecase. Leveraging the characteristics of graph properties and algorithms, we propose a graph sparsification technique, and a machine learning more... Conference/Journal: SIGMOD Graph Data-management Experiences & Systems (GRADES) Publication Date: 06/10/2018 Download: Author(s): Anand Padmanabha Iyer, Aurojit Panda, Shivaram Venkataraman, Mosharaf Chowdhury, Aditya Akella, Scott Shenker, Ion Stoica ## FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud We present FireSim, an open-source simulation platform that enables cycle-exact microarchitectural simulation of large scale-out clusters by combining FPGA-accelerated simulation of silicon-proven RTL designs with a scalable, distributed network simulation. Unlike prior FPGA-accelerated simulation tools, FireSim runs on Amazon EC2 F1, a public cloud FPGA platform, which greatly improves usability, provides elasticity, and lowers the cost of large-scale FPGA-based experiments. We describe the design and implementation of FireSim and show how it can provide sufficient performance to run modern applications at scale, to enable true hardware-software co-design. As an example, we demonstrate automatically generating and deploying a target cluster of 1,024 3.2 GHz quad-core server nodes, each with 16 GB of DRAM, interconnected by a 200 Gbit/s network with 2 microsecond latency, which simulates at a 3.4 MHz processor clock rate (less than 1,000x slowdown over real-time). In aggregate, this FireSim instantiation simulates 4,096 cores and 16 TB of memory, runs ̃14 billion instructions per second, and harnesses 12.8 million dollars worth of FPGAs—at a total cost of only ̃$100 per simulation hour to the user. We present several examples to show how FireSim can be used to explore various research directions in warehouse-scale machine design, including modeling networks with high-bandwidth and low-latency, integrating arbitrary RTL designs for a variety of commodity and specialized datacenter nodes, and modeling a variety of datacenter organizations, as well as reusing the scale-out FireSim infrastructure to enable fast, massively parallel cycle-exact single-node microarchitectural experimentation.

Conference/Journal: 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA 2018)
Publication Date: 06/04/2018
Download:
Author(s): Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, Krste Asanovic

## DIZK: A Distributed Zero Knowledge Proof System

Recently there has been much academic and industrial interest in practical implementations of zero knowledge proofs. These techniques allow a party to prove to another party that a given statement is true without revealing any additional information. In a Bitcoin-like system, this allows a payer to prove validity of a payment without disclosing the payment’s details. Unfortunately, the existing systems for generating such proofs are very expensive, especially in terms of memory overhead. Worse yet, these systems are “monolithic”, so they are limited by the memory resources of a single machine. This severely limits their practical applicability. We describe DIZK, a system that distributes the generation of a zero knowledge proof across machines in a compute cluster. Using a set more...

Conference/Journal: USENIX Security 2018
Publication Date:
Download:
Author(s): Howard Wu, Wenting Zheng, Alessandro Chiesa, Raluca Ada Popa, Ion Stoica

## NetChain: Scale-Free Sub-RTT Coordination

Coordination services are a fundamental building block of modern cloud systems, providing critical functionalities like configuration management and distributed locking. The major challenge is to achieve low latency and high throughput while providing strong consistency and fault-tolerance. Traditional server-based solutions require multiple round-trip times (RTTs) to process a query. This paper presents NetChain, a new approach that provides scale-free sub-RTT coordination in datacenters. NetChain exploits recent advances in programmable switches to store data and process queries entirely in the network data plane. This eliminates the query processing at coordination servers and cuts the end-to-end latency to as little as half of an RTT—clients only experience processing delay from their own software stack plus network delay, which in a datacenter setting more...

Conference/Journal: 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI '18)
Publication Date: 04/09/2018
Download:
Author(s): Xin Jin, Ion Stoica, Xiaozhou Li, Haoyu Zhang , Nate Foster, Jeongkeun Lee, Robert Soulé, Changhoon Kim

## SafeBricks: Shielding Network Functions in the Cloud

With the advent of network function virtualization (NFV), outsourcing network processing to the cloud is growing in popularity amongst enterprises and organizations. Such outsourcing, however, poses a threat to the security of the client’s traffic because the cloud is notoriously susceptible to attacks. We present SafeBricks, a system that shields generic network functions (NFs) from an untrusted cloud. SafeBricks ensures that only encrypted traffic is exposed to the cloud provider, and preserves the integrity of both traffic and the NFs. At the same time, it enables clients to reduce their trust in NF implementations by enforcing least privilege across NFs deployed in a chain. SafeBricks does not require changes to TLS, and safeguards the interests of NF vendors as well more...

Conference/Journal: 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI) 2018
Publication Date: 04/09/2018
Download:
Author(s): Rishabh Poddar, Chang Lan, Raluca Ada Popa, Sylvia Ratnasamy

## Oblix: An Efficient Oblivious Search Index

Search indices are fundamental building blocks of many systems, and there is great interest in running them on encrypted data. Unfortunately, many known schemes that enable search queries on encrypted data achieve efficiency at the expense of security, as they reveal access patterns to the encrypted data. In this paper we present Oblix, a search index for encrypted data that is oblivious (provably hides access patterns), is dynamic (supports inserts and deletes), and has good efficiency. Oblix relies on a combination of novel oblivious-access techniques and recent hardware enclave platforms (e.g., Intel SGX). In particular, a key technical contribution is the design and implementation of doubly-oblivious data structures, in which the client’s accesses to its internal memory are oblivious, in more...

Conference/Journal: 39th IEEE Symposium on Security and Privacy
Publication Date: 05/21/2018
Download:
Author(s): Pratyush Mishra, Rishabh Poddar, Jerry Chen, Alessandro Chiesa, Raluca Ada Popa

## Design and Analysis of a Query Processor for Brick

Brick is a recently proposed metadata schema and ontology for describing building components and the relationships between them. It represents buildings as directed labeled graphs using the RDF data model. Using the SPARQL query language, building-agnostic applications query a Brick graph to discover the set of resources and relationships they require to operate. Latency-sensitive applications, such as user interfaces, demand response and modelpredictive control, require fast queries — conventionally less than 100ms. We benchmark a set of popular open-source and commercial SPARQL databases against three real Brick models using seven application queries and find that none of them meet this performance target. This lack of performance can be attributed to design decisions that optimize for queries over large graphs consisting more...

Conference/Journal: 4th ACM International Conference on Systems for Energy-Efficient Built Environments (BuildSys), 2017
Publication Date: 11/09/2017
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2018/01/hoddb.pdf
Author(s): Gabe Fierro, Rishabh Poddar

## Blink: A fast NVLink-based collective communication library

Efficiently training large deep learning models requires scaling training across a number of GPUs. When training at scale, synchronizing parameters across GPUs introduces significant overheads. To improve synchronization performance, recently available hardware like NVIDIA-DGX1 introduce support for high bandwidth NVLinks across GPUs and software libraries like NCCL implement collective communication primitives like broadcast, all-reduce. However NCCL uses ring-based protocols which do not always use all the available links. To achieve better link utilization, we propose Blink, a family of protocols that use a broadcast-based data transfer mechanism. We describe an AllReduce protocol for the DGX-1 machine and present initial benchmark results that show that Blink can achieve a 2x speedup when compared to NCCL 2.

Conference/Journal: SysML 2018
Publication Date: 02/15/2018
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2018/01/blink-2-page-11_50.pdf
Author(s): Guanhua Wang, Amar Phanishayee, Shivaram Venkataraman, Ion Stoica

## The emission mobilityscope: personalized data collection for agile urban planning

We have built an initial version of an open source mobilityscope or transportation meter that can track end to end travel patterns using smartphone sensors and user input. The travel patterns can be used to encourage sustainable transportation choices at the personal level, and to plan infrastructure that removes barriers to energy efficient travel at the aggregate level. The mobilityscope is easily customized to the needs of a specific campus, and is integrated with a game that can be used for tailored challenges.

Conference/Journal: California Higher Education Sustainability Conference (CHESC), Santa Barbara CA, June 2016
Publication Date:
Download:
Author(s): K. Shankari, Cathy Wu

## Optimizing the diamond lane: A more tractable carpool problem and algorithms

Carpooling has been long deemed a promising approach to better utilizing existing transportation infrastructure. However, there are several reasons carpooling is still not the preferred mode of commute in the United States: first, complex human factors, including trust, compatibility, and not having right incentive structures, discourage the sharing of rides; second, algorithmic and technical barriers inhibit the development of online services for matching riders. High-occupancy vehicle (HOV) lanes which permit vehicles that hold three or more people (HOV3+) have been seen to simultaneously decrease trust concerns and dramatically reduce travel times, thereby providing a promising avenue for addressing both types of issues. The goal of this article is to present algorithms for optimizing the use of HOV3+ lanes. We first more...

Conference/Journal: 19th IEEE Intelligent Transportation Systems Conference
Publication Date:
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2018/01/diamond-lane-ITSC16.pdf
Author(s): K. Shankari, Ece Kamar, Randy Katz, Raluca Ada Popa, Christos Papadimitriou, Eric Horvitz, Alexandre Bayen

## RLlib: Abstractions for Distributed Reinforcement Learning

Reinforcement learning (RL) algorithms involve the deep nesting of highly irregular computation patterns, each of which typically exhibits opportunities for distributed computation. We argue for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute tasks. We demonstrate the benefits of this principle through RLlib: a library that provides scalable software primitives for RL. These primitives enable a broad range of algorithms to be implemented with high performance, scalability, and substantial code reuse. RLlib is available at https://github.com/ray-project/ray.

Conference/Journal: International Conference on Machine Learning (ICML 2018)
Publication Date: 12/07/2017
Download:
Author(s): Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Kenneth Goldberg, Joseph Gonzalez, Michael Jordan, Ion Stoica

## SnapLink: Fast and Accurate Vision-Based Appliance Control in Large Commercial Buildings.

As the number and heterogeneity of appliances in smart buildings increases, identifying and controlling them becomes challenging. Existing methods face various challenges when deployed in large commercial buildings. For example, voice command assistants require users to memorize many control commands. Attaching Bluetooth dongles or QR codes to appliances introduces considerable deployment overhead. In comparison, identifying an appliance by simply pointing a smartphone camera at it and controlling the appliance using a graphical overlay interface is more intuitive. We introduce SnapLink, a responsive and accurate vision-based system for mobile appliance identification and interaction using image localization. Compared to the image retrieval approaches used in previous vision-based appliance control systems, SnapLink exploits 3D models to improve identification accuracy and reduce deployment overhead more...

Conference/Journal: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Vol. 1, No. 4
Publication Date: 12/01/2017
Download:
Author(s): Kaifei Chen, Jonathan Fürst, Jack Kolb, Hyung-Sin Kim, Xin Jin, David Culler, Randy Katz

## DDCO: Discovery of Deep Continuous Options for Robot Learning from Demonstrations

DDCO learns form demonstrations continuous control skills parametrized by deep networks, their termination condition, and high-level policies to use them.

Conference/Journal: Proceedings of the 1st Annual Conference on Robot Learning (CoRL) 2017
Publication Date: 11/13/2017
Download:
Author(s): Sanjay Krishnan, Roy Fox, Ion Stoica, Kenneth Goldberg

## Sketched Ridge Regression: Optimization Perspective, Statistical Perspective, and Model Averaging

We address the statistical and optimization impacts of using classical sketch versus Hessian sketch to solve approximately the Matrix Ridge Regression (MRR) problem. Prior research has considered the effects of classical sketch on least squares regression (LSR), a strictly simpler problem. We establish that classical sketch has a similar effect upon the optimization properties of MRR as it does on those of LSR—namely, it recovers nearly optimal solutions. In contrast, Hessian sketch does not have this guarantee, instead, the approximation error is governed by a subtle interplay between the “mass” in the responses and the optimal objective value. For both types of approximations, the regularization in the sketched MRR problem gives it significantly different statistical properties from the sketched LSR more...

Conference/Journal: International Conference on Machine Learning
Publication Date: 08/07/2017
Download:
Author(s): Shusen Wang, Alex Gittens, Michael Mahoney

## Full-System Simulation of Java Workloads with RISC-V and the Jikes Research Virtual Machine

Managed languages such as Java, JavaScript or Python account for a large portion of workloads, both in cloud data centers and on mobile devices. It is therefore unsurprising that there is an interest in hardware-software co-design for these languages. However, existing research infrastructure is often unsuitable for this kind of research: managed languages are sensitive to fine-grained interactions that are not captured by high-level architectural models, yet are also too long-running and irregular to be simulated using cycle-accurate software simulators. Open-source hardware based on the RISC-V ISA provides an opportunity to solve this problem, by running managed workloads on RISC-V systems in FPGA-based full-system simulation. This approach achieves both the accuracy and simulation speeds required for managed workloads, while enabling more...

Conference/Journal: 1st Workshop on Computer Architecture Research with RISC-V (CARRV '17), Boston, MA, October 2017
Publication Date: 10/14/2017
Download:
Author(s): Martin Maas, Krste Asanovic, John Kubiatowicz

## A Berkeley View of Systems Challenges for AI

In this paper, we propose several open research directions in systems, architectures, and security that can address these challenges and help unlock AI’s potential to improve lives and society.

Conference/Journal: UC Berkeley EECS Technical Report
Publication Date: 10/16/2017
Download:
Author(s): Ion Stoica, Dawn song, Raluca Ada Popa, Kenneth Goldberg, Michael Mahoney, Randy Katz, Anthony Joseph, Michael Jordan, Joe Hellerstein, et al.

## Finite-Size Corrections and Likelihood Ratio Fluctuations in the Spiked Wigner Model

In this paper we study principal components analysis in the regime of high dimensionality and high noise. Our model of the problem is a rank-one deformation of a Wigner matrix where the signal-to-noise ratio (SNR) is of constant order, and we are interested in the fundamental limits of detection of the spike. Our main goal is to gain a fine understanding of the asymptotics for the log-likelihood ratio process, also known as the free energy, as a function of the SNR. Our main results are twofold. We first prove that the free energy has a finite-size correction to its limit—the replica-symmetric formula—which we explicitly compute. This provides a formula for the Kullback-Leibler divergence between the planted and null models. Second, more...

Conference/Journal:
Publication Date:
Download:
Author(s): Ahmed El Alaoui, Florent Krzakala, Michael Jordan

## Drizzle: Fast and Adaptable Stream Processing at Scale

Large scale streaming systems aim to provide high throughput and low latency. They are often used to run mission-critical applications, and must be available 24×7. Thus such systems need to adapt to failures and inherent changes in workloads, with minimal impact on latency and throughput. Unfortunately, existing solutions require operators to choose between achieving low latency during normal operation and incurring minimal impact during adaptation. Continuous operator streaming systems, such as Naiad and Flink, provide low latency during normal execution but incur high overheads during adaptation (e.g., recovery), while micro-batch systems, such as Spark Streaming and FlumeJava, adapt rapidly at the cost of high latency during normal operations. Our key observation is that while streaming workloads require millisecond-level processing, workload more...

Conference/Journal: Symposium on Operating Systems Principles (SOSP) 2017
Publication Date: 10/28/2017
Download:
Author(s): Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Michael Armbrust, Ali Ghodsi, Mike Franklin, Benjamin Recht, Ion Stoica

## Capacity Releasing Diffusion for Speed and Locality

Diffusions and related random walk procedures are of central importance in many areas of machine learning, data analysis, and applied mathematics. Because they spread mass agnostically at each step in an iterative manner, they can sometimes spread mass “too aggressively,” thereby failing to find the “right” clusters. We introduce a novel Capacity Releasing Diffusion (CRD) Process, which is both faster and stays more local than the classical spectral diffusion process. As an application, we use our CRD Process to develop an improved local algorithm for graph clustering. Our local graph clustering method can find local clusters in a model of clustering where one begins the CRD Process in a cluster whose vertices are connected better internally than externally by an more...

Conference/Journal: Proceedings of the 34th International Conference on Machine Learning (ICML '17)
Publication Date: 07/28/2017
Download:
Author(s): Shivaram Venkataraman, Di Wang, Kimon Fountoulakis, Monika Henzinger, Michael W. Mahoney, Satish Rao

## A Scalable Distributed Spatial Index for the Internet-of-Things

The increasing interest in the Internet-of-Things (IoT) suggest that a new source of big data is imminent—they would likely be produced by machines and sensors in the IoT ecosystem. The fundamental characteristic of the data produced by these sources is that they are inherently geospatial in nature. In addition, they exhibit unprecedented and unpredictable skew. Thus, big data systems designed for IoT applications must be able to efficiently ingest, index and query spatial data having heavy and unpredictable skew. Spatial indexing is well explored area of research in literature, but little or no attention has been given to the topic of efficient distributed spatial indexing. In this paper, we propose SIFT, a distributed spatial index and its implementation. Unlike systems more...

Conference/Journal: ACM Symposium on Cloud Computing 2017 (SoCC '17)
Publication Date: 09/25/2017
Download:
Author(s): Anand Padmanabha Iyer, Ion Stoica

## Learning certifiably optimal rule lists for categorical data

As machine learning continues to gain prominence in socially-important decision-making, the interpretability of predictive models remains a crucial problem. Our goal is to build models that are highly predictive, transparent, and easily understood by humans. We use rule lists, also known as decision lists, to achieve this goal. Rule lists are lists composed of if-then statements, which are easily interpreted; the rules give a reason for each prediction. We present the design and implementation of a custom discrete optimization technique for building rule lists over a categorical feature space. Our algorithm provides the optimal solution, with a certificate of optimality. By leveraging algorithmic bounds, efficient data structures, and computational reuse, we achieve several orders of magnitude speedup in time and more...

Conference/Journal: 23rd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '17)
Publication Date: 08/15/2017
Download:
Author(s): Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, Cynthia Rudin

## Selecting the Best VM across Multiple Public Clouds: A Data-Driven Performance Modeling Approach

Users of cloud services are presented with a bewildering choice of VM types and the choice of VM can have significant implications on performance and cost. In this paper we address the fundamental problem of accurately and economically choosing the best VM for a given workload and user goals. To address the problem of optimal VM selection, we present PARIS, a data-driven system that uses a novel hybrid offline and online data collection and modeling framework to provide accurate performance estimates with minimal data collection. PARIS is able to predict workload performance for different user-specified metrics, and resulting costs for a wide range of VM types and workloads across multiple cloud providers. When compared to a sophisticated baseline linear interpolation more...

Conference/Journal: ACM Symposium on Cloud Computing 2017 (SoCC '17)
Publication Date: 09/25/2017
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2017/07/paris_socc17.pdf
Author(s): Neeraja Yadwadkar, Bharath Hariharan, Joseph Gonzalez, Burton Smith, Randy Katz

## Data Tweening: Incremental Visualization of Data Transforms

In the context of interactive query sessions, it is common to issue a succession of queries, transforming a dataset to the desired result. It is often difficult to comprehend a succession of transformations, especially for complex queries. Thus, to facilitate understanding of each data transformation and to provide continuous feedback, we introduce the concept of “data tweening”, i.e., interpolating between resultsets, presenting to the user a series of incremental visual representations of a resultset transformation. We present tweening methods that consider not just the changes in the result, but also the changes in the query. Through user studies, we show that data tweening allows users to efficiently comprehend data transforms, and also enables them to gain a better understanding of more...

Conference/Journal: PVLDB
Publication Date: 07/01/2017
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2017/07/p661-khan.pdf
Author(s): Neeraja Yadwadkar, Larry Xu, Arnab Nandi, Joe Hellerstein

## Automating Diagnosis of Cellular Radio Access Network Problems

In an increasingly mobile connected world, our user experience of mobile applications more and more depends on the performance of cellular radio access networks (RANs). To achieve high quality of experience for the end user, it is imperative that operators effectively identify and diagnose performance problems quickly. In this paper, we describe our experience in understanding the challenges in automating the detection and diagnosis of performance problems in RANs. Working with a major cellular network operator on a part of their RAN that services more than 2 million users, we demonstrate that fine-grained modeling and analysis is the key towards this goal. We describe our methodology in analyzing RAN problems, and highlight a few of our findings, some previously unknown. We more...

Conference/Journal: 23rd ACM Annual International Conference on Mobile Computing and Networking (MobiCom 2017)
Publication Date: 10/16/2017
Download:
Author(s): Anand Padmanabha Iyer, Li Erran Li, Ion Stoica

## Occupy the Cloud: Distributed Computing for the 99%

Distributed computing remains inaccessible to a large number of users, in spite of many open source platforms and extensive commercial offerings. While distributed computation frameworks have moved beyond a simple map-reduce model, many users are still left to struggle with complex cluster management and configuration tools, even for running simple embarrassingly parallel jobs. We argue that stateless functions represent a viable platform for these users, eliminating cluster management overhead, fulfilling the promise of elasticity. Furthermore, using our prototype implementation, PyWren, we show that this model is general enough to implement a number of distributed computing models, such as BSP, efficiently. Extrapolating from recent trends in network bandwidth and the advent of disaggregated storage, we suggest that stateless functions are a more...

Conference/Journal: ACM Symposium on Cloud Computing (SoCC) 2017
Publication Date: 09/25/2017
Download:
Author(s): Eric Jonas, Qifan Pu, Shivaram Venkataraman, Ion Stoica, Benjamin Recht

## Return of the Runtimes: Rethinking the Language Runtime System for the Cloud 3.0 Era

The public cloud is moving to a Platform-as-a-Service model where services such as data management, machine learning or image classification are provided by the cloud operator while applications are written in high-level languages and leverage these services. Managed languages such as Java, Python or Scala are widely used in this setting. However, while these languages can increase productivity, they are often associated with problems such as unpredictable garbage collection pauses or warm-up overheads. We argue that the reason for these problems is that current language runtime systems were not initially designed for the cloud setting. To address this, we propose seven tenets for designing future language runtime systems for cloud data centers. We then outline the design of a general more...

Conference/Journal: 16th Workshop on Hot Topics in Operating Systems (HotOS '17)
Publication Date: 05/09/2017
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2017/07/cloud30-hotos17.pdf
Author(s): Martin Maas, Krste Asanovic, John Kubiatowicz

## EC-Cache: Load-balanced, Low-latency Cluster Caching with Online Erasure Coding

Data-intensive clusters and object stores are increasingly relying on in-memory object caching to meet the I/O performance demands. These systems routinely face the challenges of popularity skew, background load imbalance, and server failures, which result in severe load imbalance across servers and degraded I/O performance. Selective replication is a commonly used technique to tackle these challenges, where the number of cached replicas of an object is proportional to its popularity. In this paper, we explore an alternative approach using erasure coding. EC-Cache is a load-balanced, low latency cluster cache that uses online erasure coding to overcome the limitations of selective replication. EC-Cache employs erasure coding by: (i) splitting and erasure coding individual objects during writes, and (ii) late binding, wherein obtaining any k out of (k+r) splits of an object are sufficient, during reads. As compared to selective replication, EC-Cache improves load balancing by more than 3x and reduces the median and tail read latencies by more than 2x for typical parameters, while using the same amount of memory. EC-Cache does so using 10% additional bandwidth and a small increase in the amount of stored metadata. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance and server failures.

Conference/Journal: USENIX OSDI
Publication Date: 11/03/2016
Download:
Author(s): Martin Maas, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, Kannan Ramchandran

## Breaking Locality Accelerates Block Gauss-Seidel

We analyze the accelerated block Gauss-Seidel algorithm in the random coordinate sampling setting

Conference/Journal: International Conference on Machine Learning (ICML 2017)
Publication Date: 06/22/2017
Download:
Author(s): Shivaram Venkataraman, Stephen Tu, Ashia C. Wilson, Alex Gittens, Michael Jordan, Benjamin Recht

## Real-Time Machine Learning: The Missing Pieces

Machine learning applications are increasingly deployed not only to serve predictions using static models, but also as tightly-integrated components of feedback loops involving dynamic, real-time decision making. These applications pose a new set of requirements, none of which are difficult to achieve in isolation, but the combination of which creates a challenge for existing distributed execution frameworks: computation with millisecond latency at high throughput, adaptive construction of arbitrary task graphs, and execution of heterogeneous kernels over diverse sets of resources. We assert that a new distributed execution framework is needed for such ML applications and propose a candidate approach with a proof-of-concept architecture that achieves a 63x performance improvement over a state-of-the-art execution framework for a representative application.

Conference/Journal: HotOS 2017
Publication Date: 05/10/2017
Download:
Author(s): Robert Nishihara, Philipp Moritz, Stephanie Wang, Alexey Tumanov, William Paul, Johann Schleier-Smith, Richard Liaw, Mehrdad Niknami, Michael Jordan, Ion Stoica

## ZipG: A Memory-efficient Graph Store for Interactive Queries

ZipG is a distributed memory-efficient graph store for serving interactive graph queries.

Conference/Journal: SIGMOD '17
Publication Date: 05/14/2017
Download:
Author(s): Anurag Khandelwal, Zongheng Yang, Evan Ye, Rachit Agarwal, Ion Stoica

## Grail Quest: A New Proposal for Hardware-assisted Garbage Collection

Many big data systems are written in garbage-collected languages and GC has a substantial impact on throughput, responsiveness and predicability of these systems. However, despite decades of research, there is still no “Holy Grail” of GC: a collector with no measurable impact, even on real-time applications. Such a collector needs to achieve freedom from pauses, high GC throughput and good memory utilization, without slowing down application threads or using substantial amounts of compute resources. In this paper, we propose a step towards this elusive goal by reviving the old idea of moving GC into hardware. We discuss the trends that make it the perfect time to revisit this approach and present the design of a hardware-assisted GC that aims to more...

Conference/Journal: Sixth Workshop on Architectures and Systems for Big Data (ASBD 2016)
Publication Date: 03/19/2017
Download:
Author(s): Martin Maas, Krste Asanovic, John Kubiatowicz

## Putting logic-based distributed systems on stable grounds

In the Declarative Networking paradigm, Datalog-like languages are used to express distributed computations. Whereas recently formal operational semantics for these languages have been developed, a corresponding declarative semantics has been lacking so far. The challenge is to capture precisely the amount of nondeterminism that is inherent to distributed computations due to concurrency, networking delays, and asynchronous communication. This paper shows how a declarative, model-based semantics can be obtained by simply using the well-known stable model semantics for Datalog with negation. We show that the model-based semantics matches previously proposed formal operational semantics.

Conference/Journal: Theory and Practice of Logic Programming, Volume 16, Issue 4
Publication Date:
Download:
Author(s): Joe Hellerstein, TOM J. AMELOOT, JAN VAN DEN BUSSCHE, WILLIAM R. MARCZAK, PETER ALVARO

## SparkR: Scaling R Programs with Spark

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine’s memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark’s distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.

Conference/Journal:
Publication Date:
Download:
Author(s): Shivaram Venkataraman, Ali Ghodsi, Ion Stoica

## Decentralized Anonymous Micropayments

Anonymity-preserving micropayments for Bitcoin-like currencies.

Conference/Journal: EUROCRYPT 2017
Publication Date: 04/30/2017
Download:
Author(s): Alessandro Chiesa, Matthew Green, Jingcheng Liu, Peihan Miao, Ian Miers, Pratyush Mishra

## MiniCrypt: Reconciling Encryption and Compression for Big Data Stores.

More and more applications and web services generate larger and larger amounts of confidential data, such as user and financial data. On one hand, these systems must use encryption to ensure confidentiality, while on the other hand, they want to use compression to reduce costs and increase performance. Unfortunately, encryption and compression are in tension, leading many existing systems to support one but not the other. We propose MiniCrypt,  the first big data keyvalue store that reconciles encryption and compression, without compromising performance.  At the core of MiniCrypt is an observation on data compressibility trends in key-value stores, which enables grouping key-value pairs in small key packs, together with a set of new distributed systems techniques for retrieving, updating,  merging more...

Conference/Journal: EuroSys 2017
Publication Date:
Download:
Author(s): Wenting Zheng, Raluca Ada Popa, Ion Stoica, Rachit Agarwal, Frank Li

## Opaque: An Oblivious and Encrypted Distributed Analytics Platform.

As enterprises move to cloud-based analytics, the risk of cloud security breaches poses a serious threat. Encrypting data at rest and in transit is a major first step. However, data must still be decrypted in memory for processing, exposing it to an attacker who has compromised the operating system or hypervisor. Trusted hardware such as Intel SGX has recently become available in latest-generation processors. Such hardware enables arbitrary computation on encrypted data while shielding it from a malicious OS or hypervisor. However, it still suffers from a significant side channel: access pattern leakage. We present Opaque, a package for Apache Spark SQL that enables very strong security for SQL queries: data encryption, computation verification, and access pattern leakage protection (a.k.a. more...

Conference/Journal: NSDI 2017 (USENIX Symposium of Networked Systems Design and Implementation)
Publication Date: 03/27/2017
Download:
Author(s): Wenting Zheng, Raluca Ada Popa, Ion Stoica, Joseph Gonzalez, Ankur Dave, Jethro Beekman

## Random Projection Design for Scalable Implicit Smoothing of Randomly Observed Stochastic Processes

Standard methods for multi-variate time series analysis are hampered by sampling at random timestamps, long range dependencies , and the scale of the data. In this paper we present a novel estimator for cross-covariance of randomly observed time series which identifies the dynamics of an unobserved stochastic process. We analyze the statistical properties of our estimator without the assumption that observation timestamps are independent from the process of interest and show that our solution does not suffer from the corresponding issues affecting standard estimators for cross-covariance. We implement and evaluate our statistically sound and scalable approach in the distributed setting using Apache Spark and demonstrate its ability to identify interactions between processes on simulations and financial data with tens of millions of samples. Pdf: Aistats_camera_ready

Conference/Journal: AISTATS 2017
Publication Date:
Download:
Author(s): Wenting Zheng, Joseph Gonzalez, Evan Sparks, Alexandre M. Bayen

## Decoding from Pooled data: Phase Transitions of Message Passing

We consider the problem of decoding a discrete signal of categorical variables from the observation of several histograms of pooled subsets of it. We present an Approximate Message Passing (AMP) algorithm for recovering the signal in the random dense setting where each observed histogram involves a random subset of entries of size proportional to n. We characterize the performance of the algorithm in the asymptotic regime where the number of observations m tends to infinity proportionally to n, by deriving the corresponding State Evolution (SE) equations and studying their dynamics. We initiate the analysis of the multi-dimensional SE dynamics by proving their convergence to a fixed point, along with some further properties of the iterates. The analysis reveals sharp phase more...

Conference/Journal: short version submitted to International Symposium on Information Theory (ISIT), long version to be submitted to IEEE Transactions on Information Theory (IEEEIT)
Publication Date:
Download:
Author(s): Aaditya Ramdas, Ahmed El Alaoui, Michael Jordan, Florent Krzakala, Lenka Zdeborova

## Decoding from Pooled data: Sharp Information-Theoretic Bounds

Consider a population consisting of n individuals, each of whom has one of d types (e.g. their blood type, in which case d=4). We are allowed to query this database by specifying a subset of the population, and in response we observe a noiseless histogram (a d-dimensional vector of counts) of types of the pooled individuals. This measurement model arises in practical situations such as pooling of genetic data and may also be motivated by privacy considerations. We are interested in the number of queries one needs to unambiguously determine the type of each individual. In this paper, we study this information-theoretic question under the random, dense setting where in each query, a random subset of individuals of size proportional more...

Conference/Journal: submitted to Annals of Applied Probability (AoAP)
Publication Date:
Download:
Author(s): Ahmed El Alaoui, Aaditya Ramdas, Michael Jordan, Florent Krzakala, Lenka Zdeborova

## Universality of Mallows’ and degeneracy of Kendall’s kernels for rankings

Kernel methods provide an attractive framework for aggregating and learning from ranking data, and so understanding the fundamental properties of kernels over permutations is a question of broad interest. We provide a detailed analysis of the Fourier spectra of the standard Kendall and Mallows kernels, and a new class of polynomial-type kernels. We prove that the Kendall kernel has exactly two irreducible representations at which the Fourier transform is non-zero, and moreover, the associated matrices are rank one. This implies that the Kendall kernel is nearly degenerate, with limited expressive and discriminative power. In sharp contrast, we prove that the Fourier transform of the Mallows kernel is a strictly positive definite matrix at all irreducible representations. This property guarantees that more...

Conference/Journal:
Publication Date:
Download:
Author(s): Horia Mania, Aaditya Ramdas, Martin J. Wainwright, Michael Jordan, Benjamin Recht

## Proteus: agile ML elasticity through tiered reliability in dynamic resource markets

Many shared computing clusters allow users to utilize excess idle resources at lower cost or priority, with the proviso that some or all may be taken away at any time. But, exploiting such dynamic resource availability and the often fluctuating markets for them requires agile elasticity and effective acquisition strategies. Proteus aggressively exploits such transient revocable resources to do machine learning (ML) cheaper and/or faster. Its parameter server framework, AgileML, efficiently adapts to bulk additions and revocations of transient machines, through a novel 3-stage active-backup approach, with minimal use of more costly non-transient resources. Its BidBrain component adaptively allocates resources from multiple EC2 spot markets to minimize average cost per work as transient resource availability and cost change over time. more...

Conference/Journal: ACM European Conference on Computer Systems (EuroSys'2017)
Publication Date:
Download: http://www.pdl.cmu.edu/PDL-FTP/BigLearning/Proteus.pdf
Author(s): Horia Mania, Aaron Harlap, Alexey Tumanov, Andrew Chung, Gregory R. Ganger, Phil Gibbons

## Morpheus: Towards Automated SLOs for Enterprise Clusters

Modern resource management frameworks for largescale analytics leave unresolved the problematic tension between high cluster utilization and job’s performance predictability—respectively coveted by operators and users. We address this in Morpheus, a new system that: 1) codifies implicit user expectations as explicit Service Level Objectives (SLOs), inferred from historical data, 2) enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability, and 3) mitigates inherent performance variance (e.g., due to failures) by means of dynamic reprovisioning of jobs. We validate these ideas against production traces from a 50k node cluster, and show that Morpheus can lower the number of deadline violations by 5x to 13x, while retaining cluster-utilization, and lowering cluster footprint by 14% to 28%. We demonstrate more...

Conference/Journal: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16).
Publication Date:
Download: https://www.usenix.org/system/files/conference/osdi16/osdi16-jyothi.pdf
Author(s): Horia Mania, C. Curino, I. Menache, S. Narayanamurthy, Alexey Tumanov, J. Yaniv, R. Mavlyutov, I. Goiri, S. Krishnan, J. Kulkarni, S. Rao

## Clipper: A Low-Latency Online Prediction Serving System

Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment. In this paper, we introduce Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks and applications. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluate Clipper on four common machine learning benchmark datasets and demonstrate its ability to meet the latency, accuracy, more...

Conference/Journal: NSDI 2017
Publication Date:
Download:
Author(s): Dan Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph Gonzalez, Ion Stoica

## A DeVIL-ish approach to inconsistency in interactive visualizations

Declarative languages have a long tradition in both the database systems and data visualization communities, separating specifications from implementations.Declarative Visual Interaction Language presents a unified relational model for interactive visualization applications, and provides a basis to analyze aspects of interaction such as performance, correctness, and expressiveness. In this paper, we focus on a specific benefit provided by our approach: managing the consistency of interactive visualizations in the face of inherent asynchrony and reordering of events in modern data visualizations.

Conference/Journal: HILDA 2016
Publication Date:
Download:
Author(s): Yifan Wu, Joe Hellerstein, Eugene Wu

## Iterative methods for solving factorized linear systems

Stochastic iterative algorithms such as the Kaczmarz and Gauss-Seidel methods have gained recent attention because of their speed, simplicity, and the ability to approximately solve large-scale linear systems of equations without needing to access the entire matrix. In this work, we consider the setting where we wish to solve a linear system in a large matrix X that is stored in a factorized form, X = UV; this setting either arises naturally in many applications or may be imposed when working with large low-rank datasets for reasons of space required for storage. We propose a variant of the randomized Kaczmarz method for such systems that takes advantage of the factored form, and avoids computing X. We prove an exponential convergence more...

Conference/Journal:
Publication Date: 01/25/2017
Download:
Author(s): Aaditya Ramdas, Anna Ma, Deanna Needell

## High Performance Transactions via Early Write Visibility

In order to guarantee recoverable transaction execution, database systems permit a transaction’s writes to be observable only at the end of its execution. As a consequence, there is generally a delay between the time a transaction performs a write and the time later transactions are permitted to read it. This delayed write visibility can significantly impact the performance of serializable database systems by reducing concurrency among conflicting transactions. This paper makes the observation that delayed write visibility stems from the fact that database systems can arbitrarily abort transactions at any point during their execution. Accordingly, we make the case for database systems which only abort transactions under a restricted set of conditions, thereby enabling a new recoverability mechanism, early write more...

Conference/Journal: PVLDB
Publication Date: 01/13/2017
Download:
Author(s): Joe Hellerstein, Jose M. Faleiro, Daniel J. Abadi

## Ground: A Data Context Service

Ground is an open-source data context service , a system to manage all the information that informs the use of data. Data usage has changed both philosophically and practically in the last decade, creating an opportunity for new data context services to foster further innovation. In this paper we frame the challenges of managing data context with basic ABCs: Applications, Behavior, and Change. We provide motivation and design guidelines, present our initial design of a common metamodel and API, and explore the current state of the storage solutions that could serve the needs of a data context service. Along the way we highlight opportunities for new research and engineering solutions.

Conference/Journal: CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research
Publication Date: 01/09/2017
Download:
Author(s): Joe Hellerstein, Joseph Gonzalez, Vikram Sreekanti, et al.

## The p-filter: multi-layer FDR control for grouped hypotheses

In many practical applications of multiple hypothesis testing using the False Discovery Rate (FDR), the given hypotheses can be naturally partitioned into groups, and one may not only want to control the number of false discoveries (wrongly rejected null hypotheses), but also the number of falsely discovered groups of hypotheses (we say a group is falsely discovered if at least one hypothesis within that group is rejected, when in reality the group contains only nulls). In this paper, we introduce the p-filter, a procedure which unifies and generalizes the standard FDR procedure by Benjamini and Hochberg and global null testing procedure by Simes. We first prove that our proposed method can simultaneously control the overall FDR at the finest level more...

Conference/Journal: Journal of the Royal Statistical Society, Series B (Methodology)
Publication Date: 11/30/2016
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2017/03/paper1_arxiv.pdf
Author(s): Joe Hellerstein

## Reprogrammable Redundancy for Cache Vmin Reduction in a 28nm RISC-V Processor

The presented processor lowers SRAM-based cache Vmin by using three architectural techniques–bit bypass (BB), dynamic column redundancy (DCR), and line disable (LD)–that use low-overhead reprogrammable redundancy (RR) to avoid failing bitcells and therefore increase the maximum bitcell failure rate in processor caches. In the 28nm chip, the Vmin of the 1MB L2 cache is reduced by 25%, resulting in a 49% power reduction with a 2% area overhead and minimal timing overhead.

Conference/Journal: A-SSCC’16
Publication Date: 11/15/2016
Download:
Author(s): Krste Asanovic, B. Zimmer, P. F. Chiu, Bora Nikolic

## Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy

We propose a method to optimize the representation and distinguishability of samples from two probability distributions, by maximizing the estimated power of a statistical test based on the maximum mean discrepancy (MMD). This optimized MMD is applied to the setting of unsupervised learning by generative adversarial networks (GAN), in which a model attempts to generate realistic samples, and a discriminator attempts to tell these apart from data samples. In this context, the MMD may be used in two roles: first, as a discriminator, either directly on the samples, or on features of the samples. Second, the MMD can be used to evaluate the performance of a generative model, by testing the model’s samples against a reference data set. In the more...

Conference/Journal: 5th International Conference on Learning Representations, Toulon, 2017
Publication Date: 11/14/2016
Download:
Author(s):

## ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Real-time predictive applications can demand continuous and agile development, with new models constantly being trained, tested, and then deployed. Training and testing are done by replaying stored event logs, running new models in the context of historical data in a form of backtesting or "what if?" analysis. To replay weeks or months of logs while developers wait, we need systems that can stream event logs through prediction logic many times faster than the real-time rate. A challenge with high-speed replay is preserving sequential semantics while harnessing parallel processing power. The crux of the problem lies with causal dependencies inherent in the sequential semantics of log replay.

Conference/Journal: SoCC '16 - Proceedings of the Seventh ACM Symposium on Cloud Computing
Publication Date: 10/06/2016
Download:
Author(s): Johann Schleier-Smith, Erik T. Krogen, Joe Hellerstein

## Sub-microsecond Adaptive Voltage Scaling in a 28nm FD-SOI Processor

This work presents a RISC-V system-on-chip (SoC) with integrated voltage regulation and power management implemented in 28nm FD-SOI. A fully integrated switched-capacitor DC-DC converter, coupled with an adaptive clocking system, achieves 82-89% system conversion efficiency of 41.8 double-precision GFLOPS/W. Measurement circuits can detect changes in processor workload and an integrated power management unit responds by adjusting the core voltage at sub-microsecond timescales. The power management system reduces the energy consumption of a synthetic benchmark by 39.8% with negligible performance penalty and 2.0% area overhead, enabling extremely fine-grained (<1µs) adaptive voltage scaling for mobile devices.

Conference/Journal: ESSCIRC - ESSDERC 2016
Publication Date: 09/15/2016
Download:
Author(s): Krste Asanovic, Benjamin Keller, Martin Cochet, Brian Zimmer, Jaehwa Kwak, Alberto Puggelli, Steven Bailey, Borivoje Nikolic, Palmer Dabbelt, et al.

## Scalable Atomic Visibility with RAMP Transactions

Databases can provide scalability by partitioning data across several servers. However, multipartition, multioperation transactional access is often expensive, employing coordination-intensive locking, validation, or scheduling mechanisms. Accordingly, many real-world systems avoid mechanisms that provide useful semantics for multipartition operations. This leads to incorrect behavior for a large class of applications including secondary indexing, foreign key enforcement, and materialized view maintenance. In this work, we identify a new isolation model—Read Atomic (RA) isolation—that matches the requirements of these use cases by ensuring atomic visibility: either all or none of each transaction’s updates are observed by other transactions. We present algorithms for Read Atomic Multipartition (RAMP) transactions that enforce atomic visibility while offering excellent scalability, guaranteed commit despite partial failures (via coordination-free execution), more...

Conference/Journal:
Publication Date: 08/15/2016
Download:
Author(s): Joe Hellerstein, Ion Stoica, Ali Ghodsi, Alan Fekete, Peter Bailis

## The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V

This report makes the case that a well-designed Reduced Instruction Set Computer (RISC) can match, and even exceed, the performance and code density of existing commercial Complex Instruction Set Computers (CISC) while maintaining the simplicity and cost-effectiveness that underpins the original RISC goals [12]. We begin by comparing the dynamic instruction counts and dynamic instruction bytes fetched for the popular proprietary ARMv7, ARMv8, IA-32, and x86-64 Instruction Set Architectures (ISAs) against the free and open RISC-V RV64G and RV64GC ISAs when running the SPEC CINT2006 benchmark suite. RISCV was designed as a very small ISA to support a wide range of implementations, and has a less mature compiler toolchain. However, we observe that on SPEC CINT2006 RV64G executes on average more...

Conference/Journal: Technical Report
Publication Date: 07/15/2016
Download:
Author(s): Krste Asanovic, Christopher Celio, Daniel Dabbelt, David A. Patterson

## Time-evolving Graph Processing at Scale

Time-evolving graph-structured big data arises naturally in many application domains such as social networks and communication networks. However, existing graph processing systems lack support for efficient computations on dynamic graphs. In this paper, we represent most computations on time evolving graphs into (1) a stream of consistent and resilient graph snapshots, and (2) a small set of operators that manipulate such streams of snapshots. We then introduce GraphTau, a time-evolving graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphTau quickly builds fault-tolerant graph snapshots as each small batch of new data arrives. GraphTau achieves high performance and fault tolerant graph stream processing via a number of optimizations. GraphTau also unifies data streaming and more...

Conference/Journal: Graph Data-management Experiences & Systems (GRADES)
Publication Date: 06/24/2016
Download: https://rise.cs.berkeley.edu/wp-content/uploads/2017/02/graphtau-grades.pdf
Author(s): Anand Padmanabha Iyer, Li Erran Li, Tathagata Das, Ion Stoica

## Vector Processors for Energy-Efficient Embedded Systems

High-performance embedded processors are frequently designed as arrays of small, in-order scalar cores, even when their workloads exhibit high degrees of data-level parallelism (DLP). We show that these multiple instruction, multiple data (MIMD) systems can be made more efficient by instead directly exploiting DLP using a modern vector architecture. In our study, we compare arrays of scalar cores to vector machines of comparable silicon area and power consumption. Since vectors provide greater performance across the board – in some cases even with better programmability – we believe that embedded system designers should increasingly pursue vector architectures for machines at this scale.

Conference/Journal: ISCA 2016
Publication Date: 06/15/2016
Download:
Author(s): Krste Asanovic, Daniel Dabbelt, Colin Schmidt, Eric Love, Howard Mao, Sagar Karandikar

## Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL

This paper presents a sample-based energy simulation methodology that enables fast and accurate estimations of performance and average power for arbitrary RTL designs. Our approach uses an FPGA to simultaneously simulate the performance of an RTL design and to collect samples containing exact RTL state snapshots. Each snapshot is then replayed in gate-level simulation, resulting in a workload-specific average power estimate with confidence intervals. For arbitrary RTL and workloads, our methodology guarantees a minimum of fourorders- of-magnitude speedup over commercial CAD gate-level simulation tools and gives average energy estimates guaranteed to be within 5% of the true average energy with 99% confidence. We believe our open-source sample-based energy simulation tool Strober can not only rapidly provide ground truth for more more...

Conference/Journal: ISCA 2016
Publication Date: 06/15/2016
Download:
Author(s): Krste Asanovic, Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeum Kim, Brian Zimmer

## Multi-Task Learning for Straggler Avoiding Predictive Job Scheduling

Parallel processing frameworks (Dean and Ghemawat, 2004) accelerate jobs by breaking them into tasks that execute in parallel. However, slow running or straggler tasks can run up to 8 times slower than the median task on a production cluster (Ananthanarayanan et al., 2013), leading to delayed job completion and inefficient use of resources. Existing straggler mitigation techniques wait to detect stragglers and then relaunch them, delaying straggler detection and wasting resources. We built Wrangler (Yadwadkar et al., 2014), a system that predicts when stragglers are going to occur and makes scheduling decisions to avoid such situations. To capture node and workload variability, Wrangler built separate models for every node and workload, requiring the time-consuming collection of substantial training data. In more...

Conference/Journal: Journal of Machine Learning Research
Publication Date: 06/01/2016
Download: http://jmlr.org/papers/v17/15-149.html
Author(s): Neeraja Yadwadkar, Bharath Hariharan, Joseph Gonzalez, Randy Katz