EPIC Lab receives $2M NSF grant to build tools for criminal justice big datasets
Originally posted on EECS News; submitted by Magdalene L. Crowley on September 7, 2021. CS Prof. Joseph Hellerstein and Assistant Profs. Aditya Parameswaran and Sarah Chasins are among the principal investigators of a new lab that has just received a $2M grant from the National Science Foundation to make big datasets used by the criminal justice system more accessible to non-technical researchers. The Effective Programming, Interaction, and Computation with Data (EPIC) Lab will create tools that use machine learning, program synthesis, and human-centered design to improve the ability of public defenders, investigators, and paralegals to research police misconduct, judicial decision-making, and related issues, for their…
MAGE Wins Best Paper Award at OSDI
The paper “MAGE: Nearly Zero-Cost Virtual Memory for Secure Computation” was among three selected for the Best Paper Award at OSDI ’21. Congratulations to Sam Kumar, Dave Culler, and Raluca Ada Popa on their win. Read the paper here. Abstract: Secure Computation (SC) is a family of cryptographic primitives for computing on encrypted data in single-party and multi-party settings. SC is being increasingly adopted by industry for a variety of applications. A significant obstacle to using SC for practical applications is the memory overhead of the underlying cryptography. We develop MAGE, an execution engine for SC that efficiently runs SC computations that do not fit in memory. We observe that, due to their intended security guarantees, SC schemes are inherently oblivious—their…
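The obliviousness noted at the end of the abstract is the key to MAGE's approach: an SC program's memory access pattern is known before it runs, so paging decisions can be planned offline rather than guessed at runtime. As a rough, purely illustrative Python sketch of that idea (our toy, not MAGE's actual planner), knowing the full access trace up front lets you evict the page whose next use is furthest in the future:

```python
def plan_evictions(trace, num_frames):
    """Toy offline paging planner (Belady's MIN policy): given the full
    access trace in advance -- which oblivious SC programs provide --
    evict the resident page whose next use is furthest in the future."""
    frames, evictions = set(), []
    for i, page in enumerate(trace):
        if page in frames:
            continue
        if len(frames) == num_frames:
            rest = trace[i + 1:]
            # Pages never used again sort as "infinitely far" in the future.
            victim = max(frames,
                         key=lambda p: rest.index(p) if p in rest else float("inf"))
            frames.remove(victim)
            evictions.append((i, victim))
        frames.add(page)
    return evictions

print(plan_evictions(["a", "b", "c", "a", "b", "d", "a"], num_frames=2))
# [(2, 'b'), (4, 'c'), (5, 'b')]
```

Ordinary programs cannot plan this way because their accesses depend on data seen only at runtime; oblivious ones can, which is what lets MAGE run computations that do not fit in memory at nearly zero cost.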
Great Forbes article about Databricks
Databricks is one of the companies with roots in UC Berkeley – specifically, the AMPLab and RISELab. This Forbes article gives an in-depth narrative of their phenomenal success. (Subscription may be required.)
Professor Popa: Decentralized Security CS 294-163
Lectures: Tue/Thur 3:30pm – 4:59pm, 310 Soda. Course description: Recently, there has been much excitement in both academia and industry around the notion of decentralized security, which refers to, loosely speaking, security mechanisms that do not rely on the trustworthiness of any central entity. In only a few years, this area has generated many beautiful cryptographic constructs as well as exciting systems with real-world adoption. The course will cover topics such as decentralized ledgers, blockchain/cryptocurrencies, decentralized access control, secure multi-party computation, federated learning, coopetitive learning, and others. This is an advanced course, which will go deeply into both cryptography and systems. A solid foundation in cryptography is required, and a similar foundation in systems is beneficial. Logistics: The course is…
RISELab March Newsletter
For a summary of recent news and publications, check out the RISELab Newsletter here.
Modern Parallel and Distributed Python: A Quick Tutorial on Ray
Ray is an open source project for parallel and distributed Python. This article was originally posted here. Parallel and distributed computing are a staple of modern applications. We need to leverage multiple cores or multiple machines to speed up applications or to run them at a large scale. The infrastructure for crawling the web and responding to search queries is not a single-threaded program running on someone’s laptop but rather a collection of services that communicate and interact with one another. This post will describe how to use Ray to easily build applications that can scale from your laptop to a large cluster. Why Ray? Many tutorials explain how to use Python’s multiprocessing module. Unfortunately, the multiprocessing module is severely limited in…
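For readers who haven't seen Ray before, here is a minimal sketch of its core primitive (the function and workload are illustrative): decorating a plain Python function turns it into a remote task whose invocations run in parallel, on your laptop's cores or across a cluster, with no code changes.

```python
import ray

ray.init()  # Starts Ray locally; on a cluster the same code scales out.

# The decorator turns an ordinary function into an asynchronous remote task.
@ray.remote
def square(x):
    return x * x

# Each call returns a future immediately; the four tasks run in parallel.
futures = [square.remote(i) for i in range(4)]

# Block until all results are available.
print(ray.get(futures))  # [0, 1, 4, 9]
```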
A History of Postgres
(crossposted from databeta.wordpress.com) The ACM began commissioning a series of reminiscence books on Turing Award winners. Thanks to hard work by editor Michael Brodie, the first one is Mike Stonebraker’s book, which just came out. I was asked to write the chapter on Postgres. I was one of the large and distinguished crew of grad students on the Postgres project, so this was fun. ACM in its wisdom decided that these books would be published in a relatively traditional fashion—i.e. you have to pay for them. The publisher, Morgan & Claypool, has this tip for students and ACM members: Please note that the Bitly link goes to a landing page where Students, ACM Members, and Institutions who have access to the ACM…
RISE Camp 2018 Tutorials now available
Tutorials from RISE Camp, held at International House in Berkeley, CA on October 11-12, 2018, are now available on the RISE Camp website here. Additional information, including the agenda, can also be found there.
Going Fast and Cheap: How We Made Anna Autoscale
Background: In an earlier blog post, we described a system called Anna, which used a shared-nothing, thread-per-core architecture to achieve lightning-fast speeds by avoiding all coordination mechanisms. Anna also used lattice composition to enable a rich variety of coordination-free consistency levels. The first version of Anna blew existing in-memory KVSes (key-value stores) out of the water: Anna is up to 700x faster than Masstree, an earlier state-of-the-art research KVS, and up to 800x faster than Intel’s “lock-free” TBB hash table. You can find the previous blog post here and the full paper here. We refer to that version of Anna as “Anna v0.” In this post, we describe how we extended the fastest KVS in the cloud to be extremely cost-efficient and…
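For readers new to lattices, here is a toy Python sketch (our illustration, not Anna's actual C++ implementation) of the property that makes coordination-free consistency possible: all updates flow through a commutative, associative, idempotent merge, so replicas can apply them in any order and still converge.

```python
class LWWRegister:
    """Toy last-writer-wins lattice: merge keeps the write with the higher
    timestamp, so merging is commutative, associative, and idempotent."""

    def __init__(self, timestamp, value):
        self.timestamp, self.value = timestamp, value

    def merge(self, other):
        if other.timestamp > self.timestamp:
            self.timestamp, self.value = other.timestamp, other.value
        return self

# Two replicas receive the same writes in different orders...
a = LWWRegister(1, "x").merge(LWWRegister(3, "z")).merge(LWWRegister(2, "y"))
b = LWWRegister(2, "y").merge(LWWRegister(3, "z")).merge(LWWRegister(1, "x"))

# ...and still agree on the final state, with no coordination needed.
assert (a.timestamp, a.value) == (b.timestamp, b.value) == (3, "z")
```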
AP releases “bombshell” report on Google’s location history prompted by RISELab blog post
Back in May 2018, K. Shankari posted a blog post on the blurry boundaries associated with Google location tracking and the interesting questions it raised around consent, control, and competition. An AP reporter, Ryan Nakashima, saw the post and contacted her for more details. While he was not able to reproduce the behavior she had observed, he was able to work with Jonathan Mayer‘s group at Princeton to find similarly unclear and confusing privacy policies related to location history. The resulting story and step-by-step guide were posted on Monday, Aug 13, and had a large impact. In the technical press, it was picked up by at least Wired, CNET, TechCrunch, Gizmodo, and Slashdot. Ryan, Jonathan and Shankari…
Exploratory data analysis of genomic datasets using ADAM and Mango with Apache Spark on Amazon EMR (AWS Big Data Blog Repost)
Note: This blog post is replicated from the AWS Big Data Blog and can be found here. As the cost of genomic sequencing has rapidly decreased, the amount of publicly available genomic data has soared over the past couple of years. New cohorts and studies have produced massive datasets consisting of over 100,000 individuals. Simultaneously, these datasets have been processed to extract genetic variation across populations, producing massive amounts of variation data for each cohort. In this era of big data, tools like Apache Spark have provided a user-friendly platform for batch processing of large datasets. However, to use such tools as a sufficient replacement for current bioinformatics pipelines, we need more accessible and comprehensive APIs for processing genomic data. We…
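To give a flavor of the analytics style the post builds on, here is a minimal PySpark sketch (the input path and column name are hypothetical placeholders, not the post's actual pipeline) of a cohort-scale batch aggregate:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("genomics-eda").getOrCreate()

# Hypothetical input: variant calls already converted to Parquet.
variants = spark.read.parquet("s3://my-bucket/cohort/variants.parquet")

# A simple cohort-scale aggregate: count variants per contig.
variants.groupBy("contigName").count().orderBy("count", ascending=False).show()
```

Libraries like ADAM layer genomics-aware schemas and operations on top of exactly this kind of Spark DataFrame workflow.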
Implementing A Parameter Server in 15 Lines of Python with Ray
This blog post was originally posted here. View the code on Gist.
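Since the embedded gist does not render here, below is a minimal sketch in the same spirit (our reconstruction, not the original gist): a Ray actor holds the shared parameters, and remote worker tasks read and update them concurrently.

```python
import numpy as np
import ray

ray.init()

@ray.remote
class ParameterServer:
    """Actor that owns the shared parameter vector."""
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def get_params(self):
        return self.params

    def update_params(self, grad):
        self.params += grad

@ray.remote
def worker(ps, num_steps):
    for _ in range(num_steps):
        params = ray.get(ps.get_params.remote())
        # Stand-in for a real gradient computed from `params`.
        ps.update_params.remote(np.ones_like(params))

ps = ParameterServer.remote(dim=10)
ray.get([worker.remote(ps, num_steps=5) for _ in range(4)])
print(ray.get(ps.get_params.remote()))  # Every entry is 4 * 5 = 20.0
```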
A Short History of Prediction-Serving Systems
Machine learning is an enabling technology that transforms data into solutions by extracting patterns that generalize to new data. Much of machine learning can be reduced to learning a model — a function that maps an input (e.g., a photo) to a prediction (e.g., the objects in the photo). Once trained, these models can be used to make predictions on new inputs (e.g., new photos) and as part of more complex decisions (e.g., whether to promote a photo). While there are thousands of papers published each year on how to design and train models, there is surprisingly little research on how to manage and deploy such models once they are trained. It is this latter, often overlooked, topic that we discuss…
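To ground the term before the history begins: "prediction serving" simply means exposing that trained input-to-prediction function so live applications can query it. A toy sketch (entirely illustrative; the endpoint and stand-in model are not from any system discussed in the post):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def model(features):
    # Stand-in for a trained model: maps a feature vector to a label.
    return "cat" if sum(features) > 0 else "dog"

@app.route("/predict", methods=["POST"])
def predict():
    # Applications send inputs over the network and get predictions back.
    features = request.get_json()["features"]
    return jsonify({"prediction": model(features)})

if __name__ == "__main__":
    app.run(port=8080)
```

Everything a real serving system adds, batching, latency targets, model versioning and updates, is machinery around this basic loop.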
MLPerf: SPEC for ML
RISELab at UC Berkeley today joins Baidu, Google, Harvard University, and Stanford University to announce MLPerf, a new benchmark suite for machine learning, at the O’Reilly AI conference in New York City (see https://mlperf.org/). The MLPerf effort aims to build a common set of benchmarks that enables the machine learning (ML) field to measure system performance, eventually for both training and inference, from mobile devices to cloud services. We believe that a widely accepted benchmark suite will benefit the entire community, including researchers, developers, builders of machine learning frameworks, cloud service providers, hardware manufacturers, application providers, and end users. Historical Inspiration. We are motivated in part by the System Performance Evaluation Cooperative (SPEC) benchmark for general-purpose computing that drove rapid,…
Anna: A Crazy Fast, Super-Scalable, Flexibly Consistent KVS 🗺
This article cross-posted from the DataBeta blog. There’s fast and there’s fast. This post is about Anna, a key/value database design from our team at Berkeley that’s got phenomenal speed and buttery smooth scaling, with an unprecedented range of consistency guarantees. Details are in our upcoming ICDE18 paper on Anna. Conventional wisdom (or at least Jeff Dean wisdom) says that you have to redesign your system every time you scale by 10x. As researchers, we asked the counter-cultural question: what would it take to build a key-value store that would excel across many orders of magnitude of scale, from a single multicore box to the global cloud? Turns out this kind of curiosity can lead to a system with pretty interesting practical…
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark
This work was done in collaboration with Ding Ding and Sergey Ermolin from Intel. In recent years, the scale of datasets and models used in deep learning has increased dramatically. Although larger datasets and models can improve accuracy in many AI applications, they often take much longer to train on a single machine. Yet, in contrast to the Big Data world, where distributed processing has long been the norm, distributing training across large clusters is still uncommon with today’s popular deep learning frameworks: large GPU clusters are hard to come by, and the frameworks offer few convenient facilities for distributed training. By leveraging the cluster distribution capabilities in Apache Spark, BigDL successfully performs very large-scale distributed…
Ray 0.2 released!
Ray 0.2 has been released: https://ray-project.github.io/2017/09/30/ray-0.2-release.html
RISELab and the 5G Innovators Initiative (5GI2)
5G, also known as Fifth Generation Mobile Networks, is an emerging global telecommunication system designed to deliver significantly higher wireless data bandwidths for the next generation of consumer, commercial, and industrial applications. It promises data rates of 10–100 Mbps for tens of thousands of simultaneous users in a metropolitan area, 1 Gbps indoors, and connectivity for hundreds of thousands of simultaneously connected sensors. As important as these enhanced bandwidths will be the software extensibility and configurability of the 5G network, making it possible to partition and customize network bandwidth and services for a variety of site- and area-specific applications and to support diverse devices at the network edge. RISELab and our industrial sponsors Ericsson, Intel, and…