Exploratory data analysis of genomic datasets using ADAM and Mango with Apache Spark on Amazon EMR (AWS Big Data Blog Repost)

Alyssa Morrow blog, Distributed Systems, Open Source, Projects, Uncategorized

Note: This blog post is replicated from the AWS Big Data Blog and can be found here. As the cost of genomic sequencing has rapidly decreased, the amount of publicly available genomic data has soared over the past couple of years. New cohorts and studies have produced massive datasets consisting of over 100,000 individuals. Simultaneously, these datasets have been processed to extract genetic variation across populations, producing massive amounts of variation data for each cohort. In this era of big data, tools like Apache Spark have provided a user-friendly platform for batch processing of large datasets. However, for such tools to be a sufficient replacement for current bioinformatics pipelines, we need more accessible and comprehensive APIs for processing genomic data. We …
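To give a flavor of what Spark-based batch processing of variation data looks like, here is a minimal, hypothetical PySpark sketch. It uses plain Spark SQL rather than the ADAM or Mango APIs, and the file path and column layout are assumptions made purely for illustration.

```python
# Hypothetical PySpark sketch (plain Spark SQL, not the ADAM/Mango APIs):
# count variants per chromosome from a VCF-like tab-separated file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("variant-counts").getOrCreate()

# Assumed layout: tab-separated columns chrom, pos, ref, alt; "#" header lines skipped.
variants = (
    spark.read
    .option("sep", "\t")
    .option("comment", "#")
    .csv("s3://my-bucket/cohort.variants.tsv")  # hypothetical path
    .toDF("chrom", "pos", "ref", "alt")
)

# Count variants per chromosome across the whole cohort.
variants.groupBy("chrom").count().orderBy("chrom").show()
```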

A Short History of Prediction-Serving Systems

Daniel Crankshaw blog, Uncategorized

Machine learning is an enabling technology that transforms data into solutions by extracting patterns that generalize to new data. Much of machine learning can be reduced to learning a model — a function that maps an input (e.g. a photo) to a prediction (e.g. objects in the photo). Once trained, these models can be used to make predictions on new inputs (e.g., new photos) and as part of more complex decisions (e.g., whether to promote a photo). While there are thousands of papers published each year on how to design and train models, there is surprisingly little research on how to manage and deploy such models once they are trained. It is this latter, often overlooked, topic that we discuss …
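To make the "model as a function" framing concrete, here is a minimal sketch that trains a model and then wraps it in a prediction function. scikit-learn and the toy data are illustrative choices, not the serving systems surveyed in the post.

```python
# Minimal sketch: a trained model is just a function from inputs to predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training: learn a model from labeled examples (toy data for illustration).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_train, y_train)

# Serving: evaluate the learned function on new inputs. A prediction-serving
# system must do exactly this, quickly and reliably, behind an API.
def predict(x: float) -> int:
    return int(model.predict(np.array([[x]]))[0])

print(predict(2.5))  # predicts a class label for a new input
```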

The Right to not be Tracked II: in which I turn off the location permission for Google, but it tracks me anyway

K. Shankari blog

I recently published a post about the blurry boundaries between standard system services and Google Maps on Android. I argued that these boundaries made it hard to talk about consent and competition around location services. However, the branching factor for the data sharing made the argument complex and hard to follow. Even as I was writing that post, on the train on the way into Berkeley, I started getting notifications from the Google app about the weather at my location. The Google app (aka Google Now) is a virtual assistant that is intended to provide helpful, context-sensitive information to users. It is closed source, pre-installed, and it cannot be uninstalled or disabled. And I had already turned off all its …

The Right to not be Tracked: a Spotlight on Google Maps and Android Location Tracking

K. Shankari blog

There has recently been a lot of interest in the data that Facebook collects about its users. Journalists have been shocked when they downloaded the data that Facebook has on them. Most of this concern has focused on data collected through explicit user interaction, such as web browsing or clicking on “Like” and “Share” buttons. Background data collection, which occurs without any explicit user intervention, is arguably creepier, because it collects data whether or not you interact with the service. For example, Facebook has been criticized for logging texts and phone calls in the background. Facebook argues that users consented to sharing the data, although many users are still skeptical about how explicit the consent was. Similarly, Uber had to backtrack …

Michael I. Jordan: Artificial Intelligence — The Revolution Hasn’t Happened Yet

Boban Zarkovich blog

(This article was originally published on Medium.com.) Artificial Intelligence (AI) is the mantra of the current era. The phrase is intoned by technologists, academicians, journalists and venture capitalists alike. As with many phrases that cross over from technical academic fields into general circulation, there is significant misunderstanding accompanying the use of the phrase. But this is not the classical case of the public not understanding the scientists — here the scientists are often as befuddled as the public. The idea that our era is somehow seeing the emergence of an intelligence in silicon that rivals our own entertains all of us — enthralling us and frightening us in equal measure. And, unfortunately, it distracts us. There is a different narrative that one can …

Open source platform + undergraduate energy = sustainability research

K. Shankari blog

This Earth Day, join a study on motivating sustainable transportation behavior. I blogged earlier about the e-mission project in the context of the National Transportation Data Challenge (https://rise.cs.berkeley.edu/blog/making-cities-safer-data-collection-vision-zero/). To recap, e-mission focuses on building an extensible platform that can instrument the end-to-end multi-modal travel experience at the personal scale and collate it for analysis at the societal scale. In particular, it combines background data collection of trips, classified by mode, with user-reported incident data and context-sensitive surveys. I also blogged earlier about involving undergraduates in research (https://amplab.cs.berkeley.edu/getting-a-dozen-20-year-olds-to-work-together-for-fun-and-social-good/). To recap, the challenges at the time included managing different skill levels, compressing the learn-plan-build cycle into one semester, and the fact that undergraduates typically don’t have the experience to build platform …

Online Foundations of Data Science Course Launches on edX!

Boban Zarkovich blog

UC Berkeley’s pathbreaking entry-level course on the Foundations of Data Science (Data 8) is launching on edX on April 2, making the fastest-growing class in UC Berkeley history available to everyone. Foundations of Data Science teaches computational and inferential thinking from the ground up. It covers testing hypotheses, applying statistical inference, visualizing distributions, and drawing conclusions, all while coding in Python and using real-world data sets. The course is taught by award-winning Berkeley professors and designed by a team of faculty working together across Berkeley’s Computer Science and Statistics Departments, led by RISE faculty Michael Jordan. The three 5-week online courses cover: Foundations of Data Science: Computational Thinking with Python, starting on April 2, teaches the basics …

Distributed Policy Optimizers for Scalable and Reproducible Deep RL

Eric Liang blog, Deep Learning, Distributed Systems, Open Source, Ray, Reinforcement Learning

In this blog post we introduce Ray RLlib, an RL execution toolkit built on the Ray distributed execution framework. RLlib implements a collection of distributed policy optimizers that make it easy to use a variety of training strategies with existing reinforcement learning algorithms written in frameworks such as PyTorch, TensorFlow, and Theano. This enables complex architectures for RL training (e.g., Ape-X, IMPALA) to be implemented once and reused many times across different RL algorithms and libraries. We discuss the design and performance of policy optimizers in more detail in the RLlib paper. What’s next for RLlib: In the near term, we plan to continue building out RLlib’s set of policy optimizers and algorithms. Our aim is for RLlib to serve …
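As a rough mental model of the policy optimizer abstraction, here is a conceptual sketch with hypothetical interfaces (not RLlib's actual classes): the optimizer owns the distributed execution strategy, while the algorithm supplies the policy and loss.

```python
# Conceptual sketch with hypothetical interfaces (not RLlib's actual classes).
# The policy optimizer encapsulates the distributed training strategy; the
# algorithm only supplies a policy with compute_gradients/apply_gradients and
# workers with sample(). Swapping in a different optimizer changes how
# experience is gathered and applied without rewriting the algorithm.
class SyncLocalOptimizer:
    """Gather experience from remote workers, then update a local policy."""

    def __init__(self, local_policy, remote_workers):
        self.policy = local_policy
        self.workers = remote_workers

    def step(self):
        # 1. Collect experience in parallel (in Ray, these would be remote actor calls).
        batches = [worker.sample() for worker in self.workers]
        # 2. Compute gradients on the collected batches and apply them locally.
        grads = self.policy.compute_gradients(batches)
        self.policy.apply_gradients(grads)
```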

Anna: A Crazy Fast, Super-Scalable, Flexibly Consistent KVS 🗺

Joe Hellerstein blog, Database Systems, Distributed Systems, Real-Time, Systems, Uncategorized

This article is cross-posted from the DataBeta blog. There’s fast and there’s fast. This post is about Anna, a key/value database design from our team at Berkeley that’s got phenomenal speed and buttery smooth scaling, with an unprecedented range of consistency guarantees. Details are in our upcoming ICDE18 paper on Anna. Conventional wisdom (or at least Jeff Dean wisdom) says that you have to redesign your system every time you scale by 10x. As researchers, we asked the counter-cultural question: what would it take to build a key-value store that would excel across many orders of magnitude of scale, from a single multicore box to the global cloud? It turns out this kind of curiosity can lead to a system with pretty interesting practical …
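For a taste of the coordination-free style of state management the Anna paper describes (a sketch based on the paper's lattice approach, not Anna's actual code), here is a toy merge-based register in Python: concurrent updates are reconciled with an associative, commutative, idempotent merge, so replicas converge without locking.

```python
# Toy merge-based register (illustration of the lattice idea, not Anna's code).
from dataclasses import dataclass

@dataclass(frozen=True)
class LWWRegister:
    timestamp: float
    value: str

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # Keep the write with the larger timestamp (ties broken deterministically
        # by value so the merge stays commutative). Merging in any order, any
        # number of times, yields the same state, so replicas converge.
        return max(self, other, key=lambda r: (r.timestamp, r.value))

a = LWWRegister(1.0, "v1")
b = LWWRegister(2.0, "v2")
assert a.merge(b) == b.merge(a) == b   # order does not matter
assert a.merge(b).merge(b) == b        # repeated merges are idempotent
```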

RISECamp Behind the Scenes

Jey Kottalam blog

RISECamp was held at UC Berkeley on September 7th and 8th. This post looks behind the scenes at the technical infrastructure used to provide a cloud-hosted cluster for each attendee, with ready-to-use Jupyter notebooks requiring only a web browser to access. Background and Requirements: RISECamp is the most recent in a series of workshops held by RISELab (and its predecessor, AMPLab) to showcase the latest research from the lab. The sessions consist of talks on research systems produced by the lab, followed by tutorials and exercises that give attendees hands-on experience with our latest technologies. In the past, attendees used their own laptops to perform the hands-on exercises, with each user setting up a local development environment and manually …

Fast Python Serialization with Ray and Apache Arrow

Robert Nishihara blog, Ray

This post was originally published here. Robert Nishihara and Philipp Moritz are graduate students in the RISELab at UC Berkeley. This post elaborates on the integration between Ray and Apache Arrow. The main problem this addresses is data serialization. From Wikipedia, serialization is … the process of translating data structures or object state into a format that can be stored … or transmitted … and reconstructed later (possibly in a different computer environment). Why is any translation necessary? Well, when you create a Python object, it may have pointers to other Python objects, and these objects are all allocated in different regions of memory, and all of this has to make sense when unpacked by another process on another machine. Serialization and deserialization …
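To make the Wikipedia definition concrete, here is a minimal round-trip using Python's built-in pickle module; it illustrates serialization in general, not the Arrow-based format the post goes on to describe.

```python
# Minimal serialization round-trip with the standard pickle module
# (a generic illustration, not the Arrow-based format described in the post).
import pickle

obj = {"weights": [1.0, 2.0, 3.0], "step": 7, "meta": {"tag": "model"}}

# Serialize: translate the in-memory object graph into bytes that can be
# stored or sent to another process or machine.
blob = pickle.dumps(obj)

# Deserialize: reconstruct an equivalent object, possibly elsewhere.
restored = pickle.loads(blob)
assert restored == obj
```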

Ray: 0.2 Release

Robert Nishihara blog

This was originally posted on the Ray blog. We are pleased to announce the Ray 0.2 release. This release includes the following: substantial performance improvements to the Plasma object store; an initial Jupyter-notebook-based web UI; the start of a scalable reinforcement learning library; and fault tolerance for actors. Plasma: Since the last release, the Plasma object store has moved out of the Ray codebase and is now being developed as part of Apache Arrow (see the relevant documentation), so that it can be used as a standalone component by other projects to leverage high-performance shared memory. In addition, our Arrow-based serialization libraries have been moved into pyarrow (see the relevant documentation). In 0.2, we’ve increased the write throughput of the object store …

Low-Latency Model Serving with Clipper

Daniel Crankshaw blog

The mission of the RISELab is to develop technologies that enable applications to make low-latency decisions on live data with strong security. One of the first steps towards achieving this goal is to study techniques for evaluating machine learning models and quickly rendering predictions. This missing piece of machine learning infrastructure, the prediction-serving system, is critical to delivering real-time and intelligent applications and services. As we studied the prediction-serving problem, two key challenges emerged. The first challenge is supporting the stringent performance demands of interactive serving workloads. As machine learning models improve, they are increasingly being applied in business-critical settings and user-facing interactive applications. This requires models to render predictions that can meet the strict latency requirements of …

Opaque: Secure Apache Spark SQL

Wenting Zheng blog, Security, Systems

As enterprises move to cloud-based analytics, cloud security breaches pose a serious threat. Encrypting data at rest and in transit is a major first step. However, data must still be decrypted in memory for processing, exposing it to any attacker who can observe memory contents. This is a challenging problem because security usually implies a tradeoff between performance and functionality. Cryptographic approaches like fully homomorphic encryption provide full functionality to a system, but are extremely slow. Systems like CryptDB use lighter cryptographic primitives to provide a practical database, but are limited in functionality. Recent developments in trusted hardware enclaves (such as Intel SGX) provide a much-needed alternative. These hardware enclaves provide hardware-enforced shielded execution that allows …

Announcing Ground v0.1

Vikram Sreekanti blog, Ground, News, Open Source, Projects, Systems

We’re excited to be releasing v0.1 of the Ground project! Ground is a data context service. It is a central repository for all the information surrounding the use of data in an organization. Ground concerns itself with what data an organization has, where that data is, who (both human beings and software systems) is touching that data, and how that data is being modified and described. Above all, Ground aims to be an open-source, vendor-neutral system that provides users with an unopinionated metamodel and a set of APIs that allow them to think about and interact with the data context generated in their organization. Ground has many use cases, but we’re focused on two specific ones at present: Data Inventory: large organizations …

Reinforcement Learning brings together RISELab and Berkeley DeepDrive for a joint mini-retreat

Alexey Tumanov blog, Deep Learning, Reinforcement Learning, Systems

On May 2, RISELab and the Berkeley DeepDrive (BDD) lab held a joint, largely student-driven mini-retreat. The event was aimed at exploring research opportunities at the intersection of the BDD and RISE labs. The topical focus of the mini-retreat was emerging AI applications, such as Reinforcement Learning (RL), and computer systems to support such applications. Trevor Darrell kicked off the event with an introduction to the Berkeley DeepDrive lab, followed by Ion Stoica’s overview of RISE. The event offered a great opportunity for researchers from both labs to exchange ideas about their ongoing research activity and discover points of collaboration. Philipp Moritz started the first student talk session with an update on Ray — a distributed execution framework for emerging …

RISELab Announces 3 Open Source Releases

Joe Hellerstein blog, Clipper, Ground, Open Source, Projects, Ray, Systems

Part of the Berkeley tradition—and the RISELab mission—is to release open source software as part of our research agenda. Six months after launching the lab, we’re excited to announce initial v0.1 releases of three RISELab open-source systems: Clipper, Ground and Ray. Clipper is an open-source prediction-serving system. Clipper simplifies deploying models from a wide range of machine learning frameworks by exposing a common REST interface and automatically ensuring low-latency and high-throughput predictions. In the 0.1 release, we focused on reliable support for serving models trained in Spark and Scikit-Learn. In the next release we will be introducing support for TensorFlow and Caffe2, as well as online personalization and multi-armed bandits. We are providing active support for early users and will be following GitHub issues …
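As an illustration of what querying a common REST prediction interface looks like from a client, here is a short sketch; the host, port, application name, and JSON schema are assumptions for illustration, so consult the Clipper documentation for the actual endpoint format.

```python
# Sketch of querying a REST prediction endpoint from a client. The URL,
# application name, and JSON schema below are assumptions for illustration;
# see the Clipper documentation for the actual interface.
import requests

url = "http://localhost:1337/example-app/predict"  # hypothetical endpoint
payload = {"input": [0.1, 0.2, 0.3]}                # hypothetical feature vector

response = requests.post(url, json=payload, timeout=1.0)
print(response.json())
```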

Making cities safer: data collection for Vision Zero

K. Shankari blog

A critical part of enabling cities to implement their Vision Zero policies – the goal of the current National Transportation Data Challenge – is the ability to generate open, multi-modal travel experience data. While existing datasets use police and hospital reports to provide a comprehensive picture of fatalities and life-altering injuries, they are by their nature sparse and resist use for prediction and prioritization. Further, changes to infrastructure to support Vision Zero policies frequently require balancing competing needs from different constituencies – protected bike lanes, dedicated signals and expanded sidewalks all raise concerns that automobile traffic will be severely impacted. A timeline of the El Monte/Marich intersection in Mountain View, from 2014 to 2017, provides an opportunity to …

Declarative Heterogeneity Handling for Datacenter and ML Resources

Alexey Tumanov blog, Systems

Challenge: Heterogeneity in datacenter resources has become a fact of life. We identify and categorize a number of different types of heterogeneity. When talking about heterogeneity, we generally refer to static or dynamic attributes associated with individual resources. Previously, the levels of heterogeneity were fairly benign and limited to a few different types of processor architectures. Now, however, it has become a common trend to deploy hardware accelerators (e.g., Tesla K40/K80, Google TPU, Intel Xeon Phi) and even FPGAs (e.g., the Microsoft Catapult project). Nodes themselves are connected with heterogeneous interconnects, oftentimes with more than one interconnect option available (e.g., 40Gbps Ethernet backbone, InfiniBand, FPGA torus topology). The workloads we consolidate on top of this diverse hardware differ vastly in their success metrics (completion …

RISELab at Spark Summit

Ion Stoica blog

This year, Spark Summit East was held in Boston on February 7-9. With over 1,500 attendees, this was the largest Spark Summit ever held outside the Bay Area. Apache Spark, developed in large part at AMPLab (the precursor of RISELab), is now the de facto standard for big data processing. As at previous Spark Summits, UC Berkeley had a very strong presence. Ion Stoica gave a keynote on RISELab, describing the lab’s research focus on addressing a long-standing grand challenge in computing: enable machines to act autonomously and intelligently, to rapidly and repeatedly take appropriate actions based on information in the world around them. The presentation also discussed some early results from two recent projects, Drizzle and Opaque, which had their own presentations …

Serverless Scientific Computing

Eric Jonas blog, Projects, Systems

For many scientific and engineering users, cloud infrastructure remains challenging to use. While many of their use cases are embarrassingly parallel, the challenges involved in provisioning and using stateful cloud services keep them trapped on their laptops or large shared workstations. Before getting started, a new cloud user confronts a bewildering number of choices. First, what instance type do they need? How do they make the compute/memory tradeoff? How large do they want their cluster to be? Can they take advantage of dynamic market-based instances (spot instances) that can disappear at any time? What if they have 1000 small jobs, each of which takes a few minutes — what’s the most cost-effective way of allocating servers? What host operating …
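The "1000 small jobs" scenario above is the embarrassingly parallel pattern: a pure function mapped over many independent inputs. The sketch below uses the standard library's process pool as a stand-in executor (the function and inputs are hypothetical); a serverless backend could run the same map without the user choosing instance types or cluster sizes.

```python
# Embarrassingly parallel pattern: map a pure function over independent inputs.
# ProcessPoolExecutor is a local stand-in; a serverless executor could run the
# same map without provisioning decisions. simulate() and the inputs are hypothetical.
import math
from concurrent.futures import ProcessPoolExecutor

def simulate(seed: int) -> float:
    # Stand-in for a small, independent scientific job.
    return sum(math.sin(seed * i) for i in range(10_000))

if __name__ == "__main__":
    inputs = range(1000)
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, inputs))
    print(f"{len(results)} jobs completed")
```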

RISELab Kicks Off

Melissa Mecca Administrative, blog

Berkeley’s computer science division has an ongoing tradition of 5-year collaborative research labs. In the fall of 2016 we closed out the most recent of the series: the AMPLab. We think it was a pretty big deal, and many agreed. One great thing about Berkeley is the endless supply of energy and ideas that flows through the place — always bringing changes, building on what came before. In that spirit, we’re fired up to announce the Berkeley RISELab, where we will focus intensely for five years on systems that provide Real-time Intelligence with Secure Execution. Context: RISELab represents the next chapter in the ongoing story of data-intensive systems at Berkeley; a proactive step to move beyond Big Data analytics into …