Exploratory data analysis of genomic datasets using ADAM and Mango with Apache Spark on Amazon EMR (AWS Big Data Blog Repost)

Alyssa Morrow blog, Distributed Systems, Open Source, Projects, Uncategorized

Note: This blog post is reposted from the AWS Big Data Blog and can be found here. As the cost of genomic sequencing has rapidly decreased, the amount of publicly available genomic data has soared over the past couple of years. New cohorts and studies have produced massive datasets consisting of over 100,000 individuals. Simultaneously, these datasets have been processed to extract genetic variation across populations, producing massive amounts of variation data for each cohort. In this era of big data, tools like Apache Spark have provided a user-friendly platform for batch processing of large datasets. However, for such tools to sufficiently replace current bioinformatics pipelines, we need more accessible and comprehensive APIs for processing genomic data. We …
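As a taste of what such an API looks like, here is a minimal sketch of loading aligned reads with ADAM on Spark. The loadAlignments call is part of ADAM's public ADAMContext API; the S3 path, app name, and the count step are illustrative placeholders, not anything from the post:

```scala
// Minimal sketch: exploratory read counting with ADAM on Spark.
// The bucket path and app name below are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
import org.bdgenomics.adam.rdd.ADAMContext._ // adds loadAlignments et al. to SparkContext

object GenomicEdaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("adam-eda"))
    // Load aligned reads (e.g., BAM or ADAM Parquet) as a distributed dataset.
    val reads = sc.loadAlignments("s3://example-bucket/sample.bam")
    println(s"Total reads: ${reads.rdd.count()}")
    sc.stop()
  }
}
```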

A Short History of Prediction-Serving Systems

Daniel Crankshaw blog, Uncategorized

Machine learning is an enabling technology that transforms data into solutions by extracting patterns that generalize to new data. Much of machine learning can be reduced to learning a model — a function that maps an input (e.g., a photo) to a prediction (e.g., the objects in the photo). Once trained, these models can be used to make predictions on new inputs (e.g., new photos) and as part of more complex decisions (e.g., whether to promote a photo). While there are thousands of papers published each year on how to design and train models, there is surprisingly little research on how to manage and deploy such models once they are trained. It is this latter, often overlooked, topic that we discuss …
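To make the model-as-a-function view concrete, here is a toy sketch of our own (not from the post); the string-length "model" is a hypothetical stand-in for any trained model, and serving reduces to applying that function to new inputs:

```scala
// Toy sketch: a trained model is just input => prediction, and a serving
// system wraps that function behind a query interface.
trait Model[I, O] { def predict(input: I): O }

object ServingSketch extends App {
  // Hypothetical "model": scores an input string by its length.
  val lengthModel: Model[String, Int] = (s: String) => s.length

  // Serving applies the trained function to a new input (a "query").
  def serve[I, O](model: Model[I, O])(request: I): O = model.predict(request)

  println(serve(lengthModel)("a new photo")) // prints 11
}
```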

MLPerf: SPEC for ML

David Patterson Deep Learning, News, Open Source, Optimization, Reinforcement Learning, Systems, Uncategorized

The RISE Lab at UC Berkeley today joins Baidu, Google, Harvard University, and Stanford University to announce a new benchmark suite for machine learning called MLPerf at the O’Reilly AI conference in New York City (see https://mlperf.org/). The MLPerf effort aims to build a common set of benchmarks that enables the machine learning (ML) field to measure system performance, eventually for both training and inference, from mobile devices to cloud services. We believe that a widely accepted benchmark suite will benefit the entire community, including researchers, developers, builders of machine learning frameworks, cloud service providers, hardware manufacturers, application providers, and end users. Historical Inspiration. We are motivated in part by the System Performance Evaluation Cooperative (SPEC) benchmark for general-purpose computing that drove rapid, …

Anna: A Crazy Fast, Super-Scalable, Flexibly Consistent KVS 🗺

Joe Hellerstein blog, Database Systems, Distributed Systems, Real-Time, Systems, Uncategorized

This article is cross-posted from the DataBeta blog. There’s fast and there’s fast. This post is about Anna, a key/value database design from our team at Berkeley that’s got phenomenal speed and buttery smooth scaling, with an unprecedented range of consistency guarantees. Details are in our upcoming ICDE18 paper on Anna. Conventional wisdom (or at least Jeff Dean wisdom) says that you have to redesign your system every time you scale by 10x. As researchers, we asked the counter-cultural question: what would it take to build a key-value store that would excel across many orders of magnitude of scale, from a single multicore box to the global cloud? Turns out this kind of curiosity can lead to a system with pretty interesting practical …
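The excerpt doesn't explain the mechanism, but the Anna paper derives its consistency levels from lattices: values whose merge function is associative, commutative, and idempotent, so replicas converge without coordination. Here is a toy last-writer-wins register of our own (not Anna's code) to illustrate the merge idea:

```scala
// Toy illustration of a lattice merge: replicas can apply updates in any
// order and still converge, because merge is order-insensitive.
final case class LwwRegister[V](timestamp: Long, value: V) {
  def merge(other: LwwRegister[V]): LwwRegister[V] =
    if (other.timestamp > timestamp) other
    else if (other.timestamp < timestamp) this
    // Deterministic tie-break so merge stays commutative on equal timestamps.
    else if (other.value.hashCode > value.hashCode) other else this
}

object LwwDemo extends App {
  val a = LwwRegister(1L, "draft")
  val b = LwwRegister(2L, "final")
  assert(a.merge(b) == b.merge(a)) // same result regardless of merge order
  println(a.merge(b).value)        // "final"
}
```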

Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark

Shivaram Venkataraman Uncategorized

This work was done in collaboration with Ding Ding and Sergey Ermolin from Intel. In recent years, the scale of datasets and models used in deep learning has increased dramatically. Although larger datasets and models can improve accuracy in many AI applications, they often take much longer to train on a single machine. Yet, unlike in the Big Data world, where distributed processing has long been standard practice, distributing training across large clusters remains uncommon with today's popular deep learning frameworks, both because large GPU clusters are hard to come by and because those frameworks lack convenient facilities for distributed training. By leveraging the cluster distribution capabilities in Apache Spark, BigDL successfully performs very large-scale distributed …
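For context, here is a rough sketch of what baseline data-parallel training on Spark looks like with BigDL's Scala API (BigDL 0.x style, without the Drizzle integration the post covers); the random data, model shape, and hyperparameters are our own placeholders:

```scala
// Sketch: data-parallel synchronous SGD with BigDL on a Spark cluster.
// Toy data and hyperparameters below are placeholders, not from the post.
import com.intel.analytics.bigdl.dataset.Sample
import com.intel.analytics.bigdl.nn.{ClassNLLCriterion, Linear, LogSoftMax, ReLU, Sequential}
import com.intel.analytics.bigdl.numeric.NumericFloat
import com.intel.analytics.bigdl.optim.{Optimizer, SGD, Trigger}
import com.intel.analytics.bigdl.tensor.Tensor
import com.intel.analytics.bigdl.utils.Engine
import org.apache.spark.SparkContext

object BigDLTrainingSketch {
  def main(args: Array[String]): Unit = {
    // Engine.createSparkConf / Engine.init configure BigDL's per-executor threading.
    val sc = new SparkContext(Engine.createSparkConf().setAppName("bigdl-sketch"))
    Engine.init

    // Toy data: random 784-dim features with labels in 1..10 (stand-in for a real dataset).
    val trainSamples = sc.parallelize(1 to 2048).map { _ =>
      val feature = Tensor[Float](784).rand()
      val label = Tensor[Float](1).fill(scala.util.Random.nextInt(10) + 1f)
      Sample(feature, label)
    }

    // A small feed-forward model; real workloads use far larger networks.
    val model = Sequential()
      .add(Linear(784, 128)).add(ReLU())
      .add(Linear(128, 10)).add(LogSoftMax())

    // The Optimizer runs distributed mini-batch SGD over the Spark cluster.
    Optimizer(
      model = model,
      sampleRDD = trainSamples,
      criterion = ClassNLLCriterion(),
      batchSize = 1024 // global batch size, split across partitions
    ).setOptimMethod(new SGD[Float](learningRate = 0.01))
      .setEndWhen(Trigger.maxEpoch(5))
      .optimize()

    sc.stop()
  }
}
```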

RISELab and the 5G Innovators Initiative (5GI2)

Randy Katz News, Uncategorized

5G, also known as Fifth Generation Mobile Networks, is an emerging global telecommunication system designed for the next generation of significantly higher wireless data bandwidths to support a variety of consumer, commercial, and industrial applications. It promises data rates of 10-100 Mbps for tens of thousands of simultaneous users in a metropolitan area, 1 Gbps indoors, and connectivity for hundreds of thousands of simultaneously connected sensors. Just as important as these enhanced bandwidths will be the software extensibility and configurability of the 5G network, making it possible to partition and customize network bandwidth and services for a variety of site- and area-specific applications to support diverse devices at the network edge. RISELab and our industrial sponsors Ericsson, Intel, and …