MLPerf: SPEC for ML

David Patterson

The RISE Lab at UC Berkeley today joins Baidu, Google, Harvard University, and Stanford University to announce a new benchmark suite for machine learning called MLPerf at the O’Reilly AI conference in New York City (see

The MLPerf effort aims to build a common set of benchmarks that enables the machine learning (ML) field to measure system performance, eventually for both training and inference, on platforms ranging from mobile devices to cloud services. We believe that a widely accepted benchmark suite will benefit the entire community, including researchers, developers, builders of machine learning frameworks, cloud service providers, hardware manufacturers, application providers, and end users.

Historical Inspiration. We are motivated in part by the System Performance Evaluation Cooperative (SPEC) benchmark for general-purpose computing that drove rapid, measurable performance improvements for decades starting in the 1980s.

Goals. Learning from the 40-year history of benchmarks, MLPerf has these primary goals:

  • Accelerate progress in ML via fair and useful measurement
  • Serve both the commercial and research communities
  • Enable fair comparison of competing systems yet encourage innovation to improve the state-of-the-art of ML
  • Enforce replicability to ensure reliable results
  • Keep benchmarking effort affordable so all can participate

General Approach. Our approach is to select a set of ML problems, each defined by a dataset and quality target, then measure the wall clock time to train a model for each problem.

Broad and Representative Problems. Like the Fathom Benchmark, the MLPerf suite aims to reflect different areas of ML that are important to the commercial and research communities and where open datasets and models exist. Here is our current list of problems:

  • Image classification
  • Object detection
  • Speech to text
  • Translation
  • Recommendation
  • Sentiment analysis
  • Reinforcement learning

Both Closed and Open Model Divisions. Balancing fairness and innovation is a difficult challenge for all benchmarks. Inspired by the Sort benchmarks, we take a two-pronged approach:

  1. The MLPerf Closed Model Division specifies the model to be used and restricts the values of hyperparameters, e.g., batch size and learning rate, with the emphasis being on fair comparisons of the hardware and software systems. (The Sort equivalent is called “Daytona,” alluding to the stock cars at the Daytona 500 mile race.)
  2. In the MLPerf Open Model Division, competitors must solve the same problem using the same dataset but with fewer restrictions, with the emphasis being on advancing the state-of-the-art of ML. (The Sort equivalent is called “Indy,” alluding to the even faster custom race cars designed for events like the Indianapolis 500.)

Ideally, the advances developed in the Open category will be incorporated into future generations of the Closed benchmarks.
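
The difference between the two divisions can be read as two levels of rule checking. The sketch below is a hypothetical illustration only, not MLPerf's actual rules or tooling: the class names, the `violations` function, and the example model name, dataset name, and hyperparameter range are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ClosedRules:
    """Hypothetical Closed-division rules for one problem."""
    model: str
    hyperparameter_ranges: Dict[str, Tuple[float, float]]  # name -> (min, max)


@dataclass(frozen=True)
class Submission:
    """Hypothetical description of one benchmark entry."""
    division: str  # "closed" or "open"
    dataset: str
    model: str
    hyperparameters: Dict[str, float]


def violations(sub: Submission, required_dataset: str, rules: ClosedRules) -> List[str]:
    """Return a list of rule violations; an empty list means the entry is acceptable."""
    problems = []
    # Both divisions must solve the same problem on the same dataset.
    if sub.dataset != required_dataset:
        problems.append(f"dataset must be {required_dataset!r}")
    # Only the Closed division fixes the model and restricts hyperparameters.
    if sub.division == "closed":
        if sub.model != rules.model:
            problems.append(f"model must be {rules.model!r}")
        for name, (low, high) in rules.hyperparameter_ranges.items():
            value = sub.hyperparameters.get(name)
            if value is None or not low <= value <= high:
                problems.append(f"{name} must be in [{low}, {high}]")
    return problems


# Made-up rules and entries, for illustration only.
rules = ClosedRules(model="example-model", hyperparameter_ranges={"batch_size": (32, 256)})
closed_entry = Submission("closed", "example-dataset", "custom-model", {"batch_size": 512})
open_entry = Submission("open", "example-dataset", "custom-model", {"batch_size": 512})
print(violations(closed_entry, "example-dataset", rules))
print(violations(open_entry, "example-dataset", rules))
```

The same custom model and batch size fail the Closed check but pass the Open one, which is the intended trade-off: Closed isolates the hardware and software system, while Open leaves room for modeling innovation.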

Metrics. Following the precedent of DAWNBench, the primary MLPerf metric for training is defined as the wall clock time to train a model to a minimum quality (often in hours). Each benchmark has a quality metric defined by its original authors. The philosophy is to back off a bit from the bleeding-edge state-of-the-art quality level to reduce the difficulty and cost of running the benchmark. Given the rapid pace of ML, the quality target will likely rise over the generations of the MLPerf suite.
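
To make the time-to-quality metric concrete, here is a minimal sketch of how such a measurement could be taken. It is not the MLPerf harness: the function name `time_to_quality`, its parameters, and the toy "model" are hypothetical. It also assumes that higher quality scores are better and that evaluation time counts toward the wall clock; a real benchmark would have to specify both choices.

```python
import time
from typing import Callable, Optional


def time_to_quality(
    train_one_epoch: Callable[[], None],
    evaluate: Callable[[], float],
    quality_target: float,
    max_epochs: int = 100,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return wall-clock seconds until evaluate() first reaches quality_target.

    Returns None if the target is not reached within max_epochs.
    """
    start = clock()
    for _ in range(max_epochs):
        train_one_epoch()
        # Assumption: higher is better (e.g., accuracy), and the time spent
        # evaluating is included in the measured wall-clock time.
        if evaluate() >= quality_target:
            return clock() - start
    return None


# Toy stand-in for a real training job: quality improves by a fixed step per epoch.
state = {"quality": 0.0}


def toy_epoch() -> None:
    state["quality"] += 0.25


elapsed = time_to_quality(toy_epoch, lambda: state["quality"], quality_target=0.75)
print(f"reached target after {elapsed:.6f} s")
```

Passing the clock as a parameter is a design choice made here so the measurement can be tested with a fake clock; it is not something the benchmark prescribes.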

SPEC also records the power consumed by systems on most of its benchmarks, allowing comparisons of performance per watt, which can be helpful in determining the operational cost of systems. 

The ML community is a heavy user of the commercial cloud, which does not report power. MLPerf will show cost for cloud-based submissions, as does DAWNBench. MLPerf will show power for on-premises or mobile systems.

Organization. While the initial proposal for a benchmarking suite comes from a collaboration of five groups (three academic and two industry) with experience in benchmarking ML systems, we hope and plan for this effort to grow into a much wider collaboration that spans many academic groups, companies, and other organizations. Indeed, six more organizations have already joined in the announcement of MLPerf today: AMD, Intel, SambaNova, University of Minnesota, University of Toronto, and Wave Computing. We expect this list to expand significantly in the near future.

