Alexey Tumanov

I am a postdoc at the University of California, Berkeley, working with Ion Stoica. I completed my PhD at Carnegie Mellon, advised by Greg Ganger and collaborating closely with Mor Harchol-Balter and Onur Mutlu. My work at CMU was partially funded by the NSERC CGS-D3 Fellowship as well as the Intel Science and Technology Center for Cloud Computing and the Parallel Data Lab. Prior to Carnegie Mellon, I worked on agile stateful VM replication with para-virtualization at the University of Toronto. My interest in cloud computing brought me to UofT from industry, where I had worked on cluster middleware for distributed datacenter resource management. My most recent research focused on the modeling, design, and development of abstractions, primitives, algorithms, and systems artifacts for a general resource management framework supporting static and dynamic heterogeneity, hard and soft placement constraints, time-varying resource capacity guarantees, and combinatorial constraints in heterogeneous datacenters, in the context of defining the next-generation datacenter operating system stack.
http://www.cs.berkeley.edu/~atumanov

Blog Posts

Ray: Application-level scheduling with custom resources

Alexey Tumanov blog, Distributed Systems, Ray

Ray intends to be a universal framework for a wide range of machine learning applications, including distributed training, machine learning inference, data processing, latency-sensitive applications, and throughput-oriented applications. Each of these applications has different, and at times conflicting, requirements for resource management. Ray intends to cater to all of them as the newly emerging microkernel for distributed machine learning. To achieve that kind of generality, Ray gives developers explicit control over task and actor placement through custom resources. In this blog post we discuss use cases and provide examples. This article is intended for readers already familiar with…
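As a brief illustration of the mechanism the post describes, the following is a minimal sketch of declaring a custom resource at startup and requesting it from a task; the resource name "special_hardware" and the quantities are illustrative placeholders, not examples taken from the post.

    import ray

    # Declare a custom resource on the local node at startup. In a cluster,
    # each node advertises its own custom resources in the same way
    # (e.g., via `ray start --resources='{"special_hardware": 2}'`).
    ray.init(num_cpus=4, resources={"special_hardware": 2})

    # A task that requests one unit of the custom resource is only scheduled
    # on nodes that advertise it, giving the application explicit control
    # over placement.
    @ray.remote(resources={"special_hardware": 1})
    def constrained_task():
        return "ran on a node advertising special_hardware"

    print(ray.get(constrained_task.remote()))

The same resources argument also applies to actors, so long-lived stateful workers can be pinned to specially provisioned nodes in the same declarative way.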

Reinforcement Learning brings together RISELab and Berkeley DeepDrive for a joint mini-retreat

Alexey Tumanov blog, Deep Learning, Reinforcement Learning, Systems

On May 2, RISELab and the Berkeley DeepDrive (BDD) lab held a joint, largely student-driven mini-retreat. The event was aimed at exploring research opportunities at the intersection of the BDD and RISE labs. The topical focus of the mini-retreat was emerging AI applications, such as Reinforcement Learning (RL), and computer systems to support such applications. Trevor Darrell kicked off the event with an introduction to the Berkeley DeepDrive lab, followed by Ion Stoica’s overview of RISE. The event offered a great opportunity for researchers from both labs to exchange ideas about their ongoing research activity and discover points of collaboration. Philipp Moritz started the first student talk session with an update on Ray — a distributed execution framework for emerging…

Declarative Heterogeneity Handling for Datacenter and ML Resources

Alexey Tumanov blog, Systems

Challenge: Heterogeneity in datacenter resources has become a fact of life. We identify and categorize a number of different types of heterogeneity. When talking about heterogeneity, we generally refer to static or dynamic attributes associated with individual resources. Previously, the levels of heterogeneity were fairly benign and limited to a few different processor architectures. Now, however, it has become common to deploy hardware accelerators (e.g., Tesla K40/K80, Google TPU, Intel Xeon Phi) and even FPGAs (e.g., the Microsoft Catapult project). Nodes themselves are connected with heterogeneous interconnects, oftentimes with more than one interconnect option available (e.g., a 40 Gbps Ethernet backbone, InfiniBand, an FPGA torus topology). The workloads we consolidate on top of this diverse hardware differ vastly in their success metrics (completion…
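To make the declarative idea concrete, here is a small hypothetical sketch using Ray custom resources as one possible mechanism for expressing static hardware attributes such as accelerator type per node and requesting them per task; the node configuration and the resource names gpu_k80 and fpga are illustrative assumptions, not drawn from the post.

    import ray

    # Illustrative single-node setup. On a real cluster, each node would
    # advertise its own static attributes, e.g. K80 machines started with
    # `ray start --resources='{"gpu_k80": 8}'` and FPGA machines with
    # `ray start --resources='{"fpga": 1}'`.
    ray.init(resources={"gpu_k80": 8, "fpga": 1})

    # A throughput-oriented training task constrained to K80 nodes.
    @ray.remote(resources={"gpu_k80": 1})
    def train_shard(shard_id):
        return f"shard {shard_id} trained on a K80 node"

    # A latency-sensitive task constrained to nodes with an FPGA.
    @ray.remote(resources={"fpga": 1})
    def fpga_inference(x):
        return x

    print(ray.get([train_shard.remote(i) for i in range(2)]))
    print(ray.get(fpga_inference.remote(42)))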