Dissertation Talk: Machine Learning for Resource Management in the Datacenter and the Cloud
May 10, 2018
Title: Machine Learning for Resource Management in the Datacenter and the Cloud
Speaker: Neeraja J. Yadwadkar
Advisors: Randy Katz and Joseph Gonzalez
Date: Thursday, May 10th, 2018
Time: 1-2pm
Location: 465H Soda Hall
Abstract:
Traditional resource management techniques that rely on simple heuristics often fail to achieve predictable
performance in contemporary complex systems that span physical servers, virtual servers, private and/or
public clouds. My research aims to bring the benefits of data-driven models to resource management of such
complex systems. In my dissertation, I argue that the advancements in machine learning can be leveraged
to manage and optimize today’s systems by deriving actionable insights from the performance and utilization
data these systems generate. To realize this vision of model-based resource management, we need to deal
with the key challenges data-driven models raise: uncertainty in predictions, cost of training, and cost of updating the models.
In this talk, I will discuss these broad themes in the context of two problems: scheduling jobs on a cluster and
virtual machine (VM) selection in the public cloud. I will begin by presenting Wrangler, a system that predicts when
stragglers (slow-running tasks) are going to occur based on cluster resource utilization counters and makes scheduling
decisions to avoid such situations. Wrangler attaches a confidence measure to these predictions
to overcome modeling uncertainty. I will then describe our Multi-Task Learning formulations that share information
between the various models, allowing us to significantly reduce the cost of training. Finally, I will present the
highlights of our work on the PARIS system that enables cloud users to select the best VM
for their applications in public cloud environments.