Dissertation Talk: Machine Learning for Query Optimization by Zongheng Yang
July 27, 2022
Speaker: Zongheng Yang
Advisor: Ion Stoica
In the past two decades, data has been growing at an ever-increasing rate, and systems that process data to answer queries have attracted significant attention. Crucial to the performance of data systems is the query optimizer, which translates declarative queries (e.g., SQL) into efficient execution plans. However, the optimization task is highly complex, leading to two key challenges. First, optimizers use a myriad of hand-designed heuristics to tame the complexity, but heuristics leave performance on the table. Second, optimizers are costly to develop: human experts may spend months writing a first version and years refining it.
In this dissertation, I apply machine learning to tame the complexity in query optimization. First, I present Naru and NeuroCard, two learned cardinality estimators based on self-supervised learning that remove long-standing heuristics used in modeling tables. By removing heuristics, Naru and NeuroCard improve the accuracy of cardinality estimation by orders of magnitude over the prior state of the art. Second, I present Balsa, a deep reinforcement learning agent that learns to optimize SQL queries by trial and error. With a few hours of learning, Balsa can outperform the optimizers of PostgreSQL (one of the most popular database systems) and a commercial engine. Together, these projects improve existing query optimizers while opening the possibility of reducing optimization complexity in future environments and engines.
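To make the cost of a table-modeling heuristic concrete, the following is a minimal sketch, not the method used by Naru or NeuroCard. It assumes the heuristic in question is attribute independence (the abstract does not name specific heuristics), and it uses a hypothetical two-column toy table. It compares the independence estimate against an estimate built from the chain rule P(city) * P(country | city), with exact counts standing in for a learned model. All names (ROWS, independence_estimate, chain_rule_estimate) are illustrative.

```python
from collections import Counter

# Hypothetical toy table of (city, country) rows. The two columns are
# strongly correlated: each city belongs to exactly one country.
ROWS = (
    [("Paris", "France")] * 40
    + [("Lyon", "France")] * 10
    + [("Berlin", "Germany")] * 30
    + [("Munich", "Germany")] * 20
)


def true_cardinality(rows, city, country):
    """Exact number of rows matching city = ... AND country = ..."""
    return sum(1 for c, k in rows if c == city and k == country)


def independence_estimate(rows, city, country):
    """Estimate under the attribute-independence assumption:
    selectivities of the two predicates are simply multiplied."""
    n = len(rows)
    sel_city = sum(1 for c, _ in rows if c == city) / n
    sel_country = sum(1 for _, k in rows if k == country) / n
    return n * sel_city * sel_country


def chain_rule_estimate(rows, city, country):
    """Estimate from the factorization P(city) * P(country | city).
    Exact counts stand in for a learned conditional model (assumption)."""
    n = len(rows)
    city_counts = Counter(c for c, _ in rows)
    pair_counts = Counter(rows)
    if city_counts[city] == 0:
        return 0.0
    p_city = city_counts[city] / n
    p_country_given_city = pair_counts[(city, country)] / city_counts[city]
    return n * p_city * p_country_given_city


if __name__ == "__main__":
    for query in [("Paris", "France"), ("Paris", "Germany")]:
        print(
            query,
            "true:", true_cardinality(ROWS, *query),
            "independence:", independence_estimate(ROWS, *query),
            "chain rule:", chain_rule_estimate(ROWS, *query),
        )
```

On this toy table, the independence estimate for city = 'Paris' AND country = 'France' is 20 rows against a true count of 40, and it is also 20 rows for the impossible combination city = 'Paris' AND country = 'Germany', whose true count is 0. The chain-rule estimate matches the true count in both cases because it conditions on the first column. This only illustrates why modeling the joint distribution can help; how the learned estimators actually represent that distribution is beyond what the abstract states.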