In this talk, we will present Modin, a middle layer for DataFrames and interactive data science. Modin, formerly Pandas on Ray, is a library that allows users to speed up their Pandas workflows by changing a single line of code. During the presentation, we will discuss interesting ways Modin is being used, and show how we improve the performance of the most popular Pandas operations.
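As a minimal sketch of the single-line change mentioned above: the only Modin-specific line is the import, and the rest is ordinary Pandas code. The column names and values are made up for illustration, and the fallback to plain Pandas is an assumption added only so the snippet still runs where Modin is not installed.

```python
# The one-line change: import Modin's drop-in module instead of `import pandas as pd`.
# Assumption: Modin is installed with an execution engine such as Ray.
try:
    import modin.pandas as pd
except ImportError:
    # Fallback added for illustration so the sketch runs without Modin.
    import pandas as pd

# Illustrative data; everything below is unchanged Pandas-style code.
df = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# A common Pandas operation: group rows and sum a column per group.
totals = df.groupby("group")["value"].sum()
```

Because the rest of the script is untouched, an existing workflow can be switched back and forth by editing only the import.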
Modin is an early-stage project at UC Berkeley’s RISELab designed to facilitate the use of distributed computing for data science. A common challenge with tools for large-scale data is that they carry significant learning overhead. Modin is designed to expose a set of familiar APIs (Pandas, SQL, etc.) and internally handle all of the data distribution and computation, allowing users to work with a distributed DataFrame without needing to understand partitioning or data shuffling. The goal is to allow data scientists to use the same system for 1 kilobyte as they do for 1 terabyte.
In this talk, we will present a general overview of Modin and go into detail about how Modin can be used in a cloud environment. We will give a live demonstration of the speedup over Pandas and show how Modin scales as more resources become available. We will also introduce Ray, the primary execution engine behind Modin, and discuss the ways Ray helps in the development of this library. Currently, the Pandas API coverage represents more than 90% of use cases based on a case study of Pandas usage, and SQL support is still under development. We are constantly working to improve the performance and API coverage. Modin is completely open source and can be found on GitHub: https://github.com/modin-project/modin