Ray is an open source project for parallel and distributed Python. This article was originally posted here. Parallel and distributed computing are a staple of modern applications. We need to leverage multiple cores or multiple machines to speed up applications or to run them at a large scale. The infrastructure for crawling the web and responding to search queries are not single-threaded programs running on someone’s laptop but rather collections of services that communicate and interact with one another. This post will describe how to use Ray to easily build applications that can scale from your laptop to a large cluster. Why Ray? Many tutorials explain how to use Python’s multiprocessing module. Unfortunately the multiprocessing module is severely limited in…
Scaling Interactive Pandas Workflows with Modin – Talk at PyData NYC 2018
In this talk, we will present Modin, a middle layer for DataFrames and interactive data science. Modin, formerly Pandas on Ray, is a library that allows users to speed up their Pandas workflows by changing a single line of code. During the presentation, we will discuss interesting ways Modin is being used, and show how we improve the performance of the most popular Pandas operations. Modin is an early-stage project at UC Berkeley’s RISELab designed to facilitate the use of distributed computing for Data Science. Often, a challenge encountered when trying to use tools for large-scale data is that there is a significant learning overhead. Modin is designed to expose a set of familiar APIs (Pandas, SQL, etc.) and internally…
Modin (Pandas on Ray) – October 2018
View the code on Gist.