Devin Petersohn, Author at RISE Lab

Devin Petersohn

http://devinpetersohn.com

Publications

Towards Scalable Dataframe Systems

Blog Posts

So you want to build an open source tool/library as a grad student

Devin Petersohn | August 12, 2021 | blog

This is a collection of experiences and recommendations for building an open source community as a grad student.

We don’t need Data Engineers, we need better tools for Data Scientists

Devin Petersohn | April 7, 2021 | blog

In most companies, Data Engineers support the Data Scientists in various ways. Often this means translating or productionizing the notebooks and scripts that a Data Scientist has written. A large portion of the Data Engineer’s role could be replaced with better tooling for Data Scientists, freeing Data Engineers to do more impactful (and scalable) work.

The Modin view of Scaling Pandas

Devin Petersohn | July 7, 2020 | blog, Database Systems, Distributed Systems, Modin

Scaling Interactive Pandas Workflows with Modin – Talk at PyData NYC 2018

Devin Petersohn | February 2, 2019 | blog, Modin

In this talk, we will present Modin, a middle layer for DataFrames and interactive data science. Modin, formerly Pandas on Ray, is a library that lets users speed up their Pandas workflows by changing a single line of code. During the presentation, we will discuss interesting ways Modin is being used and show how we improve the performance of the most popular Pandas operations. Modin is an early-stage project at UC Berkeley’s RISELab designed to facilitate the use of distributed computing for Data Science. A common challenge with tools for large-scale data is their significant learning overhead. Modin is designed to expose a set of familiar APIs (Pandas, SQL, etc.) and internally…
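For readers who have not seen it, the "single line of code" mentioned in the abstract is the import statement. The snippet below is a minimal sketch, not taken from the talk: the sample data and the names `df` and `totals` are made up for illustration, and the fallback to plain pandas is an assumption added so the example still runs where Modin is not installed.

```python
# The one-line change: import Modin's pandas-compatible module under the
# usual `pd` alias instead of importing pandas itself.
try:
    import modin.pandas as pd
except ImportError:
    # Assumption for this sketch only: fall back to plain pandas if Modin
    # is not available, so the rest of the code runs either way.
    import pandas as pd

# Everything below is ordinary pandas-style code (illustrative data).
df = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# A typical groupby aggregation; the calling code is the same with either import.
totals = df.groupby("group")["value"].sum()
print(totals)
```

The rest of the workflow is left untouched, which is what is meant by exposing a familiar API.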