RISE Seminar 5/3/18: Mehul Shah (Amazon Web Services): AWS Glue: Serverless Data Integration and Beyond

May 3, 2018

Title: AWS Glue: Serverless Data Integration and Beyond

Date: Thursday, May 3rd, 12-1pm

Wozniak Lounge (430 Soda Hall)

Speaker: Mehul Shah

Affiliation: Amazon Web Services


Organizations want to analyze and gain insight from a growing number of new data sources, such as Internet of Things (IoT) streams, APIs, ad impressions, and log data. However, they are often limited by legacy ETL systems that were designed for transactional data. AWS Glue is a serverless data integration service for these modern data types. In this talk, we present the cloud trends that motivate AWS Glue and the popular use cases that drive its adoption. We show how simple it is to go from raw data to production data-cleaning and transformation jobs with AWS Glue. It automatically crawls and catalogs your datasets, auto-generates scripts, lets you interactively explore and iterate using your favorite notebooks, and then pushes jobs into production with the necessary dependencies and schedule. Finally, we describe the underlying data structures and optimizations developed for efficiently manipulating these semi-structured datasets, and how they are used.

Mehul is currently the engineering lead and manager for AWS Glue, a cloud service offered by Amazon.com for serverless data integration. His background is in large-scale data management, distributed systems, and energy-efficient computing. Prior to Amazon, he was co-founder and CEO of Amiato (2011-2014), a managed ETL service in the cloud. From 2004 to 2011, he was a principal scientist at HP Labs. He has published in top-tier database and systems conferences, and his work has won best-paper and test-of-time awards. He earned a PhD from U.C. Berkeley (2004) for scaling the TelegraphCQ data-stream processing system. He earned MEng (1997) and BS (1996) degrees in Computer Science and Physics from MIT. In his spare time, he serves on the Sort Benchmark committee.