Opaque: Secure Apache Spark SQL

Wenting Zheng blog, Security, Systems

As enterprises move to cloud-based analytics, the risk of cloud security breaches poses a serious threat. Encrypting data at rest and in transit is a major first step. However, data must still be decrypted in memory for processing, exposing it to any attacker who can observe memory contents. Protecting data during processing is a challenging problem because secure systems usually have to trade off performance against functionality. Cryptographic approaches like fully homomorphic encryption provide full functionality, but are extremely slow. Systems like CryptDB use lighter cryptographic primitives to provide a practical database, but are limited in functionality.

Recent developments in trusted hardware enclaves (such as Intel SGX) provide a much-needed alternative. These enclaves offer hardware-enforced shielded execution that allows arbitrary computation on encrypted data.

We designed and implemented Opaque, a package for Apache Spark SQL that uses Intel SGX to enable very strong security for SQL queries. With SGX, we can achieve memory-level data encryption and authentication, so that even an attacker who has root access never sees decrypted data. Opaque also provides an additional execution mode called oblivious mode. In this mode, Opaque uses special algorithms to hide data access patterns, which prevents a sophisticated side-channel attack known as an access pattern attack. Opaque achieves these guarantees by introducing new oblivious distributed relational operators that provide a 2000x performance gain over state-of-the-art oblivious systems, as well as novel query planning techniques for these operators, implemented using Catalyst.
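To give a feel for what "hiding access patterns" means, here is a minimal sketch in Python. It is not Opaque's implementation and is not taken from the paper. It uses a bitonic sorting network, a textbook sort whose sequence of compared positions depends only on the input length, as a stand-in for an oblivious operator. The function name `bitonic_sort` and the recorded `trace` are illustrative assumptions. The sketch models access patterns only at the level of list indices and does not address lower-level timing or cache effects.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two.

    Returns (sorted_list, trace), where trace is the ordered list of
    index pairs that were compared. Illustrative sketch only.
    """
    data = list(values)
    n = len(data)
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")

    trace = []
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared positions
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # The pair (i, partner) is fixed by n alone, never by the data.
                    trace.append((i, partner))
                    lo = min(data[i], data[partner])
                    hi = max(data[i], data[partner])
                    # Both slots are always written, whether or not a swap is needed.
                    if ascending:
                        data[i], data[partner] = lo, hi
                    else:
                        data[i], data[partner] = hi, lo
            j //= 2
        k *= 2
    return data, trace


# Two different inputs of the same length produce the same access trace,
# so an observer watching which positions are touched learns nothing
# about the values themselves.
sorted_a, trace_a = bitonic_sort([5, 3, 8, 1])
sorted_b, trace_b = bitonic_sort([1, 2, 3, 4])
assert trace_a == trace_b
```

A conventional sort such as quicksort would touch different positions depending on the data, which is exactly the kind of leakage an access pattern attack exploits.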

You can read our NSDI paper here.