ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Boban Zarkovich

Real-time predictive applications can demand continuous and agile development, with new models constantly being trained, tested, and then deployed. Training and testing are done by replaying stored event logs, running new models in the context of historical data in a form of backtesting or "what if?" analysis. To replay weeks or months of logs while developers wait, we need systems that can stream event logs through prediction logic many times faster than the real-time rate. A challenge with high-speed replay is preserving sequential semantics while harnessing parallel processing power. The crux of the problem lies with causal dependencies inherent in the sequential semantics of log replay.

We introduce an execution engine that produces serial-equivalent output while accelerating throughput with pipelining and distributed parallelism. This is made possible by optimizing for high throughput rather than the traditional stream processing goal of low latency, and by aggressive sharing of versioned state, a technique we term Multi-Versioned Parallel Streaming (MVPS). In experiments we see that this engine, which we call ReStream, performs as well as batch processing and more than an order of magnitude better than a single-threaded implementation.
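To give a rough sense of why versioned state helps reconcile parallelism with sequential semantics, here is a minimal sketch. It is not ReStream's implementation and makes no claim about how MVPS is actually built. The `VersionedStore` class, the event tuple layout `(timestamp, user, amount)`, and the "total spent before this event" feature are all hypothetical, chosen only for illustration. The sketch assumes unique timestamps and that every write is applied before any read. Under those assumptions, a read "as of" a timestamp returns the same answer regardless of the order in which events were processed, which is the serial-equivalence property the abstract describes.

```python
import bisect
import random
from collections import defaultdict


class VersionedStore:
    """Hypothetical store that keeps every write tagged with its event timestamp."""

    def __init__(self):
        # key -> list of (timestamp, delta), kept sorted by timestamp
        self._versions = defaultdict(list)

    def write(self, key, ts, delta):
        # Writes may arrive in any order; insort keeps each list sorted.
        bisect.insort(self._versions[key], (ts, delta))

    def sum_before(self, key, ts):
        # Sum only the versions strictly older than ts, i.e. the state a
        # serial replay would have seen when it reached this event.
        versions = self._versions.get(key, [])
        i = bisect.bisect_left(versions, (ts,))
        return sum(delta for _, delta in versions[:i])


def serial_replay(events):
    """Reference semantics: process events one at a time in timestamp order."""
    totals, out = {}, {}
    for ts, user, amount in sorted(events):
        out[ts] = totals.get(user, 0)       # feature: spend before this event
        totals[user] = out[ts] + amount
    return out


def versioned_replay(events, seed):
    """Process events in arbitrary order, standing in for parallel workers."""
    rng = random.Random(seed)
    store = VersionedStore()

    writes = list(events)
    rng.shuffle(writes)
    for ts, user, amount in writes:
        store.write(user, ts, amount)

    reads = list(events)
    rng.shuffle(reads)
    return {ts: store.sum_before(user, ts) for ts, user, _ in reads}


EVENTS = [(1, "a", 10), (2, "b", 5), (3, "a", 7), (4, "a", 1), (5, "b", 2)]

if __name__ == "__main__":
    print(serial_replay(EVENTS) == versioned_replay(EVENTS, seed=0))
```

The two-phase structure (all writes, then all reads) is a simplification made for this sketch. A real engine would need to track which versions are complete before serving a read, and that is where pipelining and throughput-oriented scheduling would come in.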

Published On: October 6, 2016

Presented At/In: SoCC '16 - Proceedings of the Seventh ACM Symposium on Cloud Computing

Link: https://tinyurl.com/h9csy93

Authors: Johann Schleier-Smith, Erik T. Krogen, Joe Hellerstein