Justin Thaler Leads New NSF Project on Automatically Parallelizing Approximate Data Analysis with Mergeable Summaries

Posted in News Story

Justin Thaler (PI), Jeremy Fineman, and Ophir Frieder were awarded a $614K NSF grant to develop highly efficient, highly accurate, and highly scalable algorithms for approximate query processing.

The key ingredient enabling such efficiency is streaming algorithms that generate mergeable summaries. Streaming algorithms in general compute a small summary of the data, from which it is possible to derive accurate (though approximate) answers to queries. When these summaries are also mergeable, one can process massive data sets in highly scalable ways, by distributing the data across machines, summarizing each partition, and seamlessly combining the results.

The project is tightly entwined with the development of the Data Sketches library, an open source library of production-quality implementations of mergeable summaries that is widely used in industry and government.