Hound: Causal Learning for Datacenter-scale Straggler Diagnosis – Dr. Benjamin Lee, Duke University


Stragglers are exceptionally slow tasks within a job that delay its completion. Stragglers, which are uncommon within a single job, are pervasive in datacenters with many jobs. We present Hound, a framework that models task latency using system conditions to reveal causes of stragglers for each job. Hound discovers recurring causes across jobs by constructing topic models that treat jobs’ latency models as documents and their fitted parameters as words. Hound provides datacenter-scale diagnosis, interpretable models, unbiased inference, and computational efficiency. We demonstrate Hound’s capabilities for a production trace from Google’s warehouse-scale datacenters.
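As a rough illustration of the "latency models as documents, fitted parameters as words" idea, the sketch below turns each job into a small bag of words. It is not Hound's implementation. The metric names (`cpu_util`, `disk_io`), the toy data, the 0.8 threshold, and the use of a plain Pearson correlation in place of a fitted latency model are all assumptions made for illustration, and a simple word count across jobs stands in for the topic model.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def job_to_document(tasks, threshold=0.8):
    """Summarize one job as a list of 'words'.

    Each word is a metric name plus the sign of its association with task
    latency, kept only when the association is strong (hypothetical rule).
    """
    latencies = [t["latency"] for t in tasks]
    metrics = sorted(k for k in tasks[0] if k != "latency")
    words = []
    for m in metrics:
        r = pearson([t[m] for t in tasks], latencies)
        if abs(r) >= threshold:
            words.append(f"{m}{'+' if r > 0 else '-'}")
    return words


# Toy per-task measurements for two jobs (made-up numbers).
jobs = {
    "job_a": [
        {"latency": 10, "cpu_util": 0.2, "disk_io": 5},
        {"latency": 12, "cpu_util": 0.3, "disk_io": 5},
        {"latency": 30, "cpu_util": 0.9, "disk_io": 5},
    ],
    "job_b": [
        {"latency": 8, "cpu_util": 0.1, "disk_io": 1},
        {"latency": 9, "cpu_util": 0.2, "disk_io": 9},
        {"latency": 25, "cpu_util": 0.8, "disk_io": 2},
    ],
}

# One "document" per job, then count which words recur across jobs.
documents = {name: job_to_document(tasks) for name, tasks in jobs.items()}
word_counts = Counter(w for doc in documents.values() for w in doc)
print(documents)
print(word_counts.most_common())
```

In this toy data, the slow task in each job coincides with high CPU utilization, so the same word appears in both documents. A topic model over many such documents would group words that tend to occur together into recurring causes.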

Benjamin Lee is an Associate Professor of Electrical and Computer Engineering at Duke University.

This talk was presented at the Arm Research Summit, 17-19 September 2018. Summit 2019 will be taking place in Austin, TX. Visit arm.com/summit for more details!