Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation

Yinwei Dai, Rui Pan, Swapnil Gandhi, Ravi Netravali
Published at ACM SOSP 2024

Abstract

Machine learning inference platforms continue to face high request rates and strict latency constraints. Existing solutions largely focus on compressing models to substantially lower compute costs (and time) with mild accuracy degradations. This paper explores an alternate (but complementary) technique that trades off accuracy and resource costs on a per-input granularity: early exit models, which selectively allow certain inputs to exit a model from an intermediate layer. Though intuitive, early exits face fundamental deployment challenges, largely owing to the effects that exiting inputs have on batch size (and resource utilization) throughout model execution. We present E3, the first system that makes early exit models practical for realistic inference deployments. Our key insight is to split and replicate blocks of layers in models in a manner that maintains a constant batch size throughout execution, all the while accounting for resource requirements and communication overheads. Evaluations with NLP and vision models show that E3 can deliver up to 1.74× improvement in goodput (for a fixed cost) or 1.78× reduction in cost (for a fixed goodput). Additionally, E3's goodput wins generalize to autoregressive LLMs (2.8-3.8×) and compressed models (1.67×).
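To make the per-input exiting concrete, below is a minimal PyTorch-style sketch of a generic early exit model. This is not E3's implementation; the `EarlyExitModel` class, the `threshold` parameter, and the exit-head structure are all hypothetical illustrations. Each backbone block is followed by a lightweight exit head, and inputs whose prediction confidence clears the threshold leave the model early. Note how the batch shrinks mid-execution as inputs exit, which is precisely the resource-utilization problem the abstract describes.

```python
import torch
import torch.nn as nn


class EarlyExitModel(nn.Module):
    """Backbone split into blocks, each followed by a small exit head.

    Illustrative sketch only; names and structure are assumptions,
    not E3's actual design.
    """

    def __init__(self, blocks, exit_heads, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)          # backbone layer groups
        self.exit_heads = nn.ModuleList(exit_heads)  # one classifier per block
        self.threshold = threshold                   # confidence needed to exit

    @torch.no_grad()
    def forward(self, x):
        n = x.size(0)
        preds = torch.full((n,), -1, dtype=torch.long, device=x.device)
        active = torch.arange(n, device=x.device)  # original row indices still in flight
        for i, (block, head) in enumerate(zip(self.blocks, self.exit_heads)):
            x = block(x)
            conf, label = head(x).softmax(dim=-1).max(dim=-1)
            exiting = conf >= self.threshold
            if i == len(self.blocks) - 1:
                exiting = torch.ones_like(exiting)  # final head: everyone exits
            preds[active[exiting]] = label[exiting]
            # The surviving batch shrinks here; on a GPU this strands
            # compute, which is the utilization problem that motivates
            # E3's constant-batch-size split-and-replicate placement.
            active, x = active[~exiting], x[~exiting]
            if active.numel() == 0:
                break
        return preds


# Toy usage: four linear blocks with a 10-way exit head after each.
blocks = [nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(4)]
heads = [nn.Linear(64, 10) for _ in range(4)]
model = EarlyExitModel(blocks, heads, threshold=0.9)
print(model(torch.randn(32, 64)).shape)  # torch.Size([32])
```

In a naive deployment like this sketch, a batch that starts at 32 inputs may reach the last block with only a handful remaining, leaving later-stage replicas underutilized; E3's contribution is arranging block splits and replicas so each stage sees a steady batch size instead.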

Materials