Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation
Published at ACM SOSP, 2024

Abstract
Machine learning inference platforms continue to face high
request rates and strict latency constraints. Existing solutions
largely focus on compressing models to substantially lower
compute costs (and time) with mild accuracy degradations.
This paper explores an alternate (but complementary) technique that trades off accuracy and resource costs on a per-input granularity: early exit models, which selectively allow
certain inputs to exit a model from an intermediate layer.
Though intuitive, early exits face fundamental deployment
challenges, largely owing to the effects that exiting inputs
have on batch size (and resource utilization) throughout model
execution. We present 𝐸3, the first system that makes early
exit models practical for realistic inference deployments. Our
key insight is to split and replicate blocks of layers in models
in a manner that maintains a constant batch size throughout
execution, all the while accounting for resource requirements
and communication overheads. Evaluations with NLP and
vision models show that E3 can deliver up to 1.74× improvement in goodput (for a fixed cost) or 1.78× reduction in cost
(for a fixed goodput). Additionally, E3's goodput wins generalize to autoregressive LLMs (2.8-3.8×) and compressed
models (1.67×).
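
To make the early-exit mechanism concrete, the sketch below shows a hypothetical per-input exit loop. It is illustrative only and not E3's implementation or the paper's architecture; the class name EarlyExitNet, the linear blocks, and the fixed confidence threshold are all assumptions. It also makes visible why the effective batch shrinks after every exit point, which is the batch-size and utilization effect the abstract refers to.

```python
# Minimal, hypothetical early-exit sketch (assumed names; not E3's code).
# Each block is followed by an "exit head"; inputs whose confidence clears
# the threshold leave the model early, so the surviving batch shrinks.
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    def __init__(self, dim=128, num_classes=10, num_blocks=4, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_blocks)]
        )
        # One classifier ("exit head") after every block.
        self.exits = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(num_blocks)]
        )
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        preds = torch.full((x.shape[0],), -1, dtype=torch.long)
        alive = torch.arange(x.shape[0])  # indices of inputs still in flight
        for i, (block, head) in enumerate(zip(self.blocks, self.exits)):
            x = block(x)
            conf, cls = head(x).softmax(dim=-1).max(dim=-1)
            done = conf >= self.threshold
            if i == len(self.blocks) - 1:
                done = torch.ones_like(done)  # everyone exits at the final block
            preds[alive[done]] = cls[done]
            # Drop finished inputs: the effective batch size shrinks here,
            # leaving later blocks under-utilized on a fixed-size accelerator.
            alive, x = alive[~done], x[~done]
            if alive.numel() == 0:
                break
        return preds

# Example usage with random (untrained) weights; with a trained model,
# confident inputs would exit at earlier blocks.
model = EarlyExitNet()
print(model(torch.randn(32, 128)))
```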