FLEX: Fast, Accurate DNN Inference on Low-Cost Edges Using Heterogeneous Accelerator Execution
Published at ACM EuroSys 2025

Abstract
Significant breakthroughs in machine learning (ML) and the advantages of on-device processing have led edge devices to increasingly incorporate accelerators such as GPUs, NPUs, and DSPs. However, these accelerators consume energy, prompting limits on their floating-point precision. Many edge-device users are in regions where including high-fidelity accelerators is too costly, so low-cost devices ship with low-precision accelerators that sacrifice accuracy. Previous work predetermines layer assignments between the CPU and the accelerator offline, targeting high accuracy and low latency without considering the input; we observe, however, that the input affects the optimal layer assignment. To address this, we present Flex, a system for Fast, Accurate DNN Inference on Low-Cost Edges using Heterogeneous Accelerator eXecution. Leveraging observations common to models across a range of edge devices, Flex uses a lightweight heuristic and reinforcement learning (RL) to dynamically assign layers to the CPU and the accelerator. Experiments show that Flex improves average inference time by up to 39%, accuracy by up to 22%, and energy consumption by up to 61% compared to state-of-the-art methods, and falls only 4.2% short of the best achievable results.
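As a rough illustration of the idea behind input-dependent layer assignment (not Flex's actual algorithm, which combines a lightweight heuristic with RL), the following minimal Python sketch greedily moves the layers most sensitive to low precision onto the full-precision CPU while staying within a latency budget. All names, fields, and numbers here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LayerProfile:
    cpu_ms: float        # measured CPU (full-precision) latency
    acc_ms: float        # measured accelerator (low-precision) latency
    sensitivity: float   # estimated per-input accuracy loss in low precision

def assign_layers(profiles, latency_budget_ms):
    """Greedy sketch: start with everything on the accelerator, then move
    the layers most sensitive to low precision back to the CPU while the
    plan still fits within the latency budget."""
    plan = ["ACC"] * len(profiles)
    total_ms = sum(p.acc_ms for p in profiles)
    # Consider the most precision-sensitive layers first.
    for i in sorted(range(len(profiles)),
                    key=lambda i: profiles[i].sensitivity, reverse=True):
        extra = profiles[i].cpu_ms - profiles[i].acc_ms
        if total_ms + extra <= latency_budget_ms:
            plan[i] = "CPU"
            total_ms += extra
    return plan, total_ms

# Hypothetical example: three layers, 20 ms budget.
profiles = [LayerProfile(8.0, 2.0, 0.9),
            LayerProfile(6.0, 3.0, 0.1),
            LayerProfile(9.0, 4.0, 0.5)]
plan, total = assign_layers(profiles, latency_budget_ms=20.0)
print(plan, total)   # ['CPU', 'ACC', 'CPU'] 20.0
```

Because the sensitivity estimates vary per input, the resulting plan differs from input to input, which is the gap the paper identifies in offline, input-agnostic assignment schemes.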