FLEX: Fast, Accurate DNN Inference on Low-Cost Edges Using Heterogeneous Accelerator Execution

Tanmoy Sen
Haiying Shen
Published at ACM EuroSys 2025

Abstract

Significant breakthroughs in machine learning (ML) and the advantages of on-device processing have led to edge devices increasingly incorporating accelerators like GPUs, NPUs, and DSPs. However, these accelerators consume energy, prompting users to limit their floating-point precision. Many edge device users are in regions where including high-fidelity accelerators is too costly, leading to low-cost devices with low-precision accelerators that sacrifice accuracy. Previous work predetermined layer assignments between the CPU and accelerator offline for high accuracy and low latency without considering the input, but we observe that the input affects the optimal layer assignment. To address this, we present Flex, a system for Fast, Accurate DNN Inference on Low-Cost Edges using Heterogeneous Accelerator eXecution. Leveraging common observations from models on various edge devices, Flex uses a lightweight heuristic and reinforcement learning (RL) to dynamically assign layers across the CPU and accelerator. Experiments show that Flex reduces average inference time by up to 39%, improves accuracy by up to 22%, and cuts energy consumption by up to 61% compared to state-of-the-art methods, while remaining within 4.2% of the best achievable results.
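
The abstract's core mechanism is input-aware layer placement: for each input, decide which layers run on the low-precision accelerator and which stay on the CPU. The sketch below is a hypothetical illustration of one such decision rule, a greedy heuristic under an accuracy budget; the names (Layer, assign_layers, max_acc_drop) and the per-layer cost numbers are assumptions for exposition, not Flex's actual heuristic or its RL component.

    from dataclasses import dataclass

    @dataclass
    class Layer:
        name: str
        cpu_ms: float    # estimated CPU latency for this input
        acc_ms: float    # estimated accelerator latency for this input
        acc_drop: float  # estimated accuracy loss from low-precision execution

    def assign_layers(layers, max_acc_drop=0.02):
        """Greedy sketch: offload a layer to the accelerator when it is
        faster and the cumulative accuracy penalty stays within budget."""
        plan, drop = [], 0.0
        for layer in layers:
            offload = (layer.acc_ms < layer.cpu_ms
                       and drop + layer.acc_drop <= max_acc_drop)
            if offload:
                drop += layer.acc_drop
            plan.append((layer.name, "accelerator" if offload else "cpu"))
        return plan

    if __name__ == "__main__":
        # Illustrative per-input estimates; in Flex these would come from
        # profiling and the learned policy, not fixed constants.
        layers = [
            Layer("conv1", cpu_ms=4.0, acc_ms=1.2, acc_drop=0.004),
            Layer("conv2", cpu_ms=6.5, acc_ms=1.8, acc_drop=0.011),
            Layer("fc",    cpu_ms=1.1, acc_ms=0.9, acc_drop=0.009),
        ]
        print(assign_layers(layers))

Because the estimates depend on the input, the same network can yield different CPU/accelerator splits from one input to the next, which is the observation that motivates Flex's dynamic assignment.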