A Joint Model Provisioning and Request Dispatch Solution for Low-Latency Inference Services on Edge

Anish Prasad; Carl Mofjeld; Yang Peng

doi:10.3390/s21196594

Sensors (Basel). 2021 Oct 2;21(19):6594. doi: 10.3390/s21196594.

Authors

Anish Prasad¹, Carl Mofjeld¹, Yang Peng¹

Affiliation

¹ Division of Computing and Software Systems, University of Washington Bothell, Bothell, WA 98011, USA.

Abstract

With the advancement of machine learning, a growing number of mobile users rely on machine learning inference to make time-sensitive and safety-critical decisions. The demand for high-quality, low-latency inference services at the network edge has therefore become central to a modern intelligent society. This paper proposes a novel solution that jointly provisions machine learning models and dispatches inference requests to reduce inference latency on edge nodes. Existing solutions either direct inference requests to the nearest edge node to save network latency or balance the edge nodes' workload to reduce queuing and computing time. In contrast, the proposed solution provisions each edge node with the optimal number and type of inference instances under a holistic consideration of networking, computing, and memory resources. Mobile users can thus be directed to the edge nodes that offer minimal serving latency. The proposed solution has been implemented using TensorFlow Serving and Kubernetes on an edge cluster. In simulation and testbed experiments under various system settings, the joint strategy consistently achieved lower latency than simply searching for the best edge node to serve inference requests.
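
To illustrate the dispatch side of this idea, the following is a minimal sketch, not the authors' implementation. It assumes that serving latency can be approximated as network latency plus queuing and computing time on a single first-in, first-out server, and that a request can only go to a node already provisioned with the requested model. The names `EdgeNode`, `serving_latency`, `dispatch`, and the model name `resnet` are hypothetical, and the numbers are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeNode:
    name: str
    network_ms: float   # assumed known network latency from the user to this node
    queued: int         # requests already waiting on this node
    # Hypothetical provisioning table: model name -> compute time per request (ms).
    service_ms: dict = field(default_factory=dict)


def serving_latency(node, model):
    """Estimate network + queuing + computing latency in ms.

    Returns None when the model is not provisioned on the node.
    Assumes one FIFO server, so the new request waits for all queued ones.
    """
    per_request = node.service_ms.get(model)
    if per_request is None:
        return None
    return node.network_ms + (node.queued + 1) * per_request


def dispatch(nodes, model):
    """Return (estimated latency, node name) for the lowest-latency node."""
    candidates = []
    for node in nodes:
        latency = serving_latency(node, model)
        if latency is not None:
            candidates.append((latency, node.name))
    if not candidates:
        raise ValueError(f"no edge node is provisioned with {model!r}")
    return min(candidates)


nodes = [
    EdgeNode("near", network_ms=5.0, queued=4, service_ms={"resnet": 20.0}),
    EdgeNode("far", network_ms=25.0, queued=0, service_ms={"resnet": 20.0}),
    EdgeNode("idle", network_ms=10.0, queued=0),  # model not provisioned here
]

latency, chosen = dispatch(nodes, "resnet")
print(chosen, latency)
```

In this toy setting the nearest node is not the best choice because its queue dominates the total, and the idle node cannot serve the request at all because the model is not loaded there. That second case is where provisioning decisions and dispatch decisions interact.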

Keywords: Kubernetes; edge computing; machine learning inference.

MeSH terms

Computer Simulation
Machine Learning*