Inference Scaling of ML deployments
- Project
- 20219 IML4E
- Type
- New service
- Description
Optimizing the performance of machine learning (ML) models typically focuses on intrinsic factors, such as the number of neurons, the depth of layers, or the numerical precision of computations. Such approaches inherently modify a model’s internal architecture, attributes, and parameters, which can lead to trade-offs in accuracy. Alternative avenues of performance optimization can, however, be explored at the serving layer. This experiment framework evaluates REST and gRPC, the two inference protocols widely supported across model-serving frameworks.
- Contact
- Dennis Muiruri
- dennis.muiruri@helsinki.fi
- Research area(s)
- Serving, performance
- Technical features
The experiment framework provides a way to evaluate the following aspects of a model deployment.
Performance of supported inference protocols: Performance differences between endpoints can be attributed to the underlying data exchange protocol for a given payload type. gRPC outperforms REST, showing lower latencies across the evaluated serving frameworks, and the payload type matters because of serialization overhead.
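To make the protocol comparison concrete, the sketch below times a single prediction over each protocol against a TensorFlow Serving instance. It is a minimal illustration rather than the framework's benchmarking code: the model name `my_model`, the input tensor name `input_1`, the random sample payload, and the default ports (8501 for REST, 8500 for gRPC) are assumptions that would need to match an actual deployment.

```python
import json
import time

import grpc
import numpy as np
import requests
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

MODEL = "my_model"                                           # placeholder model name
SAMPLE = np.random.rand(1, 224, 224, 3).astype(np.float32)   # placeholder payload


def rest_latency(host: str = "localhost:8501") -> float:
    """One REST prediction; timing includes JSON serialization and the HTTP round trip."""
    start = time.perf_counter()
    body = json.dumps({"instances": SAMPLE.tolist()})        # text (JSON) serialization
    resp = requests.post(
        f"http://{host}/v1/models/{MODEL}:predict",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    return time.perf_counter() - start


def grpc_latency(host: str = "localhost:8500") -> float:
    """One gRPC prediction; timing includes protobuf serialization and the RPC round trip."""
    channel = grpc.insecure_channel(host)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    start = time.perf_counter()
    request = predict_pb2.PredictRequest()
    request.model_spec.name = MODEL
    request.model_spec.signature_name = "serving_default"
    # "input_1" is a placeholder for the model's actual input tensor name.
    request.inputs["input_1"].CopyFrom(tf.make_tensor_proto(SAMPLE))  # binary serialization
    stub.Predict(request, 10.0)                              # 10 s deadline
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"REST latency: {rest_latency():.4f} s")
    print(f"gRPC latency: {grpc_latency():.4f} s")
```

The REST call serializes the payload as JSON text, whereas the gRPC call sends a binary TensorProto; this difference in serialization is where much of the protocol-level overhead comes from.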
Evaluation of serving framework performance: Some frameworks can be deemed more appropriate for production than others based on factors such as their performance profiles. Models deployed with TorchServe achieved better performance than models deployed with TensorFlow Serving after controlling for differences in payload type.
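A framework-to-framework comparison can follow the same pattern: send the same input to each framework's REST endpoint so that only the serving stack differs. The sketch below is illustrative only; it assumes default REST ports (8501 for TensorFlow Serving, 8080 for TorchServe), placeholder model names, a local `sample.jpg`, and a TensorFlow Serving model whose signature accepts base64-encoded image strings.

```python
import base64
import time

import requests

IMAGE_PATH = "sample.jpg"  # placeholder input image


def time_tf_serving(host: str = "localhost:8501", model: str = "my_model") -> float:
    """TensorFlow Serving REST: image passed as a base64-encoded JSON field."""
    with open(IMAGE_PATH, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    start = time.perf_counter()
    resp = requests.post(
        f"http://{host}/v1/models/{model}:predict",
        json={"instances": [{"b64": b64}]},
    )
    resp.raise_for_status()
    return time.perf_counter() - start


def time_torchserve(host: str = "localhost:8080", model: str = "my_model") -> float:
    """TorchServe REST: image passed as raw bytes in the request body."""
    with open(IMAGE_PATH, "rb") as f:
        data = f.read()
    start = time.perf_counter()
    resp = requests.post(f"http://{host}/predictions/{model}", data=data)
    resp.raise_for_status()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"TensorFlow Serving: {time_tf_serving():.4f} s")
    print(f"TorchServe:         {time_torchserve():.4f} s")
```

Note that even here the payload encodings differ (base64-encoded JSON versus raw bytes), which is exactly the kind of payload-type difference the experiments control for before comparing frameworks.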
Caching effects: Model-serving frameworks exhibit different caching behavior; a stronger caching effect was observed in TorchServe than in TensorFlow Serving.
Scalability under load: How serving frameworks scale inference under increasing load. CPU utilization improves under higher loads, especially in a multicore setting. Because gRPC is natively asynchronous, a gRPC client delivers a higher load to the server than a natively synchronous REST client, so gRPC can provide better performance with lower engineering overhead.
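To illustrate how load can be generated against an asynchronous gRPC endpoint, the sketch below fires a fixed number of concurrent requests through grpc.aio and reports throughput. It is a minimal load-generator sketch under assumed values: the use of grpc.aio, the model name, the input tensor name, the concurrency level, and the request count are illustrative choices, not parameters of the original experiments.

```python
import asyncio
import time

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

MODEL = "my_model"                                           # placeholder model name
SAMPLE = np.random.rand(1, 224, 224, 3).astype(np.float32)   # placeholder payload
CONCURRENCY = 64                                             # max in-flight requests
TOTAL_REQUESTS = 1000


def build_request() -> predict_pb2.PredictRequest:
    request = predict_pb2.PredictRequest()
    request.model_spec.name = MODEL
    request.model_spec.signature_name = "serving_default"
    # "input_1" is a placeholder for the model's actual input tensor name.
    request.inputs["input_1"].CopyFrom(tf.make_tensor_proto(SAMPLE))
    return request


async def run_load(host: str = "localhost:8500") -> None:
    async with grpc.aio.insecure_channel(host) as channel:
        stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
        request = build_request()
        semaphore = asyncio.Semaphore(CONCURRENCY)

        async def one_call() -> None:
            # The semaphore caps the number of concurrent in-flight RPCs.
            async with semaphore:
                await stub.Predict(request, timeout=10.0)

        start = time.perf_counter()
        await asyncio.gather(*(one_call() for _ in range(TOTAL_REQUESTS)))
        elapsed = time.perf_counter() - start
        print(f"{TOTAL_REQUESTS} requests in {elapsed:.2f} s "
              f"({TOTAL_REQUESTS / elapsed:.1f} req/s)")


if __name__ == "__main__":
    asyncio.run(run_load())
```

A synchronous REST client would need threads or multiple processes to sustain a comparable number of in-flight requests, which is the extra engineering overhead referred to above.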
- Integration constraints
None
- Targeted customer(s)
ML engineers involved in deployment, system architects, cloud service providers, and AI/ML consulting firms.
- Conditions for reuse
Open access.
- Confidentiality
- Public
- Publication date
- 25-09-2024
- Involved partners
- University of Helsinki (FIN)