
Real-time inference at scale on AWS



The primary way to incorporate machine learning into applications is by deploying a trained model as a web API on cloud infrastructure. Running inference at scale requires an understanding of several engineering design principles that may have more to do with DevOps than Data Science.

Running machine learning in production is not just about deploying models, but also ensuring your deployments are maintainable, your spend is optimized, and your web services are scalable. In this post, I’m going to highlight and explain concepts that are relevant for building infrastructure for real-time inference at scale on AWS.

1. Containerization

The atomic unit of an efficient inference cluster should be a Docker container per model. This abstraction enables autoscaling because each container is only running one specific type of workload. It also simplifies logging because there are no interleaved logs from different inference processes. Resource scheduling becomes easier because each container can be configured to request only the resources that optimize inference for its particular model. Finally, each container can be built using a minimal set of custom dependencies regardless of other workloads.
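
As an illustration of the per-model container pattern (a sketch, not code from the original post), each container might run nothing more than a small web server wrapping a single model; the DummyModel below is a placeholder for a real model loaded once at startup:

```python
# Illustrative sketch: a minimal single-model inference server of the kind
# each Docker container could run. The model is a trivial stand-in; in
# practice it would be loaded from disk or a model registry at startup.
from flask import Flask, jsonify, request

app = Flask(__name__)


class DummyModel:
    """Placeholder for a real model object."""
    def predict(self, features):
        return sum(features)  # stand-in for real inference


model = DummyModel()  # loaded once per container, at startup


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    return jsonify({"prediction": model.predict(payload["input"])})


if __name__ == "__main__":
    # One model per container keeps logs, dependencies, and resource
    # requests specific to this single workload.
    app.run(host="0.0.0.0", port=8080)
```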

The increasingly obvious choice for orchestrating containers is Kubernetes. In addition to providing a relatively simple way to program against a cluster of virtual machines, Kubernetes makes it easy to support features like rolling updates and autoscaling.

Amazon’s Elastic Kubernetes Service (EKS) handles some of the management challenges of running a production Kubernetes cluster. It adheres to the Kubernetes APIs, and while the management layer isn’t free, its cost is fixed and doesn’t scale with the workloads running on it. eksctl simplifies the experience of launching EKS clusters.

2. Instance selection

AWS offers many different types of instances. Selecting the right instance is a function of your models’ compute and memory requirements, acceptable web service latency, and infrastructure budget.

GPU infrastructure tends to speed up complex deep learning model inference, but CPU infrastructure may be more cost effective for simpler models. Some state-of-the-art models like OpenAI’s GPT-2 demand a lot of memory for a single prediction, so it’s important that each container gets access to sufficient memory resources.
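
To make the trade-off concrete, here is a rough back-of-the-envelope sketch (with illustrative latency and price numbers, not figures from the article) comparing the cost of serving a million predictions on CPU versus GPU infrastructure:

```python
# Back-of-the-envelope cost comparison. Latencies and hourly prices below
# are illustrative placeholders; measure your model's latency and look up
# current on-demand prices for your region.
def cost_per_million(latency_s: float, hourly_price: float, concurrency: int = 1) -> float:
    """Cost of serving 1M predictions if the instance stays fully utilized."""
    predictions_per_hour = concurrency * 3600 / latency_s
    return hourly_price * 1_000_000 / predictions_per_hour


# Hypothetical deep model: 200 ms/prediction on a CPU instance, 20 ms on a GPU instance.
print("CPU instance:", round(cost_per_million(0.200, 0.096), 2))
print("GPU instance:", round(cost_per_million(0.020, 0.526), 2))
```

With these made-up numbers the GPU comes out cheaper for the slow deep learning model, while a simpler model that already runs in a few tens of milliseconds on a CPU would be far cheaper to serve there.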

3. Autoscaling containers

As the number of concurrent inferences increases to more than a single container can handle, additional containers will be launched. This allows the cluster to request resources based on the volume of inference requests. An autoscaling process will spin containers up or down as a function of cumulative resource utilization. If you’re running a Kubernetes cluster, the Kubernetes horizontal pod autoscaler can help with automatically triggering scaling events.
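
The scaling rule behind the horizontal pod autoscaler is worth keeping in mind; roughly (a simplified sketch of the documented formula, ignoring tolerances, stabilization windows, and replica bounds):

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    """Simplified version of the HPA rule:
    desired = ceil(current * currentMetric / targetMetric).
    The real controller also applies tolerances, stabilization windows,
    and min/max replica bounds."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


# e.g. 4 replicas at 90% average CPU with a 60% target -> scale out to 6 replicas
print(desired_replicas(4, 0.90, 0.60))  # 6
```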

In addition, different models should be loaded into different containers to decouple their resource requests. For example, suppose your cluster is running two web services: each container of service A requests 1 GPU, while each container of service B requests 1 CPU. An increase in traffic to service B will be scheduled on much cheaper CPU resources without requiring additional GPUs.
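
A sketch of how these decoupled resource requests might be declared with the official kubernetes Python client (image names and quantities are placeholders, not taken from the article):

```python
# Illustrative sketch using the official `kubernetes` Python client: two
# container specs with independent resource requests.
from kubernetes import client

gpu_container = client.V1Container(
    name="service-a",
    image="registry.example.com/service-a:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"},  # GPUs are requested via limits
    ),
)

cpu_container = client.V1Container(
    name="service-b",
    image="registry.example.com/service-b:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "1", "memory": "2Gi"},
    ),
)
```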

4. Autoscaling instances

In addition to choosing the appropriate instance types for your inference workload, the simplest way to optimize spending is to automatically adjust the size of the cluster based on the aggregate resource requests of all containers on the cluster. In other words, if 5 instances can handle all the traffic, there is no reason to have 8 instances in your cluster. The aggregate number of hours that instances are running is the main driver of cost.

Before scaling down an instance when traffic decreases, it’s important to move all its containers onto other instances to ensure that there is no degradation in the performance of any web service.

One potential drawback to autoscaling instances is that a sudden increase in traffic may not be handled until new instances spin up, which isn’t instantaneous. You can address this by reserving excess capacity on your cluster as a buffer. If you’re running a Kubernetes cluster, the Kubernetes cluster autoscaler can help with automatically launching and terminating EC2 instances.
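
A much-simplified sketch of the sizing logic, including the reserved buffer (the real cluster autoscaler simulates pod scheduling rather than dividing totals):

```python
import math


def instances_needed(total_cpu_requested: float,
                     cpu_per_instance: float,
                     headroom: float = 0.2) -> int:
    """Simplified sizing: enough instances to cover all container CPU
    requests plus a buffer for sudden traffic spikes. The actual cluster
    autoscaler decides by simulating pod scheduling, not by dividing totals."""
    return math.ceil(total_cpu_requested * (1 + headroom) / cpu_per_instance)


# e.g. 18 vCPUs requested across all containers, 4 vCPUs per instance,
# 20% headroom -> 6 instances instead of a fixed fleet of 8.
print(instances_needed(18, 4, 0.2))  # 6
```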

5. Load balancing

Intelligent load balancing reduces the average request latency. A round robin protocol for handling requests ensures that no single replica is overloaded while other replicas are idle.

Kubernetes integrates tightly with AWS’s Elastic Load Balancing (ELB), and Istio can be used to support traffic splitting to power features like model A/B testing and canary deployments.
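
The two ideas, round-robin balancing across replicas and weighted traffic splitting between model versions, can be sketched in a few lines (replica addresses, version names, and weights below are made up):

```python
import itertools
import random

# Round robin: cycle through replicas so no single one is overloaded
# while others sit idle. Addresses are placeholders.
replicas = ["10.0.1.11:8080", "10.0.1.12:8080", "10.0.1.13:8080"]
replica_cycle = itertools.cycle(replicas)

# Traffic splitting (what Istio provides declaratively): send 95% of
# requests to the current model and 5% to a canary version.
def pick_version() -> str:
    return random.choices(["model-v1", "model-v2-canary"], weights=[95, 5])[0]

for _ in range(5):
    print(next(replica_cycle), pick_version())
```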

6. Collecting metrics

Production web services need to be monitored. For machine learning APIs, it’s especially important to track predictions to ensure that models are performing as expected. Each container should include a metrics agent like StatsD that makes asynchronous requests to a metrics backend without blocking the replica from running more inferences. It’s extremely wasteful to block a GPU while waiting for a service like CloudWatch to respond.
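
For example, using the statsd Python client, which sends fire-and-forget UDP packets so the request thread never waits on the metrics backend (metric names and the agent address below are illustrative):

```python
# Sketch: emitting prediction metrics over StatsD (UDP, fire-and-forget),
# so inference is never blocked waiting on a metrics service.
import time
import statsd

metrics = statsd.StatsClient("localhost", 8125, prefix="inference")


def predict_with_metrics(model, features):
    start = time.time()
    prediction = model.predict(features)
    metrics.timing("latency_ms", (time.time() - start) * 1000)
    metrics.incr("predictions")
    return prediction
```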

7. Aggregating logs

In addition to tracking predictions, it’s a good idea to collect logs from each container in order to enable simple debugging. However, the dynamic and distributed nature of containers makes this challenging. A tool like Fluentd is necessary to stream all logs from each container to a central service like CloudWatch that aggregates them.
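
One common pattern, sketched below rather than prescribed by the article, is to write structured JSON logs to stdout, where Fluentd can tail them from each container and forward them to CloudWatch:

```python
# Sketch: structured JSON logs written to stdout, where a log forwarder
# like Fluentd can collect and ship them. Field names are illustrative.
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("inference")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("prediction served")
```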

8. GPU and CPU instance groups

For many use cases, the request payloads will need to be pre-processed before being passed to the model for inference (e.g. tokenizing text). In addition, some inference outputs must be post-processed in order to produce a useful response to a client (e.g. returning a human-readable label). Although inference is often made more efficient by utilizing GPU infrastructure, it may be wasteful to run pre-processing and post-processing code on GPUs.

Creating a Kubernetes cluster that runs on top of CPU and GPU Auto Scaling Groups helps address this problem. Incoming request payloads and inference outputs can be processed in containers running on CPU instances, while inference can run in containers that are configured to use GPUs.
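
A sketch of that split: pre- and post-processing run in a CPU container that calls a separate GPU-backed inference service over HTTP (the service URL, tokenizer, and label map are hypothetical placeholders):

```python
# Sketch: CPU container handles pre- and post-processing and delegates the
# model forward pass to a GPU-backed service. URL, tokenizer, and labels
# are hypothetical.
import requests

GPU_INFERENCE_URL = "http://gpu-inference-service:8080/predict"  # hypothetical
LABELS = ["negative", "positive"]  # hypothetical post-processing map


def tokenize(text: str) -> list:
    """Stand-in for a real tokenizer; runs on cheap CPU instances."""
    return text.lower().split()


def classify(text: str) -> str:
    tokens = tokenize(text)                      # CPU: pre-processing
    resp = requests.post(GPU_INFERENCE_URL,      # GPU: model inference
                         json={"input": tokens},
                         timeout=5)
    class_index = resp.json()["prediction"]
    return LABELS[class_index]                   # CPU: post-processing
```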

9. Spot instances

AWS offers heavily discounted instances called Spot Instances. The catch is that AWS may interrupt your Spot Instances with little warning. However, this isn’t a major issue for stateless web services if your infrastructure handles failovers gracefully.

There are different ways to achieve this, but the design decisions listed earlier make it relatively simple. Each model is already running in decoupled containers that are orchestrated by Kubernetes. Kubernetes handles instance failures and reschedules containers on other instances. For our purposes, AWS reclaiming a Spot Instance is similar to any other instance failure.
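
As one sketch of graceful handling, an agent on each Spot node can poll the EC2 instance metadata endpoint for the documented interruption notice and start draining when one appears; the drain step is a hypothetical placeholder, and the snippet assumes IMDSv1-style metadata access without a session token:

```python
# Sketch: poll the EC2 instance metadata service for a Spot interruption
# notice (a 200 response on this path means the instance is scheduled to
# be reclaimed). drain_node() is a hypothetical placeholder.
import time
import requests

INTERRUPTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def drain_node():
    """Placeholder: cordon the node and let Kubernetes reschedule its pods."""
    print("interruption notice received; draining node")


while True:
    try:
        resp = requests.get(INTERRUPTION_URL, timeout=1)
        if resp.status_code == 200:
            drain_node()
            break
    except requests.RequestException:
        pass  # metadata service unreachable; keep polling
    time.sleep(5)
```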

Prefer to focus on data science?

If you’d rather not build this yourself, SageMaker is a managed service from AWS that simplifies deploying machine learning models in production. Alternatively, Cortex is an open-source platform that deploys machine learning models as web APIs. It’s designed to be self-hosted on AWS and takes advantage of the design principles listed above while abstracting their complexity.

