
Amazon SageMaker Serverless Inference Now Generally Available

source link: https://www.infoq.com/news/2022/05/sagemaker-serverless-aws/


May 08, 2022 2 min read

Amazon recently announced that SageMaker Serverless Inference is generally available. Designed for workloads with intermittent or infrequent traffic patterns, the new option provisions and scales compute capacity according to the volume of inference requests the model receives.

Similar to other serverless services on AWS, SageMaker Serverless Inference endpoints automatically start compute resources and scale them in and out depending on traffic, without requiring users to choose an instance type or manage scaling policies, and can scale instantly from tens to thousands of inferences within seconds. It is also possible to specify the memory requirements for the serverless inference endpoint. Antje Barth, principal developer advocate at AWS, explains the benefits of the new option:

In a lot of conversations with ML practitioners, I’ve picked up the ask for a fully managed ML inference option that lets you focus on developing the inference code while managing all things infrastructure for you. SageMaker Serverless Inference now delivers this ease of deployment.
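In practice, the memory size is the main capacity setting to choose. A minimal sketch of deploying a model to a serverless endpoint with the SageMaker Python SDK might look like the following; the model artifact, IAM role, entry point, and sizing values are illustrative assumptions rather than details from the announcement:

```python
from sagemaker.serverless import ServerlessInferenceConfig
from sagemaker.sklearn import SKLearnModel  # any SageMaker Model subclass works similarly

# Memory size and a cap on concurrent invocations are the only capacity settings;
# no instance type or scaling policy is specified.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=5,
)

model = SKLearnModel(
    model_data="s3://my-bucket/model.tar.gz",              # hypothetical artifact location
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # hypothetical IAM role
    entry_point="inference.py",                            # hypothetical inference script
    framework_version="1.0-1",
)

# Compute capacity is provisioned and scaled automatically based on incoming traffic.
predictor = model.deploy(serverless_inference_config=serverless_config)
```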


Source: https://aws.amazon.com/it/blogs/aws/amazon-sagemaker-serverless-inference-machine-learning-inference-without-worrying-about-servers/

The preview of the serverless option was introduced at re:Invent 2021, and since then the cloud provider has added support for the Amazon SageMaker Python SDK and the Model Registry, making it possible to integrate serverless inference endpoints into an MLOps workflow.

The need for a serverless option, as well as alternatives to SageMaker, was previously discussed in a Reddit thread. Leveraging container image support in AWS Lambda is another approach to running serverless machine learning workloads, as explained by Luca Bianchi, CTO at Neosperience.
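As a rough illustration of that Lambda-based alternative, a model baked into a container image can be loaded once per execution environment and served from a plain handler; the handler name, model path, model format, and request shape below are illustrative assumptions, not details from Bianchi's write-up:

```python
import json

import joblib  # assumes a scikit-learn model serialized with joblib inside the image

# Load the model at import time so warm invocations reuse it across requests.
MODEL = joblib.load("/opt/ml/model.joblib")

def handler(event, context):
    # Expects a JSON body such as {"instances": [[5.1, 3.5, 1.4, 0.2]]}
    payload = json.loads(event.get("body", "{}"))
    predictions = MODEL.predict(payload["instances"]).tolist()
    return {"statusCode": 200, "body": json.dumps({"predictions": predictions})}
```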

Philipp Schmid, technical lead at Hugging Face, writes:

SageMaker Serverless Inference will 100% help you accelerate your machine learning journey and enables you to build fast and cost-effective proofs-of-concept where cold starts or scalability is not mission-critical, which can quickly be moved to GPUs or more high scale environments.

In a separate article, Schmid and co-authors from AWS explain how to host Hugging Face transformer models using SageMaker Serverless Inference. Barth adds a warning on how to handle cold-starts:

If the endpoint does not receive traffic for a while, it scales down the compute resources. If the endpoint suddenly receives new requests, you might notice that it takes some time for the endpoint to scale up the compute resources to process the requests. This cold-start time greatly depends on your model size and the start-up time of your container. To optimize cold-start times, you can try to minimize the size of your model, for example, by applying techniques such as knowledge distillation, quantization, or model pruning.
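Following the general approach described in Schmid's article, a sketch of hosting a transformer model from the Hugging Face Hub on a serverless endpoint could look like the following; the model ID, framework versions, memory size, and IAM role are illustrative assumptions:

```python
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

hf_model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",  # example model
        "HF_TASK": "text-classification",
    },
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical IAM role
    transformers_version="4.26",  # assumed supported version combination
    pytorch_version="1.13",
    py_version="py39",
)

predictor = hf_model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=4096,  # larger models generally need more memory
        max_concurrency=10,
    )
)

print(predictor.predict({"inputs": "Serverless inference looks promising."}))
```

As the quote above notes, keeping the model small, for example through distillation, quantization, or pruning, remains the main lever for reducing cold-start latency.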

In addition to the new serverless option, Amazon SageMaker offers three other model inference options to support different use cases: SageMaker Real-Time Inference, designed for workloads with low latency requirements in the order of milliseconds; SageMaker Asynchronous Inference, suggested for inferences with large payload sizes or long processing times; and SageMaker Batch Transform, to run predictions on batches of data.

Customers can create and update a serverless inference endpoint using the SageMaker console, the AWS SDKs, the SageMaker Python SDK, the AWS CLI, or AWS CloudFormation. Pricing is per millisecond, based on the compute time used to run the inference code and the amount of data processed. The free tier includes "150,000 seconds of inference duration" per month for the first two months.
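For teams working with the lower-level AWS SDKs, a minimal sketch using boto3 might look like the following; the endpoint, configuration, and model names as well as the sizing values are illustrative assumptions:

```python
import boto3

sm = boto3.client("sagemaker")

# A serverless endpoint configuration replaces the usual instance type and count
# with a ServerlessConfig block.
sm.create_endpoint_config(
    EndpointConfigName="my-serverless-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-model",  # a model already created in SageMaker
            "ServerlessConfig": {"MemorySizeInMB": 2048, "MaxConcurrency": 5},
        }
    ],
)

# Create a new endpoint from the configuration...
sm.create_endpoint(EndpointName="my-endpoint", EndpointConfigName="my-serverless-config")

# ...or point an existing endpoint at the serverless configuration instead.
# sm.update_endpoint(EndpointName="my-endpoint", EndpointConfigName="my-serverless-config")
```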

About the Author

Renato Losio

Renato has many years of experience as a software engineer, tech lead, and cloud services specialist in Italy, the UK, Portugal, and Germany. He lives in Berlin and works remotely as a principal cloud architect for Funambol. Cloud services and relational databases are his main working interests. He is an AWS Data Hero.

