Source: https://eng.lyft.com/powering-millions-of-real-time-decisions-with-lyftlearn-serving-9bb1f73318dc

Powering Millions of Real-Time Decisions with LyftLearn Serving

By Hakan Baba & Mihir Mathur

Photo by Alina Grubnyak on Unsplash

Hundreds of millions of real-time decisions are made each day at Lyft by online machine learning models. These model-based decisions include price optimization for rides, incentives allocation for drivers, fraud detection, ETA prediction, and innumerable others that impact how riders move and drivers earn.

Making real-time inferences with machine learning (ML) models at scale is complex. The complexity arises across two planes of model serving:

  • Data Plane: Encompasses steady-state concerns such as network traffic, CPU/memory consumption, and model inference.
  • Control Plane: Encompasses moving parts such as model (un)deployment, retraining, naming and versioning, experimentation, backward compatibility, etc.

Numerous teams at Lyft have use cases for real-time predictions. To support them, we set out to build an easy, fast, and flexible way to serve models online, streamlining the management of the data plane and control plane without hiding them from modelers.

To craft a seamless model serving experience, we had to overcome two sets of technical challenges:

  1. Variety of user requirements: Different teams care about different system requirements, such as extremely tight latency limits (single-digit millisecond), high throughput (>10⁶ RPS), ability to use niche ML libraries, support for continual learning, etc. This leads to a vast operating environment which is technically challenging to create and maintain.
  2. Constraints imposed by our legacy system: We had a monolithic service that was already in use for serving models. While it addressed some of these challenges, it also imposed several constraints. For instance, the monolithic design restricted the libraries and versions that could be used for different models, which led to operational problems like unrelated teams blocking each other from deploying and unclear ownership during incidents.
LyftLearn Serving Requirements. The bar width represents the rough span of the model serving requirements.

To address these challenges, we built a key component for our ML platform: LyftLearn Serving. LyftLearn Serving is a robust, performant, and decentralized system for deploying and serving ML models; any team at Lyft can use it to easily run online model inference through network calls. LyftLearn Serving is closely coupled with our ML development/prototyping environment, LyftLearn.

In the following sections, we describe major components and important design decisions for LyftLearn Serving. Following this deep dive, we share a summary of the key ideas, learnings, and next steps.

LyftLearn Serving — Major Components & Considerations

Microservice Architecture

At Lyft, most software systems are built using the microservices architecture. LyftLearn Serving is no different and leverages the excellent microservices tooling available at Lyft for testing, networking, operational management, and much more.

The main microservice of LyftLearn Serving is depicted in the following diagram within the bold rectangle:

LyftLearn Serving Microservice and its relationship with other tooling

The LyftLearn Serving runtime consists of:

  • HTTP Serving Library: The HTTP server interface is mostly powered by Flask. We apply some internal tuning on top of open-source Flask to account for the Envoy load balancer and the underlying Gunicorn web server.
  • Core LyftLearn Serving Library: This library is the crux of the business logic of LyftLearn Serving, housing the various capabilities needed by the customers of the ML platform. This library contains logic for model (un)loading, model versioning, request handling, model shadowing, model monitoring, prediction logging, etc.
  • Custom ML/Predict Code: This is a flexible Python interface, fulfilled by ML modelers, that lets them inject arbitrary code into the LyftLearn Serving runtime. It surfaces functions such as load and predict, described in more detail in a later section.
  • Third-Party ML Library: The majority of ML models use third-party modeling frameworks such as TensorFlow, PyTorch, LightGBM, XGBoost, or a proprietary framework. LyftLearn Serving does not impose any restriction on the framework as long as there is a Python interface for it.
  • Other components offered by the Lyft microservices architecture: The LyftLearn Serving runtime implements additional interfaces powering metrics, logs, tracing, analytics events, and model monitoring. It sits on top of Lyft’s compute infrastructure, which uses the Envoy service mesh and the Kubernetes scheduler. A simplified sketch of how these layers fit together follows this list.
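
The sketch below is a minimal illustration, not Lyft's actual internals: the names ModelRuntime, create_app, and load_model are assumptions, the real core library does far more (model shadowing, monitoring, versioning, prediction logging), and in production the Flask app sits behind Gunicorn and the Envoy service mesh.

from typing import Any, Dict

from flask import Flask, jsonify, request


class ModelRuntime:
    """Illustrative stand-in for the core LyftLearn Serving library."""

    def __init__(self) -> None:
        # model_id -> user-supplied model object (which wraps a third-party
        # ML library such as LightGBM or PyTorch under the hood)
        self._models: Dict[str, Any] = {}

    def load_model(self, model_id: str, model: Any) -> None:
        # The real system (un)loads models dynamically; this sketch simply
        # keeps them in an in-memory registry.
        self._models[model_id] = model

    def predict(self, model_id: str, features: Any) -> Any:
        # Delegates to the customer's dependency-injected predict code.
        return self._models[model_id].predict(features)


def create_app(runtime: ModelRuntime) -> Flask:
    # HTTP serving layer: Flask, fronted by Gunicorn and Envoy in production.
    app = Flask(__name__)

    @app.route("/infer", methods=["POST"])
    def infer():
        payload = request.get_json()
        output = runtime.predict(payload["model_id"], payload["features"])
        return jsonify({"output": output})

    return app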

Ownership & Isolation

One of the most important features of LyftLearn Serving is that it gives each team complete independence over their source code, deploy pipeline, ML library versions, and service runtime. With 40+ teams at Lyft using LyftLearn Serving, teams would frequently impede each other’s progress without such isolation.

Isolation could be achieved at many different levels, but we decided to target the GitHub repo level. This was a natural decision at Lyft, as there is already a breadth of tooling to create a dedicated service from a repo.

As depicted in the figure below, each team using LyftLearn Serving gets its own isolated code repository. Depending on the complexity of their use cases, a team may use a single repo or distribute their ML models across several.

Isolated components in the LyftLearn Serving framework

Isolated repositories help define clear ownership boundaries. For example, each repository unambiguously identifies a clear owning team for library or toolchain updates, on-call escalation paths, etc. While this is not unique to LyftLearn Serving within the Lyft microservices architecture, it is different from other serving systems which are centralized. Utilizing the existing conventions at Lyft has proven to be very beneficial for operations and maintenance.

Isolated repos also enable each team to leverage a bespoke deploy pipeline, powering independent publishing to staging or production. If one team misses a bug in PR tests and breaks the deploy pipeline, no other team gets blocked. If a change needs to be reverted, the impact is bound to one team’s resources.

Finally, a team’s service runtimes are isolated in the Envoy service mesh and Kubernetes orchestration engine through dedicated network mesh naming. Each team can tune their container CPU, memory resources, pod replica counts, autoscaling targets, and production alarms independently. As a result, it is less complex to achieve reproducible performance.

Config Generator

Many libraries and infrastructure components discussed in the previous section work together to power LyftLearn Serving; however, stitching everything together takes a considerable amount of application configuration. Since we do not expect ML modelers to know the intricacies of the underlying systems, and we want them to spend as little time on setup as possible, we generate the full application config for them. This spares customers from having to understand all the config formats involved (Terraform, YAML, Salt, Python, or JSON), and it ensures the config files contain the details, such as runtime secrets and database entries, required for correct operation. Once modelers run the config generator, the result is a ready-to-go online microservice capable of processing network requests in the Lyft service mesh.

Config Generator for Creating Service Repositories

The config generator is based on the Yeoman generator. An ML modeler onboarding to LyftLearn Serving for the first time runs the generator, answers a couple of questions and gets a fully populated GitHub repo with functional code and config. The generated repo includes a few working examples of how to write custom inference code and satisfy LyftLearn Serving interfaces. Once the generated code is merged and deployed, the customer gets a fully working LyftLearn Serving microservice ready to load new models and process requests.

Model Self-Tests

Given the many moving pieces in the control plane, it is important to ensure the correctness of models. For instance, dependency versions can break a model’s backward compatibility. To guarantee that models continue to work as expected amid continuous changes to the underlying training and serving container images, we have a family of tests unique to LyftLearn Serving called model self-tests.

ML modelers specify a small number of sample model inputs and expected outputs in a function called test_data, as illustrated below. The specified test data is saved and packaged alongside the model binary itself.

import pandas as pd

# TrainableModel is the LyftLearn-provided base class for models.
class SampleNeuralNetworkModel(TrainableModel):
    @property
    def test_data(self) -> pd.DataFrame:
        return pd.DataFrame(
            [
                # input `[1, 0, 0]` should generate output close to `[1]`
                [[1, 0, 0], 1],
                [[1, 1, 0], 1],
            ],
            columns=["input", "score"],
        )

Model self-tests run small predictions on models using the test data and ensure the actual results are close enough to the expected results. Model self-tests run in two distinct locations:

  1. At runtime in LyftLearn Serving facets: After loading every model, the system evaluates test_data and generates logs and metrics for the ML modelers so they can address any failures.
  2. Anytime a new PR is created: CI evaluates all models loaded in a LyftLearn model repo against the previously stored test data.
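
As a rough illustration of what such a check does (the helper below is a hypothetical sketch, not the actual LyftLearn Serving test runner), the stored test data can be replayed through the model's predict function and compared against the expected scores within a tolerance:

import numpy as np
import pandas as pd


def run_model_self_test(model, test_data: pd.DataFrame, tolerance: float = 0.1) -> None:
    # `model` is any object exposing the predict interface; `test_data` is the
    # DataFrame returned by the model's test_data property.
    for _, row in test_data.iterrows():
        actual = model.predict(row["input"])
        expected = row["score"]
        # "Close enough" rather than exact equality, since library upgrades or
        # hardware differences can shift outputs slightly.
        if not np.isclose(actual, expected, atol=tolerance):
            raise AssertionError(
                f"Self-test failed for input {row['input']}: "
                f"expected ~{expected}, got {actual}"
            )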

LyftLearn Serving Interfaces

ML modelers have the ability to modify the runtime of LyftLearn Serving using the load and predict functions mentioned earlier:

def load(self, file: str) -> Any:
    # <CUSTOM LOADING CODE HERE>
    ...

def predict(self, features: Any) -> Any:
    # <CUSTOM PREDICT CODE HERE>
    ...

load is called whenever a model needs to be loaded into memory; it deserializes an ML model object from the file it was saved to during training. The predict function handles online inference; its call frequency is on the order of the number of requests served by the LyftLearn Serving microservice.

Customers can write any Python code in their load and predict functions, subject to a short list of restrictions. The specific implementation is dependency-injected into LyftLearn Serving’s model inference runtime, enabling the platform to generalize to many different use cases.
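
For example, a team serving a LightGBM model might fill in these hooks roughly as follows. This is a hypothetical sketch: the class name, feature handling, and base-class wiring depend on each team's setup.

from typing import Any

import lightgbm as lgb
import pandas as pd


class DriverEtaModel:
    # Hypothetical customer model wrapping a LightGBM booster.

    def load(self, file: str) -> Any:
        # Deserialize the booster from the file written at training time.
        self.booster = lgb.Booster(model_file=file)
        return self.booster

    def predict(self, features: Any) -> Any:
        # Pre-process the request features into the frame the model expects,
        # then call the third-party library's predict interface.
        frame = pd.DataFrame([features])
        return float(self.booster.predict(frame)[0])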

Lifetime of an Inference Request

How an Inference Request is handled by LyftLearn Serving

An example of an inference request to a LyftLearn Serving microservice, and its response, might look like:

POST /infer
{
  "model_id": "driver_model_v2",
  "features": {
    "feature1": "someValue",
    "feature3": {
      "a": "a",
      "b": 4.9
    }
  }
}

{
  "output": 9.2
}

This HTTP request is received by the Flask/Gunicorn server. The view function for the infer route is provided by the LyftLearn Serving core library. First, the ML platform code retrieves the model for the given model_id and executes a few key tasks such as input feature validation and model shadowing. Next, the dependency-injected custom ML predict code is executed; this custom code usually pre-processes the input features and makes a prediction using an underlying third-party ML library’s prediction interface (such as LightGBM predict) to return the prediction output. Lastly, more platform code runs to emit stats and logs and to generate analytics events that track the performance and correctness of the predictions, and the prediction output is returned to the caller in an HTTP response.
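
From the caller’s perspective, this is just an HTTP POST to the service’s /infer route. A hypothetical client might look like the following (the host name is made up; real callers route through the Envoy service mesh):

import requests

response = requests.post(
    "http://driver-models.lyftlearn-serving.internal/infer",
    json={
        "model_id": "driver_model_v2",
        "features": {"feature1": "someValue", "feature3": {"a": "a", "b": 4.9}},
    },
    timeout=0.1,  # short timeout, consistent with tight online-inference latency budgets
)
prediction = response.json()["output"]  # e.g. 9.2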

Development Flow With LyftLearn Serving

The development flow for any task relating to LyftLearn Serving begins by taking a look at the docs. Documentation is a first class citizen in the LyftLearn Serving project. It follows the Diátaxis framework, arranged into 4 modes: tutorials, how-to guides, technical references, and discussions. A brand new ML modeler who does not know much about online model inference may start at Tutorials > Getting Started, while an ML modeler looking to shadow a new model within an existing LyftLearn Serving service might start at How-to Guides > How to Enable Model Shadowing.

There are two primary interfaces through which developers can modify the LyftLearn Serving runtime: the model repo and the LyftLearn UI. LyftLearn UI is an ML computation web application that modelers use for model iteration and management. The following diagram shows some of the functionality that can be controlled through the UI (e.g. one-click deploys, monitoring) and the model repo (manipulating the deploy pipeline, model CI/CD):

Interfaces for Modifying the LyftLearn Serving Runtime

Having this duality of interfaces enables different types of ML modelers (e.g., software engineers, data scientists) to use the mode most suitable for their jobs and skill sets.

Summary & Learnings

In summary, the key design axioms of LyftLearn Serving are the following:

  • Model serving as a library
  • Distributed serving service ownership
  • Seamless integrations with development environment
  • User-supplied prediction code
  • First-class documentation

While building LyftLearn Serving, we learned several lessons:

  • Define the term “model”. “Model” can refer to a wide variety of things (e.g. the source code, the collection of weights, files in S3, the model binary, etc.), so it’s important to carefully define and document what “model” refers to at the start of almost every conversation. Having a canonical set of definitions in the ML community for all of these different notions of “models” would be immensely helpful.
  • Supply user-facing documentation. For platform products, thorough, clear documentation is critical for adoption. Great documentation leads to teams understanding the systems and self-onboarding effectively, which reduces the platform teams’ support overhead.
  • Expect serving endpoints to be used indefinitely. Once a model is serving inference requests behind a network endpoint, it’s likely to be used indefinitely. Therefore, it is important to ensure that the serving system is stable and performs well. Moreover, migrating old models to a new serving system can be incredibly challenging.
  • Prepare to make hard trade-offs. We faced many trade-offs, such as building a “seamless end-to-end UX for new customers” vs. “composable intermediary APIs for power customers”, or enabling “bespoke ML workflows for each team” vs. enforcing “the rigor of software engineering best practices”. We made case-by-case decisions based on user behavior and feedback.
  • Make sure your vision is aligned with the power customers. It’s important to align the vision for a new system with the needs of power customers. In our case that meant prioritizing stability, performance, and flexibility above all else. Don’t be afraid to use boring technology.

What’s Next?

We made LyftLearn Serving available internally at Lyft in March 2022. The ML Platform team then started the migration efforts to move online models from our legacy service to LyftLearn Serving. Using various techniques, we quickly completed the migration — a topic that deserves a separate blog post covering its unique challenges and our solution. The platform is now continuously receiving new feature requests and has a growing user base.

