
“Just What I Needed”: Making Machine Learning Scalable and Accessible at Grubhub

source link: https://bytes.grubhub.com/just-what-i-needed-making-machine-learning-scalable-and-accessible-at-grubhub-24734cc4139d


Photo by Tiard Schulz on Unsplash

Data scientists at Grubhub develop and deploy predictive models to improve business decision-making, as well as in-app diner, driver, and restaurant experiences. Until recently, taking models from the prototype stage to run as scheduled jobs in production was both challenging and time-consuming, as we lacked suitable infrastructure and standardized tools. This, in turn, resulted in multiple bespoke solutions, duplicated code, and a good deal of model maintenance overhead.

Over the past few months, data scientists and machine learning engineers have partnered to develop our first machine learning platform: a suite of tools designed to democratize and increase the velocity of machine learning model deployments. This post will summarize the need for such a framework, as well as provide a technical overview of our particular implementation and lessons learned during the development process.

The Problem

Our data science teams are embedded in business, product, and technology groups in order to solve problems and implement new solutions with data. This decentralized organization comes with certain advantages, namely faster iteration cycles, but also creates unevenness in technical ability. Over time, some teams have developed their own custom pipelines for model deployment, while others, lacking software engineering expertise of their own, have not benefited from these tools.

This has meant that, in practice, production machine learning at Grubhub has historically been accessible to only a small number of teams. Moreover, the differing stacks on which models were deployed created technical debt and limited our ability to scale solutions across the organization.

Introducing the ML Framework

To address some of these issues, we developed a platform to provide reliable, reproducible machine learning pipelines that simplify the process of training and deploying models for data scientists across the company. This platform has five goals:

  • Lower the barrier to entry for model deployment
  • Reduce time between model development and deployment
  • Minimize technical debt associated with maintaining multiple bespoke solutions for machine learning
  • Codify standards and best practices for developing and deploying models at scale
  • Provide a common forum in which data scientists can collaborate

Projects are built within this framework using a number of Python libraries that have been designed to optimize developer efficiency and improve maintainability and extensibility:

  1. Data Access Objects: This layer manages and tracks access to the data warehouse. The boilerplate mechanics of reading and writing data are abstracted away from data scientists, reducing the amount of plumbing they need to worry about (see the sketch after this list). The library encourages efficient data management in addition to providing features such as data lineage tracking, query sanitization, and schema validation.
  2. Feature Engineering: We maintain a shared feature pool to which different teams and projects can contribute. Machine learning models will often use the same features, and maintaining these features centrally reduces code redundancy, cuts time to deployment, and improves cross-team collaboration.
  3. Model job/application: The application layer handles all project level configurations, including model specification and parameters, compute resource requirements, training, evaluation, scoring, monitoring, and project builds.
  4. Utility libraries: These repositories store common code and tools shared between projects, from simple classes like a standard logger to more sophisticated interfaces such as a standard PySpark application.
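To make the data access pattern in item 1 concrete, here is a minimal sketch of what a DAO-style wrapper might look like. The class, method, and table names are hypothetical assumptions for illustration, not the actual library interface.

```python
# A hypothetical sketch of a DAO-style wrapper; names are illustrative only.
from pyspark.sql import SparkSession, DataFrame


class TableDAO:
    """Wraps reads and writes against a warehouse table so individual
    projects do not re-implement connection, validation, and lineage
    boilerplate."""

    def __init__(self, spark: SparkSession, table: str):
        self.spark = spark
        self.table = table

    def read(self, since: str = None) -> DataFrame:
        df = self.spark.table(self.table)
        if since is not None:
            # Assumes the table is partitioned by an event_date column.
            df = df.where(df.event_date >= since)
        return df

    def write(self, df: DataFrame, mode: str = "append") -> None:
        # A real implementation would also validate the schema and record
        # lineage metadata before persisting.
        df.write.mode(mode).saveAsTable(self.table)


# Typical usage inside a project:
# spark = SparkSession.builder.getOrCreate()
# orders = TableDAO(spark, "warehouse.orders").read(since="2020-01-01")
```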

The modularization of this architecture facilitates sharing of code across different data science projects. It also enforces machine learning best practices, ensures consistency in configuration, and standardizes the way in which models are deployed at Grubhub, enabling greater reproducibility and elevating the quality of our model outputs.

In the next section, we will review the process of deploying a model to illustrate how the various layers of code interact.

Developing and Deploying ML Models

Mapping out the workflow of a data scientist bringing a developed model to production.

We divide our model deployment process into three phases: systems design, feature engineering, and model deployment. These phases constitute separate, compartmentalized workflows intended to minimize context switching.

ML Production Design

After a model has been prototyped, typically in a Jupyter notebook environment, the next step involves systems design of the productionized model.

This design step is arguably the most important of the deployment process, as there is a high time cost in switching back and forth between designing, planning, and coding. To help data scientists minimize such context switching, and so that they can better predict their workload, we created a series of standard questions to guide them through the planning process. Documenting these decisions and making them available to our data science community results in better solutions, and ultimately makes the development process much smoother.

These are the standard questions that should be answered before productionization:

  1. What are the input data locations and how often are they updated?
  2. How is the data labeled? What’s the risk of getting stuck in a feedback loop?
  3. How are the model features defined? Are there feature engineering job requirements?
  4. What libraries does the model depend on? (e.g. scikit-learn, XGBoost, TensorFlow, etc.)
  5. Will predictions be made online or offline?
  6. What downstream jobs or processes will depend on the output of this job? What are the SLAs (Service Level Agreements)?
  7. What are the compute resources needed to execute the jobs within the agreed upon SLAs?
  8. How often should the model be retrained?
  9. What metrics will be used to evaluate the model? If the model is supervised, is there a prepared holdout set that can be used for validation?
  10. What is the plan for monitoring train and predict jobs, as well as the model performance? What are the consequences of training or prediction job failure?

Once sufficient responses have been documented and reviewed, we can begin building the pipelines. Development involves taking a proof of concept model, often tested on small-scale, sampled data, and translating it to work at scale. This process typically consists of one or more PySpark applications that handle everything from feature engineering and model training to validation and scoring.
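Many of these documented decisions end up as project-level configuration consumed by the application layer. The dictionary below is a hypothetical sketch of what such a configuration might capture; the keys and values are illustrative assumptions, not the framework's actual schema.

```python
# Hypothetical project configuration; keys and values are illustrative only.
PROJECT_CONFIG = {
    "model": {
        "library": "xgboost",                    # Q4: model dependencies
        "params": {"max_depth": 6, "eta": 0.1},
    },
    "data": {
        "inputs": ["warehouse.orders"],          # Q1: input data locations
        "feature_job": "diner_order_features",   # Q3: feature engineering job
    },
    "scoring": {"mode": "offline"},              # Q5: online vs. offline predictions
    "schedule": {
        "train": "weekly",                       # Q8: retraining cadence
        "predict": "daily",
        "sla_hours": 6,                          # Q6/Q7: SLAs and compute sizing
    },
    "evaluation": {"metric": "auc", "holdout": "validation_holdout"},  # Q9
    "monitoring": {"alert_on_failure": True},    # Q10: failure handling
}
```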

Feature Engineering

Model features are computed in PySpark applications that run on a recurring schedule. The primary benefit of computing features in jobs separate from model training or prediction is the ability to reuse features built in previous projects. The ML Platform team maintains a series of PySpark applications and UDFs (User Defined Functions) that can be easily extended to include new features not already part of the existing feature pool.
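To make this concrete, a scheduled feature job might look roughly like the sketch below; the table names, columns, and feature definitions are assumptions made for illustration.

```python
# A minimal sketch of a scheduled PySpark feature job; table and column
# names are hypothetical.
from pyspark.sql import SparkSession, functions as F


def main():
    spark = SparkSession.builder.appName("diner_order_features").getOrCreate()

    orders = spark.table("warehouse.orders")  # hypothetical source table

    # Aggregate per-diner ordering behavior over a trailing 30-day window.
    features = (
        orders
        .where(F.col("order_date") >= F.date_sub(F.current_date(), 30))
        .groupBy("diner_id")
        .agg(
            F.count("*").alias("orders_30d"),
            F.avg("order_total").alias("avg_order_total_30d"),
            F.countDistinct("restaurant_id").alias("distinct_restaurants_30d"),
        )
        .withColumn("feature_date", F.current_date())
    )

    # Shared UDFs can be applied the same way; a trivial inline example:
    bucketize = F.udf(lambda n: "high" if n and n >= 10 else "low", "string")
    features = features.withColumn("order_volume_bucket", bucketize("orders_30d"))

    # Append to the shared feature pool so other projects can reuse these columns.
    features.write.mode("append").saveAsTable("features.diner_order_features")


if __name__ == "__main__":
    main()
```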

Model Deployment

This phase includes, at minimum, developing a reproducible job to train the model and, if applicable, a job that will make predictions on new data. The training job may or may not run at the same cadence as the prediction job, and will typically produce a serialized model artifact that is stored and versioned.

Once a job has been successfully tested in the development environment, and the output validated, it can then be deployed to production, following a code review process. Deploying an offline model for batch prediction involves scheduling the feature engineering and model training/scoring jobs in Azkaban, a scheduler tool we use to manage computing resources and execution of the model run. For online models, the training job is also scheduled in Azkaban, but the trained model is converted to an object that can then be deployed in our applications for real-time scoring.
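As a rough sketch of the training side, the job below trains a model and writes a versioned, serialized artifact that a batch prediction job can later load. The model type, paths, and naming convention are assumptions for illustration, not the framework's actual conventions.

```python
# A minimal sketch of a reproducible training job; paths and naming are
# illustrative assumptions.
import datetime

import joblib
from sklearn.ensemble import GradientBoostingClassifier


def train(features, labels, artifact_dir="/models/diner_reorder"):
    model = GradientBoostingClassifier(n_estimators=200, max_depth=6)
    model.fit(features, labels)

    # Version the serialized artifact by training date so prediction jobs
    # (and rollbacks) can pin to a specific model run.
    version = datetime.date.today().isoformat()
    path = f"{artifact_dir}/model_{version}.joblib"
    joblib.dump(model, path)
    return path


# A scheduled batch prediction job would later load the pinned artifact:
# model = joblib.load(path)
# scores = model.predict_proba(new_features)[:, 1]
```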

Lessons Learned

Here are a few key takeaways we learned while building the first version of our ML Framework:

  • Establishing new standards and processes can uncover previously hidden tech debt in legacy projects

Legacy projects that seemed to have been running smoothly were, once audited, discovered to be in need of unplanned updates. For example, we learned that a model had not been retrained since its initial deployment, and that the code needed to do so lived in a Jupyter notebook outside of the deployment pipeline.

Tech debt is an unavoidable side effect of developing code in a changing environment, and it is crucial to account for such maintenance when planning out project deliverables.

  • Spark and distributed computing have a high learning curve

This might seem obvious, but it bears repeating. At Grubhub, much of the source data we use to prepare model training datasets is large enough that processing it necessitates a distributed computing tool like Spark.

Gaining proficiency in Spark, given its complexity, can be a significant time commitment, and creating data models, as well as writing and debugging Spark jobs, is often a bottleneck during a project’s development. We have somewhat mitigated this friction through new tooling and training sessions, and we are continuing to work to lower the barrier to entry even further.

  • Python dependency management is hard

Python has a wealth of packages maintained by the open-source community, making it simple to experiment with a variety of models and algorithms. However, when it comes to creating reliable, automated builds of a repository that has many dependencies, it is easy to fall into dependency hell if package versions are not tightly controlled.

Fortunately, we have adopted tools like pip-compile to auto-generate requirements files, and we are working with the infrastructure team to integrate containers into the dev and build process.
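For example, with pip-tools a project keeps only its direct dependencies in a loosely pinned requirements.in and compiles them into a fully pinned requirements.txt for reproducible builds. The packages below are purely illustrative.

```
# requirements.in — direct dependencies with loose pins (illustrative)
pandas
scikit-learn>=0.22
xgboost

# Compile into a fully pinned requirements.txt (including transitive
# dependencies) for reproducible builds:
#   pip-compile requirements.in
```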

What’s to Come

We’re thrilled that what we’ve built so far has contributed to streamlining the machine learning and model development process at Grubhub, and we plan to democratize machine learning — and better support the needs of our business — even further.

Here are a few of the projects we have planned:

  • Integrated UIs for monitoring and data/model visualization

Currently, much of the metadata associated with model development exists in disparate sources like configuration files, database tables, and monitoring tools. We plan to integrate and surface such information to make it easy for anyone in the company to search past experiments, view model validation reports, and share data visualizations. These dashboards will also enable us to monitor model performance and feature drift over time.

  • A Feature Metastore service for improving discoverability and sharing of existing features

We want to expand on existing efforts to consolidate feature engineering work by building a Feature Metastore service that will enable discovery and reuse of existing features, as well as publication of new ones.

  • More automation

A good deal of work has already been done to abstract away the most common aspects of data pipeline applications, but there is still plenty of boilerplate code that needs to be written for a new project. We would like to make it possible for new projects to be created on the fly from templates using a few criteria as inputs.

Special thanks to Robert Doherty, Damon Mok, and Michelle Koufopoulos for their help with the writing of this blog post.

