Practical Lessons for Scaling Machine Learning Solutions in the Real World

Building machine learning solutions at scale remains a huge challenge. Here are some non-trivial lessons that should be considered.

It's a new year and the artificial intelligence (AI) community is going crazy with its predictions for 2020. Instead of thinking about flashy milestones, my wish is that in this new year we make progress streamlining the end-to-end lifecycle of large-scale machine learning solutions. Despite all the progress in machine learning stacks, the implementation of large-scale solutions remains a difficult challenge for most organizations.

My company, Invector Labs, focuses on building AI solutions for large enterprises and startups. As a result, our teams constantly face challenges that fall outside machine learning theory. Last year, I presented a session about practical lessons learned during the implementation of large-scale machine learning solutions. The session covered 10 practical and not-so-obvious patterns that we've observed in deep learning solutions implemented by teams at Invector Labs. Whereas we can find a lot of advanced research on different areas of deep learning, the reference architectures for implementing real-world solutions remain a relatively nascent space. As a result, most data science teams in the industry are just learning about the building blocks that are needed to implement deep learning solutions at scale. The goal of the presentation was to highlight some key practical points that organizations should consider when embarking on their data science journey.

The entire slide deck is available below and I would like to use this post to summarize the 10 lessons included in the presentation.

Let’s go through a quick analysis of the top lessons:

Lesson #1: Data Scientists Make Horrible Engineers

Maybe an obvious point but it turns out that data scientists don’t always know how to write production-ready code. How many times do you see Jupyter notebooks with hardcoded parameters and configurations?

To address this challenge, we typically recommend segmenting a data science team into data scientists who focus on experimentation, data science engineers who implement production-ready models, and DevOps engineers who focus on the automation, deployment and operationalization of data science workflows. Many times, the data science and data science engineering teams end up using completely different frameworks. At Facebook, data science teams rely on PyTorch for experimentation while most of the production models run on Caffe2.

Lesson #2: Notebooks Don't Scale… Wait, Notebooks Do Scale, Stupid

Many rookie data science teams try to take the code in Jupyter or Zeppelin notebooks and run it in production. In addition to being notoriously slow, notebooks are also difficult to parametrize or execute on a scheduled basis.

To deal with this challenge, we typically package the code in a notebook into containers that can run and scale as part of a container infrastructure. However, we also recently discovered new mechanisms, using tools like Papermill or nteract, that make it possible to execute data science notebooks at scale.
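As a rough illustration of the Papermill approach, the sketch below executes a parameterized notebook from a script. The notebook names, parameter values and data path are placeholders, and the input notebook is assumed to contain a cell tagged "parameters" that Papermill can override.

```python
# A minimal sketch of running a parameterized notebook with Papermill.
# The notebook names, parameters and paths below are placeholders; the
# source notebook is assumed to have a cell tagged "parameters".
import papermill as pm

pm.execute_notebook(
    "train_model.ipynb",            # source notebook (hypothetical)
    "train_model_output.ipynb",     # executed copy with all cell outputs
    parameters={
        "learning_rate": 0.001,
        "epochs": 10,
        "data_path": "s3://my-bucket/training-data/",  # placeholder path
    },
)
```

The same run can be triggered from a scheduler with the equivalent CLI call, for example `papermill train_model.ipynb train_model_output.ipynb -p epochs 10`.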

Lesson #3: The Single Deep Learning Framework Fallacy

Every large enterprise organization dreams of standardizing all of its data science efforts on a small group of technologies. That single-framework fallacy is only achievable in companies with a centralized data science organization because, in most cases, different data science teams are going to independently adopt the tools and frameworks that are better suited for their jobs and experience.

Instead of trying to fight the proliferation of multi-framework scenarios, we recommend embracing it by implementing the right infrastructure. Typically, we recommend implementing common aspects of data science workflows, such as data cleansing, feature extraction or hyperparameter optimization, on a common infrastructure that can work across different frameworks such as TensorFlow, PyTorch, Caffe2 and many others.
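One way to picture that kind of shared infrastructure is a thin, framework-agnostic interface that common services program against, with framework-specific trainers plugged in behind it. The sketch below is purely illustrative; the class and registry names are my own assumptions, not an API from the presentation.

```python
# A purely illustrative sketch of a framework-agnostic training interface.
# Shared infrastructure (data cleansing, feature extraction, hyperparameter
# search) programs against the Trainer protocol, while each team registers
# its own TensorFlow, PyTorch or Caffe2 implementation behind it.
from typing import Any, Callable, Dict, Protocol


class Trainer(Protocol):
    def train(self, features: Any, labels: Any, hyperparams: Dict[str, Any]) -> Any:
        """Train a model and return a framework-specific artifact."""
        ...


_TRAINERS: Dict[str, Callable[[], Trainer]] = {}


def register_trainer(framework: str, factory: Callable[[], Trainer]) -> None:
    _TRAINERS[framework] = factory


def get_trainer(framework: str) -> Trainer:
    return _TRAINERS[framework]()


# Example: a dummy trainer standing in for a real framework integration.
class DummyTrainer:
    def train(self, features, labels, hyperparams):
        return {"framework": "dummy", "hyperparams": hyperparams}


register_trainer("dummy", DummyTrainer)
print(get_trainer("dummy").train([[1, 2]], [0], {"lr": 0.01}))
```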

Lesson #4: Training is a Continuous Task

How many times do we see models deployed to production that suddenly start performing poorly with new datasets? Or how many times do you try to retrain a model just to find out that the training logic has been hard-coded into the algorithm itself?

Training in deep learning scenarios is not a one-time exercise but a continuous task. As a result, we recommend taking the time to automate the training routines for each deep learning model you create and deploying them on an infrastructure that allows on-demand and scheduled execution.
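A minimal sketch of what "training as a continuous task" can look like in practice is a parameterized entry point that a scheduler (cron, an orchestrator, a Kubernetes CronJob) can invoke on demand or on a schedule. The argument names, paths and the train_and_save helper below are hypothetical.

```python
# train_job.py -- a parameterized training entry point that can be run
# on demand or from a scheduler. Argument names and paths are placeholders.
import argparse
import datetime


def train_and_save(data_path: str, output_dir: str, epochs: int) -> None:
    # Placeholder for the actual training logic: load fresh data, fit the
    # model, evaluate it and persist the resulting artifact.
    print(f"training on {data_path} for {epochs} epochs -> {output_dir}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Retrain a model on fresh data")
    parser.add_argument("--data-path", required=True)
    parser.add_argument("--output-dir", default="models")
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args()

    # Tag every run so retrained models never overwrite each other.
    run_id = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    train_and_save(args.data_path, f"{args.output_dir}/{run_id}", args.epochs)


if __name__ == "__main__":
    main()
```

A cron entry such as `0 2 * * 0 python train_job.py --data-path /data/latest` would then retrain the model every week without anyone touching the code.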

Lesson #5: Centralized Training Doesn’t Scale

I was recently talking to one of the data science teams at Uber and they mentioned that the UberEats application alone uses about 100 data science models. The effort of training those models regularly can be incredibly expensive from a computational standpoint and nearly impossible to implement using a traditional centralized architecture. Not surprisingly, the Uber engineering team recently open-sourced Horovod, a distributed training framework for TensorFlow models.

Decentralized or federated training architectures are a model we recommend for organizations with a large number of deep learning models. Building the infrastructure by which a single training job can be distributed across different nodes and executed concurrently with other jobs is an effort likely to pay immediate dividends in large deep learning implementations.
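To make the distributed-training idea concrete, here is a minimal Horovod sketch for a Keras model, following the general data-parallel pattern Horovod documents; the toy model and synthetic data are placeholders, not anything from the presentation.

```python
# A minimal sketch of data-parallel training with Horovod and Keras.
# Launch with: horovodrun -np 4 python distributed_train.py
# The toy model and synthetic data below are placeholders.
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per worker/GPU

# Pin each process to a single GPU if GPUs are available.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

x = np.random.rand(1024, 20).astype("float32")
y = np.random.randint(0, 2, size=(1024,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across all processes.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])

model.fit(
    x, y,
    batch_size=32,
    epochs=3,
    # Keep all workers' weights in sync at the start of training.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)
```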

Lesson #6: Feature Extraction can Become a Reusability Nightmare

Extracting features from input datasets is a required step in any deep learning architecture. The problem arises when you have many data science teams spending weeks building the same feature extraction routines over and over again because there is no way to reuse features across models.

A feature store is a component we typically recommend in large deep learning architectures. A feature store persists the feature representations of specific input datasets in a way that they can be reused by other data science models. Key-value databases like Cassandra or DynamoDB are a great fit for implementing feature stores.
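As a rough sketch of the feature store idea on top of DynamoDB, the snippet below writes and reads feature vectors keyed by entity and feature set. The table name, key schema and JSON encoding are assumptions made only for illustration, not a prescribed design.

```python
# A minimal sketch of a feature store backed by DynamoDB. The table name
# and key schema ("entity_id" partition key + "feature_set" sort key) are
# illustrative assumptions. Feature vectors are serialized as JSON strings
# to sidestep DynamoDB's restrictions on float types.
import json
import boto3

table = boto3.resource("dynamodb").Table("feature_store")  # hypothetical table


def put_features(entity_id: str, feature_set: str, features: dict) -> None:
    table.put_item(Item={
        "entity_id": entity_id,
        "feature_set": feature_set,          # e.g. "user_profile_v2"
        "features": json.dumps(features),
    })


def get_features(entity_id: str, feature_set: str) -> dict:
    item = table.get_item(Key={"entity_id": entity_id, "feature_set": feature_set})
    return json.loads(item["Item"]["features"])


# One team writes the features once; other models can simply reuse them.
put_features("user-123", "user_profile_v2", {"avg_order_value": 48.5, "num_orders": 17})
print(get_features("user-123", "user_profile_v2"))
```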

Lesson #7: Everyone Wants a Different Version of Your Model

Have you ever seen this scenario? A data science team builds a supervised model and trains it using a specific dataset. Then another team comes along and decides it would like to train a new version of the model on a different dataset (maybe different sales data or a specific demographic). After that, another team wants to make a couple of minor changes to the model for its specific scenarios. Before you know it, there are 10 different versions of your original model, all of which need to be trained, deployed and monitored.

Model versioning is an important element of any data science strategy. We have found AutoML to be an incredibly effective approach for creating self-service variations of specific models. Salesforce's TransmogrifAI uses this approach very successfully to power variations of models in the Einstein platform.
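A lightweight way to reason about this is a model registry that records each variation together with the dataset and parameter overrides it was trained on. The sketch below is a toy, in-memory illustration of that idea with made-up names; it is not a reference to any specific AutoML product.

```python
# A toy, in-memory sketch of a model registry that tracks versions of a
# base model along with the dataset and overrides each one was trained on.
# All names are illustrative; a real registry would persist this metadata.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ModelVersion:
    name: str
    version: int
    dataset: str
    overrides: Dict[str, Any] = field(default_factory=dict)


class ModelRegistry:
    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, dataset: str, **overrides: Any) -> ModelVersion:
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(name, len(versions) + 1, dataset, overrides)
        versions.append(mv)
        return mv

    def latest(self, name: str) -> ModelVersion:
        return self._versions[name][-1]


registry = ModelRegistry()
registry.register("churn_predictor", dataset="sales_emea_2019")
registry.register("churn_predictor", dataset="sales_apac_2019", learning_rate=0.01)
print(registry.latest("churn_predictor"))
```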

Lesson #8: Cloud Heavens, On-Premise Hells

Cloud AI infrastructures such as AWS SageMaker or Google Cloud ML simplify the implementation of complex deep learning solutions by providing large-scale infrastructure that can automate the entire lifecycle of a data science application workflow. However, many organizations in regulated industries are required to deploy and operate complex deep learning workflows on-premise, which is far from an easy endeavor.

Building a highly scalable computation cluster using technologies such as Apache Spark or Flink is a recommended step to enable the execution of deep learning workflows on-premise. In many cases, organizations end up building a hybrid infrastructure in which cloud platforms are used for prototyping and development and the on-premise infrastructure is used for production data science workloads.
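To give a flavor of what an on-premise Spark-based workflow step might look like, here is a minimal PySpark sketch that prepares training features on the cluster; the paths and column names are placeholders for illustration only.

```python
# A minimal sketch of a preprocessing step running on an on-premise Spark
# cluster. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-preprocessing").getOrCreate()

# Read raw events from the cluster's storage layer (placeholder path).
events = spark.read.parquet("hdfs:///data/raw/events")

# Aggregate simple per-user features for downstream training jobs.
features = (
    events.groupBy("user_id")
    .agg(
        F.count("*").alias("num_events"),
        F.avg("amount").alias("avg_amount"),
    )
)

features.write.mode("overwrite").parquet("hdfs:///data/features/user_activity")
spark.stop()
```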

Lesson #9: Regularization & Optimization are a Must

Deep learning solutions are a cycle of constant experimentation and optimization. Models show signs of overfitting or start underperforming all the time, forcing data science teams to go back and refactor the code.

Incorporating the right tools to monitor the behavior of a model and perform regularization and optimization tasks is a recommended step in any deep learning strategy. Techniques such as model scoring and hyperparameter optimization should be implemented from day one. Thankfully, the deep learning ecosystem has a growing number of tools and platforms in this space.
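As a concrete example of building regularization and monitoring in from day one, the Keras sketch below combines L2 weight penalties, dropout and early stopping on a validation metric. The toy model and synthetic data are placeholders.

```python
# A minimal Keras sketch combining common regularization techniques with
# validation monitoring and early stopping. The toy model and synthetic
# data are placeholders for illustration.
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu", input_shape=(20,),
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),  # L2 weight penalty
    ),
    tf.keras.layers.Dropout(0.3),  # randomly drop units to reduce overfitting
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop training when the validation loss stops improving and keep the
# best-scoring weights -- a simple form of continuous model scoring.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

model.fit(x, y, validation_split=0.2, epochs=50, callbacks=[early_stop])
```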

Lesson #10: Different Models Require Different Execution Patterns

Plain and simple, APIs are not for everything. Some deep learning models execute in real time while others take days to complete. Enabling different execution patterns is a key element of any successful deep learning architecture.

At a basic level, we recommend considering three execution patterns for deep learning models: on-demand (APIs), scheduled and publish-subscribe. There are others, but those three cover a large percentage of the deep learning scenarios in real-world applications. Building the right infrastructure to enable deep learning workflows to execute in those three modalities can save a lot of headaches down the road.
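As an illustration of the on-demand pattern, the sketch below exposes a model behind a small Flask endpoint; the scheduled and publish-subscribe patterns would instead wrap the same predict call in a cron/orchestrator job or a message-queue consumer. The model loading, route and payload format are placeholders.

```python
# A minimal sketch of the on-demand (API) execution pattern with Flask.
# The predict() logic and route are placeholders; the same call could be
# wrapped in a scheduled job or a message-queue consumer for the other
# two execution patterns.
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict(features: list) -> float:
    # Placeholder for loading a trained model and running inference.
    return sum(features) / max(len(features), 1)


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    payload = request.get_json(force=True)
    score = predict(payload.get("features", []))
    return jsonify({"score": score})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```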

Well, there you have it. Those are some of my favorite lessons for the implementation of large-scale deep learning solutions. As always, setting up the right infrastructure that enables rapid experimentation and iteration is a necessary step to be successful in your deep learning journey.

