
Some Things I Wish I Had Known Before Scaling Machine Learning Solutions: Part I

source link: https://www.tuicool.com/articles/ziyM7ni

Recently, I’ve been touring different conferences presenting a talk about best practices for implementing large-scale machine learning solutions. The idea is to present a series of non-obvious ideas that turn out to be incredibly practical when implementing machine intelligence applications in the real world. All the lessons are based on our experiences at Invector Labs working with large organizations and ambitious startups to implement machine learning capabilities. During those engagements, we quickly realized that many of our assumptions about machine learning apps were deeply flawed and that there was a huge gap between the advances in AI research and the practical viability of those ideas. In this two-part article, I would like to summarize some of those ideas, which will hopefully prove valuable to machine learning practitioners and aspiring data scientists.

There are many challenges that surface in the implementation of real-world machine learning solutions. Most of them are related to the mismatch between the lifecycle of machine learning programs and that of traditional software applications. With some exceptions, traditional software applications follow a relatively sequential model from design to production. Machine learning models, on the other hand, follow a circular lifecycle that includes aspects such as regularization or optimization that have no equivalent in the current toolset of traditional software applications.


Each of the stages in the lifecycle of machine learning solutions introduces unique sets of challenges that have no equivalent in the traditional software world. Some of those challenges are non-trivial or even paradoxical and can be encountered in different shapes or forms. Some of the key areas of challenge are summarized in the following figure:


The good news is that most of those challenges are solvable with the current generation of machine learning frameworks and tools. However, some of the solutions are far from obvious. Let’s look at some of the key challenges and solutions across the lifecycle of machine learning programs.

15 Lessons About Scaling Machine Learning Solutions

Strategy & Processes

Planning and strategizing is a key element in the adoption of machine learning best practices, specifically in large organizations. During the strategizing phase, there are a few challenges that become very visible:

Challenge: Data Scientists Make Horrible Engineers

No offense intended to the data science community :wink: but most data scientists don’t tend to think about engineering concerns such as code readability, testing or deployment. As a result, many of the models created by data scientists need to be heavily refactored in order to be operationalized.

The most successful organizations I’ve seen address the data science code quality challenge by allocating a dedicated team to operationalize models. That team is often referred to as data engineering, and its responsibility is to refactor, and sometimes even rewrite, data science models to make them production ready.

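As a hypothetical illustration of the kind of refactoring such a team does, consider a one-off notebook expression like `df["ctr"] = df["clicks"] / df["views"]` being turned into a documented, validated function that can be unit tested and deployed (the function name and validation rules here are my own, not from the article):

```python
def click_through_ratio(clicks: float, views: float) -> float:
    """Return clicks/views, guarding against division by zero.

    Negative inputs raise immediately so bad data fails loudly
    instead of silently propagating into a model.
    """
    if clicks < 0 or views < 0:
        raise ValueError("counts must be non-negative")
    if views == 0:
        return 0.0
    return clicks / views


# The refactored function is trivially testable, unlike the
# original notebook one-liner:
assert click_through_ratio(5, 10) == 0.5
assert click_through_ratio(3, 0) == 0.0
```

The logic is unchanged; what the data engineering team adds is the validation, documentation, and testability that make the code safe to operate.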

Challenge: Neither Agile nor Waterfall Processes Work for Machine Learning

Agile and waterfall methodologies are the two biggest schools of thought when it comes to software development. When applied to machine learning applications, waterfall models fall short because most of the requirements are not known upfront and estimating the time needed to create a specific model is next to impossible. Similarly, agile methods fail because short iterations are often impractical for machine learning models.

Although I don’t claim to have a definitive answer on the right methodology for machine learning applications, an approach that has been relatively effective is to divide the development process into segments that can be approached using agile and waterfall methodologies respectively.


Data Engineering

Collecting and preparing datasets is one of the most frequently underestimated efforts in machine learning solutions. In this phase, there are several challenges that machine learning teams need to confront.

Challenge: Feature Extraction can Become a Reusability Nightmare

Feature extraction is one of the common aspects of the lifecycle of machine learning solutions. Conceptually, feature extraction focuses on identifying the key aspects of the data that can be used by machine learning models. While feature extraction is conceptually simple for a single model, the picture gets really complicated for organizations building dozens of machine learning models that share a common set of features.

One of the most effective techniques I’ve seen to address the feature reusability challenge is to build a centralized feature store that maintains a persistent representation of the features used by the different machine learning models. This is the approach followed by stacks such as Uber’s Michelangelo.

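To make the feature store idea concrete, here is a minimal in-memory sketch (my own illustration, far simpler than production systems like Michelangelo’s store): features are written once by the extraction pipeline and then read by any number of models, so the extraction logic is never duplicated.

```python
from datetime import datetime, timezone


class FeatureStore:
    """Minimal in-memory feature store: models read shared feature
    values keyed by (entity_id, feature_name)."""

    def __init__(self):
        # (entity_id, feature_name) -> (value, write timestamp)
        self._features = {}

    def put(self, entity_id: str, name: str, value) -> None:
        self._features[(entity_id, name)] = (
            value, datetime.now(timezone.utc))

    def get(self, entity_id: str, name: str):
        value, _ts = self._features[(entity_id, name)]
        return value

    def feature_vector(self, entity_id: str, names: list):
        # Several models can request the same features without
        # re-implementing the extraction logic.
        return [self.get(entity_id, n) for n in names]


store = FeatureStore()
store.put("user_42", "avg_session_minutes", 12.5)
store.put("user_42", "purchases_30d", 3)
print(store.feature_vector(
    "user_42", ["avg_session_minutes", "purchases_30d"]))  # [12.5, 3]
```

A real feature store adds persistence, versioning, and separate online/offline serving paths, but the reuse contract is the same: one write, many model consumers.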

Challenge: Labeled Datasets are Incredibly Hard to Produce

Supervised learning models dominate the machine learning ecosystem, and they typically require large volumes of labeled data. However, producing those datasets is incredibly difficult and resource intensive, and it is typically impractical for most organizations.

Automated data labeling is an effective way to deal with the data labeling nightmare. The principle is to create routines that can probabilistically assign labels to training datasets. Among the technology stacks on the market, project Snorkel is one that has been steadily gaining traction in this area.

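The core idea can be sketched with a few hand-written labeling functions for a toy spam classifier (the functions below are my own illustration; Snorkel itself learns a probabilistic model over the labeling functions’ agreements rather than the simple majority vote used here):

```python
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote


def lf_contains_link(text):
    return "spam" if "http://" in text else ABSTAIN


def lf_all_caps(text):
    return "spam" if text.isupper() else ABSTAIN


def lf_greeting(text):
    return "ham" if text.lower().startswith("hi ") else ABSTAIN


def weak_label(text, labeling_functions):
    """Assign a weak label by majority vote over non-abstaining votes."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


lfs = [lf_contains_link, lf_all_caps, lf_greeting]
print(weak_label("win big at http://prizes.example", lfs))  # spam
print(weak_label("hi team, notes attached", lfs))           # ham
```

Each heuristic is cheap to write and individually noisy; aggregating many of them yields probabilistic labels for datasets far larger than anyone could annotate by hand.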

Model Experimentation

Experimentation is the cornerstone of any machine learning development lifecycle. The ability to play with and test different models and architectures often represents the difference between success and failure in the machine learning world. However, experimentation also introduces its own set of challenges into the machine learning lifecycle.

Challenge: The Single Framework Fallacy

Large enterprises cherish the idea of technology consolidation and like to concentrate their efforts on a small number of machine learning tools and frameworks. However, frameworks that are good for experimentation often fall short for production workloads, and vice versa. As a result, it is very common for organizations to leverage different machine learning stacks for the experimentation and operationalization stages respectively, which introduces certain levels of technical debt and fragmentation.

When it comes to machine learning, optimizing for productivity is a better strategy than optimizing for consistency. As a result, accepting a world in which companies use different machine learning frameworks should be the standard. An approach that we’ve seen be effective in this area is to use an intermediate representation to port models across the different frameworks. ONNX is one of the most robust frameworks for facilitating that.

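To illustrate the intermediate-representation idea at a conceptual level (this is a deliberately simplified sketch of my own, not the ONNX format: real ONNX serializes a full computation graph with typed operators, not just parameters), a model trained in one framework can be exported to a framework-neutral file that a different runtime then loads and executes:

```python
import json


def export_linear_model(weights, bias, path):
    """Write a trained linear model to a framework-neutral JSON spec."""
    with open(path, "w") as f:
        json.dump({"op": "linear", "weights": weights, "bias": bias}, f)


def load_and_predict(path, features):
    """A second 'runtime' that knows only the neutral spec, not the
    framework that trained the model."""
    with open(path) as f:
        spec = json.load(f)
    if spec["op"] != "linear":
        raise ValueError(f"unsupported op: {spec['op']}")
    return sum(w * x for w, x in zip(spec["weights"], features)) + spec["bias"]


export_linear_model([0.5, -1.0], 2.0, "model.json")
print(load_and_predict("model.json", [4.0, 1.0]))  # 0.5*4.0 - 1.0*1.0 + 2.0 = 3.0
```

Because producer and consumer agree only on the neutral spec, the experimentation stack and the production stack can evolve independently, which is exactly the decoupling ONNX provides between real frameworks.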

In the second part of this article, we will continue with more challenges and solutions for machine learning in the real world.

