
How internal politics interfere with data science

source link: https://towardsdatascience.com/how-internal-politics-interfere-with-data-science-ff31dd89efce?gi=ed9e03cf2484

How companies can help data scientists do their job

May 28 · 4 min read


Machine learning in production is a fairly new phenomenon, and as such, the playbook for managing data science teams that build production ML pipelines is still being written. As a result, well-intentioned companies often get things wrong, and accidentally institute policies that, while designed to improve things, actually hamper their data scientists.

I work with a lot of data science teams (for context, I contribute to Cortex, an open source model serving platform), and I constantly hear stories about weird internal policies that, while designed to make things smoother, make their day-to-day harder.

Policies like…

1. All cloud resources need a formal request

If your team doesn’t use local machines, then just about every operation you do requires cloud resources. Processing data, training and evaluating models, experimenting in notebooks, deploying models to production — everything requires a server.

I’ve worked with more than one team in which any request for cloud resources needed to be submitted formally for approval. That meant nearly every new ML-related operation required a request process.

Typically, companies adopt this setup for security and control purposes. The fewer people who have cloud access, the thinking goes, the fewer opportunities there are for mistakes. The problem is that development slows to a crawl in these environments. Instead of exploring new ideas or building new models, data scientists spend days navigating the red tape required to get an EC2 instance.

Giving data scientists cloud privileges, at least enough that they can independently do the basic cloud operations required for their job, would significantly increase the speed at which teams move.
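To make that concrete, here is a minimal sketch (my own illustration, not any particular team's setup) of what tightly scoped access could look like with boto3: data scientists can launch, stop, and inspect their own EC2 instances without filing a ticket, while the very largest instance types stay off-limits. The policy name, action list, and instance-type patterns are placeholder assumptions.

```python
# Sketch: a scoped IAM policy that lets data scientists manage their own
# EC2 instances without a ticket, while denying the most expensive sizes.
# Policy name, actions, and instance-type patterns are illustrative only.
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow the day-to-day operations: launch, stop, and inspect instances.
            "Effect": "Allow",
            "Action": [
                "ec2:RunInstances",
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:TerminateInstances",
                "ec2:DescribeInstances",
            ],
            "Resource": "*",
        },
        {
            # Cap the blast radius: block the largest (and priciest) instance types.
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringLike": {"ec2:InstanceType": ["*.24xlarge", "*.metal"]}
            },
        },
    ],
}

iam.create_policy(
    PolicyName="data-science-self-serve",
    PolicyDocument=json.dumps(policy_document),
)
```

Attach a policy like this to the team's IAM role and the request process disappears for routine work, while the cost and security guardrails stay in place.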

2. GPUs are fine — but only for training

One data science team we worked with was actually given their own AWS accounts, one for dev and another for prod. Their dev account, where they did model training, had access to GPUs. Their prod account, where models were deployed, did not.

On some level, you can kind of see where the DevOps team was coming from. They were held accountable for cloud spend, and GPU inference can get expensive. If data scientists were getting by without it before, why did they need it now?

In order to get GPUs, the data science team would need to prove that the models they were deploying couldn’t generate predictions with acceptable latency without GPUs. That’s a lot of friction—particularly when you realize they were already running GPU instances in dev.
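For what it's worth, producing that evidence doesn't have to be a big project either. Below is a rough sketch of the kind of latency comparison a team might run; the model (ResNet-50), input shape, and PyTorch usage are my own stand-ins, not details from that team.

```python
# Sketch: compare p95 inference latency on CPU vs. GPU for a single model.
# The model and input shape are stand-ins, not the team's actual workload.
import time
import torch
import torchvision

def p95_latency_ms(model, batch, runs=100):
    with torch.no_grad():
        for _ in range(10):          # warm-up passes
            model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            model(batch)
            if batch.is_cuda:
                torch.cuda.synchronize()   # wait for the GPU to finish
            timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    return timings[int(0.95 * len(timings))]

model = torchvision.models.resnet50(weights=None).eval()
batch = torch.randn(1, 3, 224, 224)

print(f"CPU p95 latency: {p95_latency_ms(model, batch):.1f} ms")
if torch.cuda.is_available():
    print(f"GPU p95 latency: {p95_latency_ms(model.cuda(), batch.cuda()):.1f} ms")
```

If the CPU numbers already fit the latency budget, great; if they don't, the team has its proof. Either way, the measurement itself takes an afternoon, not an approval cycle.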

3. Notebook instances must be monitored nonstop

One data science team had some weird policies when it came to notebooks.

Basically, the IT department was very vigilant about preventing unnecessary spending, specifically on notebook instances. They pushed aggressively to have instances deleted as soon as possible, and they had to sign off on any new notebook instance.

In theory, holding someone responsible for how resources are managed is a good thing. However, holding someone accountable for managing another team’s resources is clearly a recipe for disaster. Instead of ending up with a lean, financially responsible data science team, these companies burn money on wasted time, as data scientists spend hours twiddling their thumbs waiting for approval from IT.

In addition, this is terrible for morale. Imagine needing to file a ticket just to open up a notebook, or having to answer probing questions about any long-running notebook instance.

Infrastructure makes a better safeguard than policy

As frustrating and bizarre as these policies can seem, they’re all designed for legitimate purposes. Companies want to institute safeguards to control costs, create accountability, and ensure security.

The mistake, however, is that these safeguards are instituted through management policies when they should be baked into the infrastructure.

For example, instead of having a policy for requesting cloud resources, some of the best teams we work with simply give data scientists IAM roles with tightly scoped permissions. Similarly, instead of having another team constantly monitoring data scientists, many companies give the data science team a provisioned-but-limited sandbox to experiment in.
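A scoped IAM role like the one sketched earlier covers the first half. The "provisioned-but-limited sandbox" half can be automated too: for example, a cost budget on the sandbox account alerts the team long before a human reviewer would notice. The snippet below is a sketch with a placeholder account ID, dollar amount, and email address, not a prescription.

```python
# Sketch: a monthly cost budget on a data science sandbox account, with an
# email alert at 80% of the limit. Account ID, amount, and address are placeholders.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # the sandbox account
    Budget={
        "BudgetName": "ds-sandbox-monthly",
        "BudgetLimit": {"Amount": "2000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "ds-leads@example.com"}
            ],
        }
    ],
)
```

Nobody has to watch notebook instances all day; the guardrail lives in the account itself.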

We think about how to solve these issues via infrastructure all the time when building our model serving platform. For example, we want to lower inference costs, but we also want to give data scientists the latitude to use the best tool for the job. Instead of putting limits around GPU inference, we built out spot instance support. Now, teams can use GPU instances with ~90% discounts.
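Under the hood, the idea isn't exotic: ask EC2 for the same GPU instance type on the spot market instead of on-demand, and handle interruptions gracefully. Here's a bare-bones sketch with boto3; the AMI ID and instance type are placeholders, and this illustrates the concept rather than Cortex's actual implementation.

```python
# Sketch: launch a GPU instance on the spot market instead of on-demand.
# ImageId and InstanceType are placeholders.
import boto3

ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: a deep learning AMI
    InstanceType="g4dn.xlarge",        # a common GPU inference instance type
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            # If AWS reclaims the capacity, terminate the instance and let the
            # cluster replace it, rather than leaving a stopped instance around.
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```

The design choice is the point: the cost control is built into how instances are provisioned, not into who is allowed to ask for them.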

When you try to enforce these safeguards through management policies, you introduce a complex web of human dependencies and a hierarchy of authority. Inevitably, this slows progress and introduces friction.

When you institute these safeguards as features of your infrastructure, however, you enable your team to move independently and quickly—just with some guardrails.

