
Snorkel — A Weak Supervision System

 4 years ago
source link: https://www.tuicool.com/articles/7VJv6vV

Today’s powerful models, such as deep neural networks, produce state-of-the-art results on many tasks and are easier to spin up than ever before (using pre-trained models like ULMFiT and BERT). So instead of spending the bulk of our time carefully engineering features for our models, we can feed raw data (images, text, etc.) into these systems and let them learn their own features. This success has a hidden cost, though: these models require massive labeled training sets. For most real-world tasks, such training sets either do not exist or are rather small, and creating them can be expensive, slow, time-consuming, or even impractical (e.g., due to privacy concerns). The problem is made worse when domain expertise is required to label the data. Moreover, tasks can change over time, while hand-labeled training data is static and does not adapt.

A team at Stanford developed a set of approaches, broadly termed “weak supervision,” to address this data-labeling bottleneck. The idea is to programmatically label millions of data points.

There are various ways to programmatically generate training data: heuristics, rules of thumb, existing databases, ontologies, etc. The resulting training data is called weak supervision: it isn’t perfectly accurate, and it may consist of multiple distinct signals that overlap and conflict.

Examples that can be thought of as sources of weak supervision include:

  • Domain heuristics (e.g. common patterns, rules of thumb, etc.)
  • Existing ground-truth data that is not an exact fit for the task at hand, but close enough to be useful (traditionally called “distant supervision”)
  • Unreliable non-expert annotators (e.g. crowdsourcing)

Snorkel is a system built around the data programming paradigm for rapidly creating, modeling and managing training data.

The data programming paradigm is a simple but powerful approach in which we ask domain experts to encode various weak supervision signals as labeling functions, which are simply functions that label data and can be written in standard scripting languages like Python. These labeling functions encode domain heuristics such as common patterns (via regular expressions) and rules of thumb. The resulting labels are noisy and may conflict with each other.

In Snorkel, the heuristics are called Labeling Functions (LFs). Here are some common types of LFs:

  • Hard-coded heuristics: usually regular expressions (regexes)
  • Syntactics: for instance, Spacy’s dependency trees
  • Distant supervision: external knowledge bases
  • Noisy manual labels: crowdsourcing
  • External models: other models with useful signals
Snorkel Labeling Function Example
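To make the idea concrete, here is a minimal sketch of labeling functions written as plain Python functions. The task (flagging customer complaints), the function names, and the label values are hypothetical illustrations; in actual Snorkel code, such functions are wrapped with its `@labeling_function` decorator.

```python
import re

# Label conventions commonly used with Snorkel-style pipelines:
# -1 means the LF abstains (offers no opinion on this example).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_contains_refund(text):
    # Hard-coded heuristic: a regex match on "refund" suggests a complaint.
    return POSITIVE if re.search(r"\brefund\b", text, re.IGNORECASE) else ABSTAIN

def lf_contains_thanks(text):
    # Rule of thumb: a message thanking support is unlikely to be a complaint.
    return NEGATIVE if "thank" in text.lower() else ABSTAIN

# Each LF votes independently; votes can agree, conflict, or abstain.
labels = [lf_contains_refund("I want a refund now"),
          lf_contains_thanks("Thanks, all sorted!")]
```

Note that each LF covers only the examples its heuristic fires on and abstains everywhere else, which is why many overlapping LFs are combined downstream.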

After you write your LFs, Snorkel trains a Label Model that exploits conflicts between the LFs to estimate each one’s accuracy. By looking at how often the labeling functions agree or disagree with one another, it learns an estimated accuracy for each supervision source (e.g., an LF that the other LFs tend to agree with gets a high learned accuracy, whereas an LF that disagrees with the others whenever they vote on the same example gets a low one). When labeling a data point, each LF casts a vote: positive, negative, or abstain. By combining these votes, weighted by the estimated accuracies, the Label Model assigns each example a fuzzy “noise-aware” label (a probability between 0 and 1) instead of a hard label (either 0 or 1), and it can do so programmatically for millions of data points. Finally, the goal is to train a classifier that can generalize beyond our LFs.
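The weighted-vote idea can be sketched in a few lines. This is a simplification, not the real Snorkel Label Model: Snorkel learns the accuracies from agreement and disagreement patterns, whereas here they are assumed given so that the combination step itself is visible.

```python
import math

ABSTAIN = -1  # LFs may decline to vote on an example

def noise_aware_label(votes, accuracies):
    """Combine binary LF votes (1, 0, or ABSTAIN) into P(y=1) via log-odds."""
    log_odds = 0.0
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue  # abstaining LFs contribute no evidence
        weight = math.log(acc / (1 - acc))  # weight grows with estimated accuracy
        log_odds += weight if vote == 1 else -weight
    return 1 / (1 + math.exp(-log_odds))  # sigmoid: log-odds -> probability

# Two accurate LFs vote positive, one abstains, one weaker LF votes negative:
p = noise_aware_label([1, 1, ABSTAIN, 0], [0.9, 0.8, 0.7, 0.6])
```

Because the more accurate LFs carry larger weights, the result is a soft label close to 1 here; when all LFs abstain, the function falls back to an uninformative 0.5.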

Snorkel Model

Three big pros of this approach are:

1. We’ve improved the scalability of our labeling approach: each LF can contribute label information to tens, hundreds, or thousands of examples — not just one.

2. We now have a use for unlabeled data. We can apply our LFs on all the unlabeled examples to create a whole lot of not perfect, but “good enough” labels for a potentially huge training data set.

3. These labels can be used to train a powerful discriminative classifier with a large feature set that generalizes beyond the reasons directly addressed by the LFs. (So even if we only use 100 LFs, the examples they label may each have thousands of features whose weights are learned by the discriminative classifier).

So by getting large volumes of lower quality supervision in this way and using statistical techniques to deal with noisier labels, we can train higher-quality models.
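The final training step can be illustrated with a toy, pure-Python logistic regression (the data, feature, and hyperparameters are hypothetical). The key point is that the cross-entropy loss accepts soft targets in [0, 1], so the probabilistic labels from the Label Model can be used directly instead of hard 0/1 labels.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_logreg(X, soft_labels, lr=0.5, epochs=200):
    """Single-feature logistic regression fit to soft targets by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, p in zip(X, soft_labels):
            pred = sigmoid(w * x + b)
            grad = pred - p        # gradient of cross-entropy w.r.t. the logit
            w -= lr * grad * x
            b -= lr * grad
    return w, b

# Toy data: the feature correlates with the noise-aware labels
# that a label model might have assigned from weighted LF votes.
X = [2.0, 1.5, -1.0, -2.0]
soft = [0.9, 0.8, 0.2, 0.1]
w, b = train_logreg(X, soft)
```

In practice the discriminative model would be a much larger classifier over thousands of features, which is what lets it generalize beyond the heuristics encoded in the LFs.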

A number of companies have adopted Snorkel’s weak supervision approach.

