
Snorkel — A Weak Supervision System

 4 years ago
source link: https://www.tuicool.com/articles/7VJv6vV

Today’s powerful models, such as deep neural networks, produce state-of-the-art results on many tasks and are easier to spin up than ever before (using pre-trained models like ULMFiT and BERT). So instead of spending the bulk of our time carefully engineering features for our models, we can feed raw data (images, text, etc.) into these systems and let them learn their own features. This success has a hidden cost, though: these models require massive labeled training sets. For most real-world tasks, such training sets either do not exist or are rather small, and creating them can be expensive, slow, time-consuming, or even impractical (e.g., due to privacy concerns). The problem is made worse when domain expertise is required to label the data. Moreover, tasks can change over time, while hand-labeled training data is static and does not adapt.

A team at Stanford developed a set of approaches, broadly termed “weak supervision,” to address this data-labeling bottleneck. The idea is to programmatically label millions of data points.

There are various ways to programmatically generate training data: heuristics, rules of thumb, existing databases, ontologies, etc. The resulting training data is called weak supervision: it isn’t perfectly accurate, and it may consist of multiple distinct signals that overlap and conflict.

Examples that can be thought of as sources of weak supervision include:

  • Domain heuristics (e.g. common patterns, rules of thumb, etc.)
  • Existing ground-truth data that is not an exact fit for the task at hand, but close enough to be useful (traditionally called “distant supervision”)
  • Unreliable non-expert annotators (e.g. crowdsourcing)

Snorkel is a system built around the data programming paradigm for rapidly creating, modeling and managing training data.

The data programming paradigm is a simple but powerful approach in which we ask domain experts to encode various weak supervision signals as labeling functions, which are simply functions that label data and can be written in standard scripting languages like Python. These labeling functions encode domain heuristics such as common patterns (via regular expressions) and rules of thumb. The resulting labels are noisy and may conflict with each other.

In Snorkel, the heuristics are called Labeling Functions (LFs). Here are some common types of LFs:

  • Hard-coded heuristics: usually regular expressions (regexes)
  • Syntactics: for instance, Spacy’s dependency trees
  • Distant supervision: external knowledge bases
  • Noisy manual labels: crowdsourcing
  • External models: other models with useful signals
Snorkel Labeling Function Example
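To make the idea concrete, here is a minimal sketch of labeling functions written as plain Python functions. The task (flagging customer complaints), the function names, and the label values are hypothetical illustrations; in actual Snorkel code, such functions are wrapped with its `@labeling_function` decorator.

```python
import re

# Label conventions commonly used with Snorkel-style pipelines:
# -1 means the LF abstains (offers no opinion on this example).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_contains_refund(text):
    # Hard-coded heuristic: a regex match on "refund" suggests a complaint.
    return POSITIVE if re.search(r"\brefund\b", text, re.IGNORECASE) else ABSTAIN

def lf_contains_thanks(text):
    # Rule of thumb: a message thanking support is unlikely to be a complaint.
    return NEGATIVE if "thank" in text.lower() else ABSTAIN

# Each LF votes independently; votes can agree, conflict, or abstain.
labels = [lf_contains_refund("I want a refund now"),
          lf_contains_thanks("Thanks, all sorted!")]
```

Note that each LF covers only the examples its heuristic fires on and abstains everywhere else, which is why many overlapping LFs are combined downstream.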

After you write your LFs, Snorkel trains a Label Model that exploits conflicts between the LFs to estimate each one’s accuracy. By looking at how often the labeling functions agree or disagree with one another, it learns an estimated accuracy for each supervision source (e.g., an LF that the other LFs tend to agree with gets a high learned accuracy, whereas an LF that disagrees with the others whenever they vote on the same example gets a low one). When labeling a data point, each LF casts a vote: positive, negative, or abstain. By combining these votes, weighted by the estimated accuracies, the Label Model assigns each example a fuzzy “noise-aware” label (a probability between 0 and 1) instead of a hard label (either 0 or 1), and it can do so programmatically for millions of data points. Finally, the goal is to train a classifier that can generalize beyond our LFs.
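The weighted-vote idea can be sketched in a few lines. This is a simplification, not the real Snorkel Label Model: Snorkel learns the accuracies from agreement and disagreement patterns, whereas here they are assumed given so that the combination step itself is visible.

```python
import math

ABSTAIN = -1  # LFs may decline to vote on an example

def noise_aware_label(votes, accuracies):
    """Combine binary LF votes (1, 0, or ABSTAIN) into P(y=1) via log-odds."""
    log_odds = 0.0
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue  # abstaining LFs contribute no evidence
        weight = math.log(acc / (1 - acc))  # weight grows with estimated accuracy
        log_odds += weight if vote == 1 else -weight
    return 1 / (1 + math.exp(-log_odds))  # sigmoid: log-odds -> probability

# Two accurate LFs vote positive, one abstains, one weaker LF votes negative:
p = noise_aware_label([1, 1, ABSTAIN, 0], [0.9, 0.8, 0.7, 0.6])
```

Because the more accurate LFs carry larger weights, the result is a soft label close to 1 here; when all LFs abstain, the function falls back to an uninformative 0.5.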

Snorkel Model

Three big pros of this approach are:

1. We’ve improved the scalability of our labeling approach: each LF can contribute label information to tens, hundreds, or thousands of examples — not just one.

2. We now have a use for unlabeled data. We can apply our LFs on all the unlabeled examples to create a whole lot of not perfect, but “good enough” labels for a potentially huge training data set.

3. These labels can be used to train a powerful discriminative classifier with a large feature set that generalizes beyond the reasons directly addressed by the LFs. (So even if we only use 100 LFs, the examples they label may each have thousands of features whose weights are learned by the discriminative classifier).

So by getting large volumes of lower quality supervision in this way and using statistical techniques to deal with noisier labels, we can train higher-quality models.
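The final training step can be illustrated with a toy, pure-Python logistic regression (the data, feature, and hyperparameters are hypothetical). The key point is that the cross-entropy loss accepts soft targets in [0, 1], so the probabilistic labels from the Label Model can be used directly instead of hard 0/1 labels.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_logreg(X, soft_labels, lr=0.5, epochs=200):
    """Single-feature logistic regression fit to soft targets by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, p in zip(X, soft_labels):
            pred = sigmoid(w * x + b)
            grad = pred - p        # gradient of cross-entropy w.r.t. the logit
            w -= lr * grad * x
            b -= lr * grad
    return w, b

# Toy data: the feature correlates with the noise-aware labels
# that a label model might have assigned from weighted LF votes.
X = [2.0, 1.5, -1.0, -2.0]
soft = [0.9, 0.8, 0.2, 0.1]
w, b = train_logreg(X, soft)
```

In practice the discriminative model would be a much larger classifier over thousands of features, which is what lets it generalize beyond the heuristics encoded in the LFs.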

A number of companies have adopted Snorkel’s weak supervision approach.

