
Active Learning: Strategies, Tools, and Real-World Use Cases

Source: https://neptune.ai/blog/active-learning-strategies-tools-use-cases

In this article, you’ll learn:

  • what active learning is,
  • why we need active learning,
  • how it works,
  • what techniques are available,
  • where it’s used in the real world,
  • and what frameworks can help with active learning.

Let’s begin!

What is active learning?

Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs. In statistics literature, it is sometimes referred to as optimal experimental design. – Source

It is an important technique for building a decent machine learning model while keeping the amount of supervised/labeled data to a minimum, by selecting the data points that appear to be the most informative.

This technique is also considered in situations where labeling is difficult or time-consuming. Passive learning, the conventional approach in which a large quantity of labeled data is created by a human oracle, requires an enormous effort in terms of man-hours.

In a successful active learning system, the algorithm is able to choose the most informative data points through some defined metric, subsequently passing them to a human labeler and progressively adding them to the training set. A diagrammatic representation is shown below.

Active learning
Diagram of active learning system | Source: Author

Why do we need active learning?

The motivation

The idea of active learning is inspired by the known concept that not all data points are equally important for training a model. Just have a look at the data points shown below. It’s a cluster of two sets with a decision boundary in between.

A cluster of two sets
A cluster of two sets with a decision boundary in between | Source: Author

Now assume a scenario with tens of thousands of data points and no labels to learn from. It would be cumbersome, or even extremely expensive, to label all those points manually. To mitigate this pain, suppose a random subset of data is selected from the lot and labeled for model training: most likely, we would end up with a model with sub-par performance, as you can observe in the image below. The catch is that the decision boundary created through this random sampling can lead to lower accuracy and other diminished performance metrics.

Sub-par performance
A model with sub-par performance | Source: Author

But what if we somehow manage to select a bunch of data points near the decision boundary and help the model to learn selectively? This would be the most preferred scenario for selecting the samples from the given unlabelled dataset. This is how the concept of active learning originated and evolved.

Scenario of NLP training
Possible scenario of NLP training | Source: Author

The illustrations that you see above can be easily applied to scenarios in NLP training, as getting relevant labeled datasets for POS tagging & Named Entity Recognition, etc., could be a challenge.

Data labeling costs

Creating labeled data for large-scale training is quite expensive and time-consuming. The pricing details of Data Labeling by Google Cloud Platform, per 1,000 units, are shown below. Typically, there should be 2-3 data labelers for each instance to ensure better label quality. Tier 1 covers up to 50K units, and Tier 2 applies above that volume.

The pricing details of Data Labelling

The pricing details of Data Labelling by Google Cloud Platform/1000 units | Source

By looking at the cost shown above, you can get a rough idea of the typical labeling costs if you want to label 100,000 images for training a specialized ML model involving bounding boxes. Say around 5 bounding boxes and two labelers per image. The cost could turn out to be around $112,000. But again, this cost could be large or small, depending on the size of your organization and project budgets.

Let’s look at another scenario that could happen in the healthcare space. Imagine we need to label 10,000 medical images, each needing an average of 15 semantically segmented objects and 3 labelers for improved accuracy. The cost could be $391,500!
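As a quick sanity check, here is a minimal sketch of that back-of-the-envelope math in Python. The per-unit prices are illustrative placeholders chosen to roughly reproduce the figures above, not the actual GCP price list, so plug in your provider's current rates.

# Rough labeling-cost estimator; the per-unit prices are illustrative placeholders.
def labeling_cost(n_items, annotations_per_item, labelers_per_item, price_per_unit):
    total_units = n_items * annotations_per_item * labelers_per_item
    return total_units * price_per_unit

# Bounding boxes: 100,000 images, ~5 boxes each, 2 labelers per image
print(labeling_cost(100_000, 5, 2, price_per_unit=0.11))   # ~110,000 dollars

# Segmentation: 10,000 medical images, ~15 objects each, 3 labelers per image
print(labeling_cost(10_000, 15, 3, price_per_unit=0.87))   # ~391,500 dollars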

I guess you now have an idea of how quickly the cost of data labeling can shoot up. The numbers mentioned above are very real scenarios. This could go even substantially higher if we are training a large language model or a vision-based model on large image datasets.

Either way, whether you are employing internal resources or using external labeling services, labeling time is going to be significant, which you want to avoid. Oftentimes, the data needs to be labeled by someone who is familiar with the domain.

Active learning can bypass these challenges considerably. Hence, there is a big push toward active learning in certain sectors, such as training of large NLP systems, autonomous driving, etc. Take a data labeling scenario in an autonomous driving system observed by NVIDIA.

A simple back-of-the-envelope calculation shows that a fleet of 100 cars driving eight hours per day would require more than 1 million labelers to annotate all frames from all cameras at 30fps for object detection. This is utterly impractical.

I guess I don’t need to emphasize the costs involved, both money and time, any further! If you’re still not convinced, just think about the possible carbon footprint!

Issues caused by Edge scenarios

Edge case scenarios can be as harmless as the one shown below, where the algorithm is confused between a puppy and a cupcake. Or they can be as dangerous as an autonomous truck causing a major accident because it failed to detect a dark-colored vehicle crossing the road at night! Seemingly low occurrences of these edge case failures can quickly result in significant negative outcomes with high-cost implications.

Confusion between a puppy and a cupcake
The algorithm is confused between a puppy and a cupcake | Source

The number of data points dealing with edge cases in the training data is low, which ultimately results in ML models failing to learn them.

Bounding box detections
Example bounding box detections for cars and pedestrians for a day & night image | Source

Active learning systems can be trained to identify edge cases so that they can be labeled effectively by humans and fed into the labeled data pool with adequate, or even higher-weighted, representation! A more detailed take on active learning in autonomous driving systems is given in a later section.

How active learning works: scenarios

So….memes apart! Let’s do a deep dive into how active learning works.

In a classic active learning cycle, the algorithm selects the most valuable data instances (which could be edge cases as well) and requests them to be labeled by a human. These newly labeled instances are then added to the training set, and the model is retrained. The selection of these data instances happens through a process researchers typically call querying. Quoting a line from the article Query-Based Learning:

Query-based learning is an active learning process where the learner has a dialogue with a teacher, which provides useful information on request about the concept to be learned.

The teacher mentioned in the above quote could be a basic ML model or a simple algorithm to start with, often trained beforehand on a small subset of labeled data. How a query is formed plays a key role in developing an active learning algorithm. So whether you are looking for an ambiguous case or an edge case to learn from depends closely on how you design the query formation.

There are typically two broad scenarios of active learning in the literature: query synthesis-based and sampling-based.

Query synthesis

It’s primarily based on the assumption that instances close to the classification boundary are more ambiguous, and getting those instances labeled will provide the most information to the learning model. So we query points near the decision boundary, either by creating new points or by selecting the points already nearest to it using Euclidean distance-based techniques.

The algorithm

  1. The first step would be to create a simple model with a sample of labeled data. Let’s assume a binary classification scenario, and suppose X1 & X2 are two points belonging to two separate classes. Using an approach similar to a binary search, we can find instances close to the decision boundary by querying the labels of their midpoints. The midpoint here refers to the Euclidean midpoint of the line connecting points X1 & X2. Consider this example.

Let’s take two points belonging to different classes, defined by X1 = (2, 5, 4) and X2 = (6, 4, 3). Since they have three features, finding their midpoint is a simple Euclidean operation, which gives the midpoint as

Xmid = (X1 + X2)/2 = (4, 4.5, 3.5)

So if we let the label of the midpoint (X1 + X2)/2 be queried, then we know in which direction to go for further queries and can land on an instance very close to the decision boundary.

  2. But how do we know whether we have reached the decision boundary? For that, we define a threshold on the difference in probability scores between the opposite classes. If the score difference falls below this value, we stop querying.
  3. But here, we run the risk of querying only in the local neighbourhood. So to get more points in the vicinity, we need to try some more tricks, as discussed below. Of course, if there are enough points spread out in feature space to start with, this is addressed to some extent.
  4. To query beyond the local vicinity, we can add an orthogonal vector with respect to the closest pair from the opposite classes and then find its midpoint. How do we do this? We find the pair’s midpoint again and then find another point in its vicinity by adding a line orthogonal to it. The length of this line can be on the order of the Euclidean distance between the parent points. The procedure is briefly illustrated below.
Query beyond the local vicinity
The procedure of query beyond the local vicinity | Source: Author
  5. The next step is to get the label of this point (from the current model), find the closest unlabelled point from the other class, and repeat the process all over again. This way, we can generate a bunch of data instances that are close to the decision boundary of the current model. We then obtain labels from a human oracle for these generated points and retrain the previous model with this additional data. The whole process is repeated until it hits our model performance targets. A minimal code sketch of this procedure is given after the figure below.
Visualization for generating queries
Visualization for generating queries by finding an opposite pair close to the hyperplane | Source
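Here is a minimal sketch of the bisection-style query synthesis just described. It assumes a binary classifier with a scikit-learn style predict_proba and two seed points X1 and X2 from opposite classes; query_oracle is a hypothetical stand-in for the human labeler.

import numpy as np

def synthesize_boundary_query(model, x_pos, x_neg, tol=0.1, max_steps=10):
    # Bisect the segment between two points of opposite classes until the
    # model's top-two class probabilities differ by less than `tol`.
    x_mid = (x_pos + x_neg) / 2.0
    for _ in range(max_steps):
        x_mid = (x_pos + x_neg) / 2.0                   # Euclidean midpoint
        p = model.predict_proba(x_mid.reshape(1, -1))[0]
        if abs(p[1] - p[0]) < tol:                      # close enough to the boundary
            break
        if p[1] >= 0.5:                                 # midpoint falls on the "positive" side
            x_pos = x_mid
        else:                                           # midpoint falls on the "negative" side
            x_neg = x_mid
    return x_mid

# x_query = synthesize_boundary_query(model, X1, X2)
# label = query_oracle(x_query)   # hypothetical human-oracle call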

The above methods work quite well in theory. But what if we are dealing with a complex language model or a computer vision use case where the generated point to query doesn’t make any sense to the human oracle?

If this is a computer vision task, these generated instances might not be recognizable by a human oracle. Hence, we select the nearest neighbour from the unlabelled set and query its label instead. The nearest neighbour search can use simple Euclidean distance measures or cosine similarity.
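A minimal sketch of that fallback with scikit-learn: each synthesized query is replaced by its closest real instance from the unlabelled pool (Euclidean by default; pass metric="cosine" for cosine distance).

import numpy as np
from sklearn.neighbors import NearestNeighbors

def snap_to_pool(synthetic_queries, X_unlabeled, metric="euclidean"):
    # Map each synthesized query point to the closest real unlabeled sample,
    # so the human oracle is always shown an actual, recognizable instance.
    nn = NearestNeighbors(n_neighbors=1, metric=metric).fit(X_unlabeled)
    _, idx = nn.kneighbors(np.atleast_2d(synthetic_queries))
    return idx.ravel()   # indices into X_unlabeled to send for labeling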

I hope you are all familiar with the MNIST dataset for handwritten digits classification. Please have a look at the set of images shown below. The one shown at the top shows a random sample of handwritten numbers in the MNIST dataset. The one at the bottom shows some numbers (3, 5, 7) selected through querying and nearest neighbour search from the same dataset. But don’t the ones at the bottom look weirder than the top? Yeah, that’s right! That’s how it’s supposed to be! Because the algorithm is looking for the edge cases or, to put it another way, the weird ones!

A random sample of handwritten numbers in the MNIST dataset | Source
Numbers (3, 5, 7) selected through querying and nearest neighbour search from the same MNIST dataset | Source

You can read in detail about this procedure in the article on nearest neighbour search.

Active learning using sampling techniques

Active learning using sampling can be boiled down to the following steps:

  1. Label a subsample of data using a human oracle.
  2. Train a relatively light model on the labeled data.
  3. Let the model predict the class of every remaining unlabelled data point.
  4. Give every unlabelled data point a score based on the model outputs.
  5. Choose a subsample based on these generated scores and send it out for labeling (the size of the subsample could depend on the labeling budget/resources and time available).
  6. Retrain the model on the cumulative labeled dataset.

Repeat steps 3-6 until the model approaches the desired level of performance. At a later stage, you can increase the model complexity as well. A minimal sketch of this loop is shown below, and a couple of scenarios based on it follow.
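The sketch below is a hedged illustration of steps 1-6 with scikit-learn. It scores the pool with prediction entropy and uses a hypothetical ask_oracle function in place of the human labeler; the estimator and batch size are arbitrary choices.

import numpy as np
from scipy.stats import entropy
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_labeled, y_labeled, X_pool, ask_oracle,
                         n_rounds=5, batch_size=100):
    # Pool-based active learning: score the pool, label the top batch, retrain.
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        model.fit(X_labeled, y_labeled)                   # steps 2/6: (re)train
        probs = model.predict_proba(X_pool)               # step 3: predict on the pool
        scores = entropy(probs.T)                         # step 4: uncertainty score
        query_idx = np.argsort(scores)[-batch_size:]      # step 5: pick a batch
        new_labels = ask_oracle(X_pool[query_idx])        # hypothetical human labeling
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, new_labels])
        X_pool = np.delete(X_pool, query_idx, axis=0)     # shrink the pool
    return model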

1. Stream-based sampling 

In stream-based selective sampling, unlabelled data is continuously fed to an active learning system, where the learner decides whether to send the same to a human oracle or not based on a predefined learning strategy. This method is apt in scenarios where the model is in production and the data sources/distributions vary over time. 

2. Pool-based sampling

In this case, the data samples are chosen from a pool of unlabelled data based on the informative value scores and sent for manual labeling. Unlike stream-based sampling, oftentimes, the entire unlabelled dataset is scrutinized for the selection of the best instances.

So now we know that, from a broader perspective, active learning algorithms zero in on a subsample of the unlabelled dataset, but how does this selection happen? Below, we discuss strategies that allow the selection of the data samples most relevant to the model’s learning.

Active learning: strategies for subsampling

Committee based strategies

Several models are built, and informative samples are chosen based on these models’ predictions. The ensemble of models referred to here is called a committee. If we have n different models in the committee, then we can have n predictions for one data sample. Sampling could be based on voting, on the variance produced (in the case of a regressor), or on the disagreement between these models.

There are a couple of popular ways in which information or prioritization scores can be generated for each of the data samples:

Entropy

As per Wikipedia, entropy is a scientific concept as well as a measurable physical property that is most commonly associated with a state of disorder, randomness, or uncertainty.

So imagine a supervised learning scenario with three available classes to predict. The initial model predicted class probabilities such as class_1 (0.45), class_2 (0.40), class_3 (0.15). The probabilities corresponding to the top two classes are quite close to each other, with a difference of only 0.05. That means the model is uncertain about the label it has to assign to the data instance, hence the close probabilities for two or more classes. Entropy is usually computed as a summation across the classes, as shown below.

H(x) = − Σy P(y|x) log P(y|x)

Here, y ranges over the available classes and P(y|x) is the predicted probability of class y for the unlabelled instance x. The entropy values are calculated for the unlabelled data points, and a selected sample is sent for labeling. For more context, please read about active learning in machine learning.
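As a small sketch, the entropy score above can be computed directly from any classifier's predict_proba output:

import numpy as np

def entropy_scores(probs, eps=1e-12):
    # probs: array of shape (n_samples, n_classes) from model.predict_proba.
    # Returns H(x) = -sum_y P(y|x) * log P(y|x) for every unlabelled sample.
    return -np.sum(probs * np.log(probs + eps), axis=1)

# probs = model.predict_proba(X_unlabeled)
# send the highest-entropy samples to the human oracle first:
# query_idx = np.argsort(entropy_scores(probs))[::-1][:budget]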

KL-Divergence

KL-Divergence represents the difference between two probability distributions, or, to put it another way, it’s a type of statistical distance: a measure of how one probability distribution P differs from a second, reference probability distribution. For more background, please refer to Kullback–Leibler divergence.
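In a committee setting, one common way to use this (sketched below, not tied to any particular library) is to score each unlabelled sample by the average KL divergence of every committee member's predictive distribution from the consensus (mean) distribution:

import numpy as np
from scipy.special import rel_entr

def mean_kl_disagreement(member_probs):
    # member_probs: array of shape (n_members, n_samples, n_classes), i.e. the
    # predicted distribution of each committee member for every pool sample.
    # Returns, per sample, the average KL divergence of each member from the
    # consensus (mean) distribution - a common committee disagreement score.
    consensus = member_probs.mean(axis=0)              # (n_samples, n_classes)
    kl = rel_entr(member_probs, consensus).sum(axis=2) # KL per member and sample
    return kl.mean(axis=0)                             # (n_samples,)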

Two prominent committee-based strategies from the available research papers are touched upon below.

  1. Entropy-based Query by Bagging

In this approach, k training sets are created from the original dataset by sampling with replacement. These drawn-out subsets are fed to an equal number of models (which could differ in model types/hyperparameters) for training. These models are then used for predictions on the unlabeled pool of data samples. The heuristic used for measuring disagreement in this approach is entropy, hence the name EQB (Entropy-based Query by Bagging). A minimal sketch of this idea is given after the next strategy.

  2. Adaptive Maximum Disagreement (AMD)

This is achieved by splitting the feature space and providing datasets with different features to each of the models. This way, we have several models trained on the dataset, each seeing distinct features. The disagreement metric used is the same as in the previous strategy (entropy).
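Here is a minimal sketch of the bagging-based committee idea behind EQB: train members on bootstrap resamples and rank pool samples by the entropy of the committee's votes. The base estimator and committee size are arbitrary choices.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def eqb_scores(X_labeled, y_labeled, X_pool, n_models=7):
    # Entropy-based Query by Bagging (sketch): train a committee on bootstrap
    # resamples and score each pool sample by the entropy of the committee's votes.
    classes = np.unique(y_labeled)
    votes = np.zeros((len(X_pool), len(classes)))
    for _ in range(n_models):
        X_b, y_b = resample(X_labeled, y_labeled, stratify=y_labeled)  # bootstrap
        member = DecisionTreeClassifier().fit(X_b, y_b)
        preds = member.predict(X_pool)
        for j, c in enumerate(classes):
            votes[:, j] += (preds == c)
    vote_dist = np.clip(votes / n_models, 1e-12, 1.0)   # per-sample vote distribution
    return -np.sum(vote_dist * np.log(vote_dist), axis=1)  # vote entropy per sample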

Large-margin based strategies

These are specially meant for margin-based classifiers such as SVMs. In an SVM, the support vectors are the most informative points, hence the selected data points are the ones falling around this margin. The distance to the separating hyperplane can be used as a metric for the model’s confidence or certainty about unlabeled data samples. Several strategies in this category can also be modified to apply to probability-based models.

Margin Sampling (MS)

Support vectors are the labeled data samples that lie on the margin, at a distance of exactly 1 from the decision boundary. The Margin Sampling strategy is based on the assumption that data samples which fall within this margin region are the most relevant ones for obtaining labels.

x̂ = arg min over xi ∈ U of |f(xi, w)|

In the equation shown above, f(xi, w) represents the distance between a data sample xi and the hyperplane of class w, and U represents the unlabelled dataset. This strategy selects a single data sample for querying.
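A minimal sketch for the binary case with scikit-learn's SVC, where the absolute value of decision_function plays the role of |f(xi, w)|:

import numpy as np
from sklearn.svm import SVC

def margin_sampling_query(X_labeled, y_labeled, X_pool):
    # Binary margin sampling (MS): query the single unlabeled sample that lies
    # closest to the SVM hyperplane, i.e. the argmin of |f(xi, w)| over the pool.
    svm = SVC(kernel="linear").fit(X_labeled, y_labeled)
    distances = np.abs(svm.decision_function(X_pool))  # |signed distance| to the hyperplane
    return int(np.argmin(distances))                    # index of the sample to label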

Margin Sampling-closest Support Vectors (MS-cSV)

This strategy first stores the position of each data sample relative to the support vectors. For each support vector, the data sample with the lowest distance from that support vector is selected. This way, we can obtain more than one unlabeled data sample in every iteration, removing the drawback of simple margin sampling, i.e., selecting only a single data sample per iteration for querying a human oracle.

Probability-based strategies

This is based on the estimation of class membership probabilities. Unlike margin-based strategies, it suits any model that can compute class probabilities for a data instance.

Probability-based smallest margin strategy

This accounts for the difference between the prediction probabilities of the highest and second-highest classes. Once these are computed for the entire unlabelled dataset, a sample can be drawn based on the generated scores and sent for labeling.

x̂ = arg min over x of [ P(y1|x) − P(y2|x) ]

Here, y1 and y2 are the most and second-most probable classes for instance x, and P(y1|x), P(y2|x) are their respective predicted probabilities.

With the above relation, the instances with the lowest margin are the ones sent for labeling first, i.e., they are the ones with the least certainty between the top two probable classes.

Least confidence

This strategy lets an active learner select the unlabeled data samples for which the model is least confident in its prediction or class assignment. So if the highest probability the model predicted for any class is 0.5, the LC value becomes 0.5.

The relation can be deduced from the form given below:

x̂ = arg max over x of [ 1 − P(ŷ|x) ], where ŷ is the class with the highest predicted probability
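Both the smallest-margin and least-confidence scores reduce to a few lines over predict_proba; a small sketch follows (model and budget are placeholders):

import numpy as np

def smallest_margin_scores(probs):
    # Difference between the top two class probabilities per sample;
    # the smallest margins are queried first.
    part = np.sort(probs, axis=1)
    return part[:, -1] - part[:, -2]

def least_confidence_scores(probs):
    # 1 - P(most probable class); the largest values are queried first.
    return 1.0 - probs.max(axis=1)

# probs = model.predict_proba(X_pool)
# query_idx = np.argsort(smallest_margin_scores(probs))[:budget]         # smallest margins
# query_idx = np.argsort(least_confidence_scores(probs))[::-1][:budget]  # least confident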

Expected Model Change

In this strategy, the algorithm selects the data instance that causes the maximum change in the model. This change can, for example, be measured by the gradient corresponding to this data instance during the SGD (stochastic gradient descent) process.
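As a hedged illustration for binary logistic regression (not a general recipe), the expected gradient length of a candidate works out to 2·p·(1−p)·||x||, where p is the model's predicted positive-class probability:

import numpy as np

def expected_gradient_length(model, X_pool):
    # Expected-model-change sketch for binary logistic regression: for each
    # candidate x, take the expectation over labels y of the norm of the loss
    # gradient (p - y) * x, which works out to 2 * p * (1 - p) * ||x||.
    p = model.predict_proba(X_pool)[:, 1]      # P(y = 1 | x)
    norms = np.linalg.norm(X_pool, axis=1)
    return 2.0 * p * (1.0 - p) * norms         # larger score = bigger expected update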

Expected Error Reduction

The algorithm tries to capture those data instances which will reduce the error in the subsequent iteration. Data samples are progressively selected based on their ability to reduce the model’s error the most.

Apart from the few discussed above, some less popular metrics in use are Maximum Normalized Log-Probability (MNLP) and Bayesian Active Learning by Disagreement (BALD).

Active learning in the real world

Active learning is quite popular in the worlds of NLP & computer vision. Specifically, in the case of NLP, information extraction tasks such as POS tagging, Named Entity Recognition (NER), etc., require lots of (labeled) training data, and the cost of labeling data for these kinds of use cases is really high. Most of the advanced language models are based on deep neural networks and trained on large datasets. However, under typical training scenarios, the upper hand deep learning usually commands diminishes if the datasets are too small. Hence, to make deep learning broadly useful, it is crucial to find an effective solution to this problem.

Active learning use case in NLP (NER)

A use case for improving a Named Entity Recognition (NER) model using active learning is discussed below. A deep dive into active learning specific to NER is given in this paper. The authors compared the above-discussed strategies/scoring metrics against a random sample selected for training in every iteration. The dataset used for benchmarking is OntoNotes 5.0.

Use case of Named Entity Recognition (NER) model using active learning
Active learning use case in NLP (NER) | Source

As we can see above, all of the active learning strategies clearly outperform the random sampling (RAND) baseline by a good margin.

Another representation, showing the performance improvement of different active learning strategies against the number of iterations, is shown below. The same is compared with training data obtained using random sampling.

Active learning strategies vs the number of iterations
The performance improvement based on different active learning strategies vs the number of iterations | Source

Active learning use-case in computer vision (autonomous driving)

Autonomous driving could be the most promising and valuable use case right now where active learning is used and has proven its immense business value. Researchers around the world are focused on improving prediction accuracy to meet the near-perfect performance expectations for autonomous driving systems.

To achieve these high accuracy and performance expectations, vision-focused deep learning models require a large amount of training data. But selecting the “right” training data, which captures all possible conditions/scenarios and edge cases, and at appropriate representational weights, is a huge challenge.

The image shown below is a classic example of an edge case that could confuse the system and result in a potentially catastrophic event in an autonomous driving system.

Potential catastrophic event in Autonomous driving systems
A case of potential catastrophic event in Autonomous driving systems | Source

So how do we detect these kinds of scenarios for labeling? Let’s have a look at how active learning is solving these challenges for some of the leading tech firms.

The crux of the observations from the article ‘Scalable Active Learning for Autonomous Driving’ from NVIDIA is presented here. To optimize the selection of data, the factors mentioned below need to be considered:

  1. Scale: As discussed above already, a simple back-of-the-envelope calculation shows that a fleet of 100 cars driving eight hours per day would require more than 1 million labelers to annotate all frames from all cameras at 30fps for object detection. This is highly impractical.
  2. Cost: As mentioned before, the cost of labeling vision datasets becomes humongous if we aren’t careful about what to label!
  3. Performance: Selecting the right frames, for example, rare scenarios we don’t usually come across.

In the above-mentioned article, they tried to select the best training examples from an unlabelled set of 2 million frames stemming from recordings collected from vehicles on the road. The methodology starts with pool-based sampling and a query/acquisition function based on the disagreement between an ensemble of models. Let’s assume model 1 & model 2 predicted class probabilities of 0.78 & 0.91 for a data instance (X1) for the class with the highest probability, and class probabilities of 0.76 & 0.82 for another data instance (X2). Here the class probability disagreements are X1 (0.13) & X2 (0.06). Clearly, the class disagreement for X1 is higher than for X2, hence X1 would be the preferred candidate for active learning. The acquisition function is applied to unlabelled frames to select the most informative among the pool.
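A minimal sketch of that disagreement score, reproducing the X1/X2 arithmetic above: take each model's probability for a frame's top class and score the frame by the spread across the ensemble.

import numpy as np

def disagreement_scores(ensemble_probs):
    # ensemble_probs: array of shape (n_models, n_frames) holding each model's
    # probability for the top class of every frame. The score is the spread
    # (max - min) across the ensemble, as in the X1/X2 example above.
    return ensemble_probs.max(axis=0) - ensemble_probs.min(axis=0)

# Values from the text: X1 -> 0.13, X2 -> 0.06, so X1 is queried first
scores = disagreement_scores(np.array([[0.78, 0.76],
                                       [0.91, 0.82]]))
print(scores)   # [0.13 0.06]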

The following operations are performed in a loop, which is almost similar to a classic active learning algorithm, but there is a slight difference in the way querying/sampling is done, which is best suited for this particular use case.

  1. Train ‘n’ number of models with random parameters on currently labeled data.
  2. Query the samples that display maximum disagreement between the trained models.
  3. Send the selected data to Human Oracles for annotation.
  4. Append newly labeled examples to training data.
  5. Repeat steps 1-4 until the ensemble reaches the desired level of performance.

The sample heat maps for selected frames are shown below. The heat map shows regions within these selected frames with high levels of ambiguity. These are the data points that we want to capture to make the model learn efficiently! Hence these samples can act as apt candidates for active learning when labeled by human annotators.

Active learning sample
Sample of selected frames via active learning | Source

Aside from the cost advantages, a significant improvement in mean average precision (from an object detection perspective) was observed using active learning.

Test data from both manual curation and active learning
Mean average precision weighted across several object sizes (wMAP) as well as MAP for large and medium object sizes on test data from both manual curation and active learning | Source

Active learning use case in the medical domain

A versatile active learning workflow for the optimization of genetic and metabolic networks is a classic example of active learning in the medical domain, i.e., maximizing a biological objective function (an output/target that depends on multiple factors) with minimal datasets. The image below shows the improvement in protein production yield derived from an active learning-based sampling strategy.

Protein production
The improvement of protein production | Source

Some popular frameworks used for Active Learning

1. modAL: A modular active learning framework for Python3

modAL is an active learning framework for Python3, designed with modularity, flexibility, and extensibility in mind. Built on top of scikit-learn, it allows you to rapidly create active learning workflows with nearly complete freedom.

modAL supports many of the active learning strategies discussed in the previous sections, such as probability/uncertainty-based algorithms, committee-based algorithms, error reduction, and so on. 

Active learning with a scikit-learn classifier, for instance, RandomForestClassifier, can be as simple as the following.



from modAL.models import ActiveLearner
from sklearn.ensemble import RandomForestClassifier

# initializing the learner; X_training refers to the initial labeled dataset
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    X_training=X_training,
    y_training=y_training
)

# query for labels; X_pool refers to the unlabeled dataset
query_idx, query_inst = learner.query(X_pool)

# ...obtaining new labels from the Oracle...

# supply the label for the queried instance
learner.teach(X_pool[query_idx], y_new)

Code source
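Extending this into a full loop stays compact. The sketch below assumes a small labeled seed set (X_seed, y_seed) and a pool (X_pool, y_pool) where y_pool stands in for the answers a human oracle would give; modAL's built-in uncertainty_sampling is used as the query strategy.

import numpy as np
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from sklearn.ensemble import RandomForestClassifier

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=uncertainty_sampling,    # least-confidence scoring
    X_training=X_seed, y_training=y_seed,   # small initial labeled set
)

for _ in range(20):                         # 20 query rounds, one instance each
    query_idx, query_inst = learner.query(X_pool)
    # y_pool[query_idx] stands in for the label a human oracle would provide
    learner.teach(X_pool[query_idx], y_pool[query_idx])
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx)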

2. libact: Pool-based Active Learning in Python

libact is a Python package designed to make active learning easier for real-world users. The package not only implements several popular active learning strategies but also features the ‘active learning by learning’ meta-strategy, which lets the machine automatically learn the best strategy on the fly. Here is an example usage of libact:



# declare a Dataset instance; X is the feature matrix, y holds the labels (None if unlabeled)
dataset = Dataset(X, y)
query_strategy = QueryStrategy(dataset)  # declare a QueryStrategy instance
labeler = Labeler()                      # declare a Labeler instance
model = Model()                          # declare a model instance

for _ in range(quota):  # loop through the number of queries
    query_id = query_strategy.make_query()          # let the QueryStrategy suggest a sample to query
    lbl = labeler.label(dataset.data[query_id][0])  # query the label of the example at query_id
    dataset.update(query_id, lbl)                   # update the dataset with the newly-labeled example
    model.train(dataset)                            # retrain the model on the updated Dataset

3. AlpacaTag

AlpacaTag is an active learning-based crowd annotation framework for sequence tagging, such as named-entity recognition (NER). The distinctive advantages of AlpacaTag, as mentioned in its documentation, are given below.

  • Active, intelligent recommendation: dynamically suggesting annotations and sampling the most informative unlabelled instances.
  • Automatic crowd consolidation: enhancing real-time inter-annotator agreement by merging inconsistent labels from multiple annotators.
  • Real-time model deployment: users can deploy their models in downstream systems while new annotations are being made.

Annotation UI for NER use cases by AlpacaTag is shown below.

Use cases by AlpacaTag
Annotation UI for NER use cases by AlpacaTag | Source

Active Learning in production

Till now, we have tried to understand the concept of active learning, its strategies, and some of its key applications. Now let’s dive into how they are practically implemented in production systems. 

Active learning pipelines mostly fall into automatic and semi-automatic categories. Here is a brief take on them, with production-ready workflows given as examples.

Active learning pipelines – semi-automatic

In a semi-automatic or semi-active learning approach, each cycle runs automatically, but it needs to be triggered manually. The key areas where manual intervention is required are:

1. Selecting the data for annotation: we need to choose the next samples in an informed way.

2. Selecting the best model out of the ensemble of models created in each cycle.

A semi-automatic pipeline needs to be closely monitored for its performance indicators, such as the model performance metrics after each active learning cycle. Inherently, this technique is prone to errors, especially when it requires a large number of learning cycles.

An example of semi-automatic or semi-active learning with AutoTrain and Prodigy

This section briefly describes how we can use AutoTrain and Prodigy to build an active learning pipeline.

But before that, let’s have a quick look into what AutoTrain & Prodigy are!

As the name suggests, AutoNLP, now named AutoTrain, is a framework created by Hugging Face to build our own deep learning models on available datasets with very minimal coding. AutoNLP is built on state-of-the-art NLP models such as Transformers, the NLP inference API, and other tools. You can easily upload your data and the corresponding labels to initiate the training pipeline. With AutoTrain, we get to use the best available models; it can automatically fine-tune based on the use case and serve the result to the end user. Hence it is production-ready!

Prodigy is a commercial annotation tool by Explosion. Primarily, it is a web-based tool that allows you to annotate data in real time. It supports both NLP & computer vision tasks. But you can use any other open source or commercial tool that best fits your use case and cost constraints!

The steps involved in running the pipeline are:

  1. Create a new Project in AutoTrain.
A new Project in AutoTrain
Creating a new Project in AutoTrain | Source
  2. For the labeling part, Prodigy offers an interactive interface for NLP tasks. You can install it on your local machine or on cloud servers to get started. Once Prodigy is installed, you can run it in the following format:

$ prodigy ner.manual labelled_dataset blank:en dataset.csv --label label1,label2,label3

Let’s look at the arguments here,

  • blank:en is a blank English pipeline used for tokenization.
  • dataset.csv is your custom dataset (the source file to annotate).
  • label1,label2,... is the list of labels that will be used for annotation.

Once the command is run, you can go to the web UI provided by Prodigy, which usually runs on one of the localhost ports, and start the labeling process. The process of labeling data for a NER use case with Prodigy is shown below. For more details, please refer to the article.

Labeling data for a NER
The process of labeling data for a NER | Source

The image below shows the Prodigy web UI used for vanilla text classification labeling.

Vanilla text classification labeling
The Prodigy web UI used for vanilla text classification labeling | Source
  3. The created labeled dataset is converted into an AutoTrain-readable format for uploading into your AutoTrain project. Please refer to this article for more clarity.
  4. Run the AutoTrain pipeline for NER and visualize the accuracy of the ensemble of models created.
The accuracy of the ensemble of models
Visualization of the accuracy of the ensemble of models created | Source
  5. Have a look at the accuracy of the NER models, and repeat the process from step 2 with more labeled data if you aren’t satisfied with the model performance.

The process described above is semi-active learning because there isn’t an algorithm that explicitly chooses which entities would be the best ones to label; in real-world NER use cases, that is often not feasible. Hence semi-active learning is a good start in the NER domain!

Active learning pipelines – automatic

So what happens in an automatic pipeline is that we leave the two manual intervention steps to a smart algorithm. Selecting the data for annotation could be done through any one of the sampling techniques discussed in the previous section, i.e., querying, pool-based sampling, etc. Selecting the best model could again be based on any of the performance metrics of our choice or their weighted values. Here are some examples of AL pipelines with AWS.

The one shown below is a reference architecture using real-time classification APIs, feedback loops, and human review workflows. The primary service used here is Amazon Comprehend.

A reference architecture using Real-time classification
A reference architecture using Real-time classification APIs, feedback loops, and human review workflows | Source

Suppose we have a large text-based dataset for which we need to build a classifier by leveraging active learning. Out of the plethora of services available from AWS, Amazon Comprehend is specially built for text analytics, be it classification, uncovering insights, topic modeling, or sentiment analysis. A few of the services involved in the pipeline are briefly described below.

  • The process starts with calling an Amazon API Gateway endpoint with a text that needs to be classified. The API Gateway endpoint then calls an AWS Lambda function, which in turn calls the Amazon Comprehend endpoint to return the label & confidence score.
  • The instances with low confidence scores are sent to a human for review, labeled as implicit feedback in the image above. There can be custom feedback as well, which is captured as explicit feedback.
  • The human-annotated dataset is then fed into the retraining workflow, the retrained model is tested for its performance, and the steps are repeated until the desired KPI values are achieved.

For a detailed implementation pipeline, you can refer to the AWS blog.

This is from the AWS SageMaker examples repository. Currently, it’s in the form of a notebook and creates the resources required for an automated labeling workflow for a text-classification labeling job. Please go through the above link for a detailed understanding of the workflow.

Wrapping up!

Active learning in the context of machine learning is all about labeling data dynamically and incrementally, with the help of human labelers, during the training iterations, so that the algorithm can detect which instances from an unlabelled pool would be the most informative for it to learn from.

The application of active learning is bringing substantial savings in annotation cost as well as performance improvements in certain niche NLP & computer vision models. A lot of interest and research is now happening in the domain of active learning. But it’s also a fact that only a few ready-for-deployment active learning pipelines are available out there. And even when they are utilized, quite a good amount of customization is needed to adapt them to your specific use case.

So, depending upon your expertise and use case, you can either build one yourself by taking inspiration from the concepts and applications discussed above or utilize commercial tools.

Good luck with your Active Learning journey!


Arun C John

Versatile professional with 8+ years of cumulative and diverse experience across Data Science, Machine Learning, BFSI, Oil & Gas, and Automotive domains.