Deep Neural Networks in Text Classification using Active Learning

Mirsaeid Abolghasemi
13 min read · Nov 1, 2020

Neural networks (NNs) and natural language processing (NLP) play an increasingly important role in modern life. NLP extracts meaning from text in applications such as translation, speech recognition, auto-captioning, and search, and text classification is a central task within it. Active Learning (AL) is a machine learning setting in which the model (the learning algorithm) can query a human operator to label data while training is in progress. With Active Learning, we can improve the performance of neural-network text classifiers with the same amount of data, or even with less.

This survey reviews text classification using Active Learning with deep neural networks (DNNs). Two main challenges are discussed:

  1. Neural networks' inability to produce dependable uncertainty estimates
  2. The difficulty of training deep neural networks on small datasets.

This survey also reviews a categorization of Active Learning query strategies, which groups sample selection into three classes based on data, model, and prediction. Finally, current projects and research in text classification using Active Learning are discussed, along with future directions, improvements, and suggestions.

1 Introduction

Data is the fuel of machine learning applications and has thus steadily grown in importance. Many environments generate large amounts of unlabeled data, but supervised machine learning requires labels. Obtaining them typically means a manual annotation procedure that is often tedious and may require a domain specialist, for example when classifying patents or medical documents. It also takes time and quickly raises financial costs, making this approach impractical at scale. Even when an expert is available, the sheer size of new datasets makes it infeasible to label all the data. This particularly affects natural language processing (NLP), which can require enormous datasets with a large amount of text in each record. Active Learning (AL) aims to reduce the volume of data the human expert must annotate. It is an iterative cycle between an active learner and an oracle, the human annotator. Unlike passive learning, where the data is merely supplied to the algorithm, the active learner determines which items will be labeled next.

The labeling itself, however, is still carried out by a human specialist, the human in the loop. After obtaining new labels, the active learner trains a new model and starts the process again from the beginning. In this survey, the term active learner denotes the combination of a model, a query strategy, and a stopping criterion: the model is a text classifier, the query strategy selects the next instances, and the stopping criterion determines when to end the Active Learning cycle.

There are three main scenarios for Active Learning:

  1. Pool-based: the learner has access to a closed collection of unlabeled instances, known as the pool.
  2. Stream-based: the learner decides, one instance at a time, whether to query or discard it.
  3. Membership query synthesis: the learner generates new artificial instances to be labeled. When a pool-based setup operates on a batch of instances rather than a single one, it is called batch-mode Active Learning.

This survey assumes a pool-based batch-mode setting, since the dataset is normally closed in text classification and batching reduces the number of retraining steps, which would otherwise cause long waits for the user. Existing Active Learning surveys, however, are incomplete in some areas and obsolete in others: they do not cover current state-of-the-art models, report no results on the latest larger datasets, and above all offer nothing on NNs and modern text representations.

Interestingly, although NNs are ubiquitous, there is little research on NN-based Active Learning in NLP, and even less for text classification.

The following may be the reasons for it:

  1. Most DL models need a huge amount of data, which contrasts sharply with Active Learning, which aims to need as little data as possible.
  2. Many Active Learning approaches rely on generating data, which is inevitably much harder for text than for, say, images, where data augmentation is widely used in classification tasks.
  3. NNs lack reliable uncertainty estimates, which makes a leading class of query strategies harder to apply.

2 Active Learning

The purpose of Active Learning is to build a model using the lowest possible number of labeled instances, i.e., to minimize the interaction between the oracle (the human annotator) and the active learner.

Figure 1 shows the Active Learning process which is:

  • Step 1: The active learner selects unlabeled instances according to the chosen query strategy (query).
  • Step 2: The active learner passes these unlabeled instances to the oracle.
  • Step 3: The oracle labels the instances and returns them to the active learner (update).

The active learner’s model is retrained after each update step, which makes each iteration as expensive as training the underlying model.

This cycle repeats until a stopping criterion is met, for instance when the number of iterations reaches a maximum or the classification accuracy changes only minimally.
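The query–label–update cycle described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in (the memorizing "model", the length-based "oracle", and the random query strategy are all made up for illustration), assuming the pool-based batch-mode setting this survey focuses on:

```python
import random

def active_learning_loop(pool, oracle, train, query_strategy,
                         batch_size=2, max_iterations=3):
    """Pool-based batch-mode Active Learning loop (illustrative sketch)."""
    labeled = {}                          # instance -> label
    model = train(labeled)                # initial model
    for _ in range(max_iterations):       # stopping criterion: iteration budget
        unlabeled = [x for x in pool if x not in labeled]
        if not unlabeled:                 # stopping criterion: pool exhausted
            break
        k = min(batch_size, len(unlabeled))
        batch = query_strategy(model, unlabeled, k)   # query step
        for x in batch:                   # the oracle labels the batch
            labeled[x] = oracle(x)
        model = train(labeled)            # update step: retrain the model
    return model, labeled

# Toy stand-ins: the "oracle" labels a text by its length.
pool = ["spam offer now", "hi", "meeting at noon tomorrow", "ok",
        "free money claim prize"]
oracle = lambda text: "long" if len(text.split()) > 2 else "short"
train = lambda labeled: dict(labeled)            # "model" just memorizes labels
query = lambda model, unlabeled, k: random.sample(unlabeled, k)  # random baseline
model, labeled = active_learning_loop(pool, oracle, train, query)
print(len(labeled))
```

Batching matters because `train` is called only once per batch rather than once per instance, which is what keeps the user's waiting time tolerable.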

Figure 1: The Active Learning process

The Active Learner box in Figure 1 shows its key parts: the model, the query strategy, and an optional stopping criterion. The central component is the query strategy, which is most commonly uncertainty-based.

2.1 Query Strategies

Figure 2 classifies the most common Active Learning query strategies by the input information a strategy uses. In this study, the input information falls into four categories:

  1. Random
  2. Data-Based
  3. Model-Based
  4. Prediction-Based

These categories build on each other and are not mutually exclusive: the model depends on the data, and the prediction depends on both model and data, so in many cases a strategy uses several of these inputs. In such cases, the query strategy is assigned to the most specific category.

Figure 2: Categorization of query strategies for Active Learning.

At the first level, query strategies are distinguished by the type of information they access. From the second level to the penultimate one we form coherent subclasses, and the last level shows examples for each category. Owing to the sheer number of existing query strategies, and because it focuses on NLP, this categorization is not exhaustive.

Random: Randomness is typically used as a baseline in many tasks. Random sampling chooses instances at random and is a strong baseline for Active Learning instance selection. It can even be competitive with more advanced strategies, particularly once the labeled pool has grown.

Data-based: Data-based strategies use the lowest level of information, i.e., they operate only on the raw input data and, optionally, the labels of the labeled pool. They are categorized into:

1. Data uncertainty: strategies that rely on uncertainty in the data. They may use input information about:

1.1 The data distribution

1.2 The label distribution

1.3 Label correlation

2. Representativeness: a geometrically compact collection of points needs only a few representative instances to describe the whole set.
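One common way to make representativeness concrete, not taken from the survey itself but a standard heuristic, is greedy k-center selection: repeatedly pick the point farthest from all chosen representatives, so that every instance ends up close to some selected one. A minimal sketch on toy 2-D points:

```python
import math

def greedy_k_center(points, k):
    """Greedily pick k representatives so that every point lies near
    some chosen center (a simple representativeness heuristic)."""
    centers = [points[0]]                        # arbitrary first center
    while len(centers) < k:
        # choose the point farthest from its nearest selected center
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers

# Two tight clusters plus an outlier; 3 centers cover all regions.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(greedy_k_center(points, 3))
```

Note how the two near-duplicate points in each cluster are never both selected: compact regions need only one representative, which is exactly the intuition behind this strategy class.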

Model-based: The model-based category of strategies uses not only the data but also the model. These methods assess instances through quantities of the model itself. A confidence estimate, for instance, indicates how well the model believes it can describe a given instance. It may also be an expected quantity, for instance the expected magnitude of the gradient.

Although predictions can still be computed from the model, we restrict the objective metric here to a (measured or expected) quantity of the model, excluding the final prediction. Model uncertainty, which uses the uncertainty of the model's weights, is a fascinating subclass here. This kind of uncertainty is also known as insufficient-evidence uncertainty.

Prediction-based: Prediction-based strategies rank instances by evaluating the outcomes of their predictions. Prediction-uncertainty and disagreement-based methods are the most prominent representatives. This kind of uncertainty is also known as conflicting-evidence uncertainty.

Often there is only a fine line between model-based and prediction-based uncertainty. Roughly speaking, model-based uncertainty concerns the model itself (e.g., its weights), while prediction-based uncertainty concerns the predicted class distribution. Unless stated otherwise, uncertainty sampling commonly refers to prediction-based uncertainty.
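The classic prediction-uncertainty strategy is least confidence (LC), which also appears among the query strategies in Table 1: score each instance by one minus its highest predicted class probability, and query the top-scoring ones. A self-contained sketch with made-up probability vectors:

```python
def least_confidence(probabilities, k):
    """Return the indices of the k instances whose predicted class
    distribution has the lowest maximum probability (least confidence)."""
    scores = [1.0 - max(p) for p in probabilities]    # uncertainty score
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Predicted class distributions for four unlabeled instances (toy values):
probs = [
    [0.95, 0.03, 0.02],   # confident
    [0.40, 0.35, 0.25],   # uncertain
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # most uncertain
]
print(least_confidence(probs, 2))  # indices of the two most uncertain
```

Only the final class distribution is inspected, which is what places this strategy in the prediction-based category rather than the model-based one.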

Ensembles: An ensemble is a query strategy that combines the outcomes of several other strategies.

  1. Ensembles can consist of basic query strategies.
  2. Ensembles may also be hybrids, i.e., combinations of query strategies from multiple categories. The outcome of an ensemble typically depends on the disagreement between the individual classifiers.
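A standard way to quantify the disagreement between the individual classifiers (not specific to this survey, but typical of query-by-committee methods) is vote entropy: the entropy of the label votes cast by a committee. A minimal sketch:

```python
import math
from collections import Counter

def vote_entropy(committee_votes):
    """Disagreement score for one instance: entropy of the label votes
    cast by a committee of classifiers (query-by-committee style)."""
    counts = Counter(committee_votes)
    total = len(committee_votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Three classifiers vote on two instances:
print(vote_entropy(["pos", "pos", "pos"]))  # full agreement: entropy 0
print(vote_entropy(["pos", "neg", "pos"]))  # disagreement: higher entropy
```

Instances with the highest vote entropy are the ones the committee disagrees about most, and therefore the ones queried next.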

2.2 Neural-Network-Based Active Learning

This section discusses why neural networks are not more common in Active Learning applications, with a focus on NLP techniques.

Two key themes emerge here:

  1. Uncertainty estimation in NNs
  2. The contrast between NNs, which require big data, and Active Learning, which deals with small data.

Uncertainty in Neural Networks: Uncertainty sampling is among the earliest and most widely used classes of query strategies. Unfortunately, it is not readily applicable to NNs, which lack an intrinsic measure of uncertainty. Earlier work addressed this through ensembling or the estimation of training errors; newer methods use Bayesian extensions, obtain uncertainty estimates via dropout, or use probabilistic NNs. Bayesian and ensemble methods, however, quickly become intractable on large datasets and NN designs, and NN confidence scores are commonly considered overconfident. Obtaining well-calibrated uncertainty estimates from NNs therefore remains an important open area of research.
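The dropout-based idea mentioned above can be illustrated without a deep learning framework: run a stochastic model several times on the same input, average the predicted distributions, and use the entropy of the average as an uncertainty estimate. The stochastic predictor below is a toy stand-in for a network with dropout kept active at inference time; its values are invented for illustration:

```python
import math
import random

def predictive_entropy(stochastic_predict, x, passes=200, seed=0):
    """Average the class distributions from several stochastic forward
    passes and return the entropy of the mean distribution."""
    rng = random.Random(seed)
    n_classes = len(stochastic_predict(x, rng))
    mean = [0.0] * n_classes
    for _ in range(passes):
        for i, p in enumerate(stochastic_predict(x, rng)):
            mean[i] += p / passes
    return -sum(p * math.log2(p) for p in mean if p > 0)

def toy_predict(x, rng):
    # Toy stand-in for a dropout network: each pass jitters the
    # prediction around a base distribution that depends on the input.
    base = 0.9 if x == "easy" else 0.55
    p0 = min(max(base + rng.uniform(-0.05, 0.05), 0.01), 0.99)
    return [p0, 1.0 - p0]

print(predictive_entropy(toy_predict, "easy"))  # low entropy
print(predictive_entropy(toy_predict, "hard"))  # higher entropy
```

In a real setting `toy_predict` would be replaced by a forward pass through the network with dropout enabled; the rest of the recipe stays the same.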

Contrasting Paradigms: DNNs are renowned for their performance on large datasets, and the availability of large volumes of data is a strict prerequisite for good results. Active Learning, in contrast, seeks to minimize the amount of labeled data needed. DNNs can be problematic here, since on limited data they are prone to overfitting, resulting in poor generalization on the test set. Moreover, DNNs sometimes offer only tiny gains over shallow models when trained on small datasets, at a much higher computational cost that is hard to justify. On the other hand, we clearly cannot ask Active Learning to label more data, as that would defeat its purpose. Research on (D)NNs with small datasets does exist, but compared to the vast volume of NN literature in general it is only a small fraction. Small datasets are usually circumvented by pre-training or other forms of transfer learning. Finally, the search for optimal hyperparameters is frequently skipped; instead, the hyperparameters of related work are reused, which were optimized on large datasets, if at all.

3 Active Learning for Text Classification

This section discusses recent advances in text classification and NNs.

3.1 Recent Advances in Text Classification

Representations: Classical methods use bag-of-words (BoW) representations, which are high-dimensional and sparse. More recently, word embeddings such as word2vec, GloVe, or fastText have replaced BoW representations.

The following may be the reasons:

  1. They capture semantic relations between vectors and avoid, for instance, the mismatch problems caused by synonyms.
  2. Incorporating word embeddings improved several downstream tasks.
  3. In contrast to bag-of-words, word vectors are low-dimensional and dense, which makes them suitable for a broader range of algorithms, especially NNs, which prefer fixed-size inputs. Various methods have been proposed to obtain equivalent fixed-size representations for word sequences such as sentences, paragraphs, or documents.
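The simplest of these fixed-size methods is averaging the word vectors of a sequence. The tiny 3-dimensional "embeddings" below are made-up values for illustration only:

```python
def average_embedding(tokens, embeddings, dim=3):
    """Fixed-size sentence representation: the mean of its word vectors.
    Out-of-vocabulary tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy 3-dimensional word vectors (illustrative values only):
emb = {
    "good":  [0.8, 0.1, 0.0],
    "great": [0.9, 0.2, 0.0],
    "bad":   [-0.7, 0.0, 0.1],
}
print(average_embedding(["good", "great"], emb))  # roughly [0.85, 0.15, 0.0]
```

Whatever the sentence length, the output always has `dim` components, which is exactly the fixed-size property that NN classifiers need.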

Neural-Network-Based Text Classification: The well-known KimCNN architecture uses pre-trained word vectors and a simple yet innovative architecture to achieve what was then state-of-the-art performance. In the CNN configurations tested, extensive hyperparameter tuning was unnecessary, and dropout was confirmed as an effective regularizer for CNN text classification.

3.2 Text Classification for Active Learning

Classic Active Learning for text classification focused heavily on prediction uncertainty and ensembling. Popular models included Support Vector Machines (SVMs), Naive Bayes, logistic regression, and neural networks. While Olsson has covered ensemble-based Active Learning for NLP in detail, according to recent research no prior survey has covered classical Active Learning for text classification. In current NN-based Active Learning for text classification, the applied models are mainly CNN- and LSTM-based deep architectures.

3.3 Commonalities and Limitations of Previous Experiments

Table 1 displays recent Active Learning studies for text classification, all more recent than the surveys by Settles and Olsson. The table summarizes the classification models and query strategy classes chosen in recent work.

Table 1: Text classification recent work on Active Learning.

Models in Table 1:

  • Naive Bayes (NB)
  • Support Vector Machine (SVM)
  • k-Nearest Neighbours (kNN)
  • Convolutional Neural Network (CNN)
  • [Bidirectional] Long Short-Term Memory ([Bi]LSTM)
  • FastText.zip (FTZ)
  • Universal Language Model Fine-tuning (ULMFiT).

Query strategies in Table 1:

  • Least confidence (LC)
  • Closest-to-hyperplane (CTH)
  • Expected gradient length (EGL)

Table 2: The short keys of a collection of widely-used text classification datasets.

Table 2 shows short keys for commonly used text classification datasets. The "Type" column shows the classification setting (B = binary, MC = multi-class, ML = multi-class multi-label). From Table 1, it is clear that the vast majority of query strategies belong to the prediction-uncertainty and disagreement-based subclasses.

4 The Survey Outcomes

Uncertainty Estimates in Neural Networks: Uncertainty-based strategies have been used successfully with NN models and were found to be the most important class of query strategies in recent NN-based Active Learning. Due to inaccurate uncertainty estimates and limited scalability, however, uncertainty in NNs remains challenging.

Representations: Text representations in NLP have progressed from bag-of-words to embeddings. These bring numerous benefits, including dense vectors, disambiguation capabilities, and accuracy improvements on several tasks. There is no AL-specific structured evaluation comparing word embeddings and language models with NNs, although some applications exist. They are also still rarely used, which suggests either slow adoption or practical problems that have not yet been investigated.

Small Data DNNs: DL methods are typically applied to large datasets, while Active Learning aims to keep the labeled set as small as possible. It was explained why small datasets challenge DNNs, and consequently DNN-based Active Learning. This dilemma is eased to some extent by pre-trained language models, since fine-tuning allows training with considerably smaller datasets. It was also analyzed how little data is actually needed to fine-tune a model.

Comparable Evaluations: A summary of the most popular Active Learning text classification techniques has been presented. Unfortunately, the combinations of datasets used across the studies are almost entirely disjoint. As a result, comparability between new and previous work is reduced or even lost. Comparability, however, is key to verifying whether previous findings from shallow NN-based Active Learning still apply to DNN-based Active Learning.

Learning to Learn: There are many query strategies, which were classified non-exhaustively. This raises the question of how to choose the best one. The right choice depends on several variables, such as the data, the model, or the task, and these vary across Active Learning processes. Learning to learn (meta-learning) has therefore become popular and could be used to learn the best choice, or even to learn query strategies in general.

5 Conclusions

In this study, text classification with (D)NN-based Active Learning was discussed, along with the factors that hinder its adoption. A taxonomy was built to distinguish query strategies by their input information: data-based, model-based, and prediction-based. The query strategies used in Active Learning for text classification were examined and assigned to the corresponding taxonomy classes. The intersection of Active Learning, text classification, and DNNs was presented; (D)NN-based Active Learning was analyzed, and open issues and the state of the art were identified. Related recent innovations in NLP were also compared to Active Learning, revealing deficiencies and constraints on their use. One of the key results is that uncertainty-based query strategies remain the most used class, even when the study is not limited solely to NNs. Representations based on language models provide richer, context-aware representations while handling out-of-vocabulary words. We also find that modern transfer learning alleviates the small-data challenge to some degree, but does not solve it. DNNs have shown promising results across many tasks and in initial adoptions in Active Learning, and it would be highly desirable to bring these benefits to Active Learning. Promoting the adoption of DNNs in Active Learning is therefore vital, particularly since the projected performance gains can be used either to improve classification with the same amount of data, or to make the labeling process more efficient by reducing the data, and hence the effort, needed to label it. Based on these results, research directions for future work were defined to drive (D)NN-based Active Learning forward. As discussed, learning to learn (meta-learning) has matured and become more common, and could be used to learn query strategies with higher performance.

6 Reference:

This article is a short summary of the following paper:

C. Schröder and A. Niekler, “A Survey of Active Learning for Text Classification using Deep Neural Networks,” arXiv.org, August 17, 2020. [Online]. Available: https://arxiv.org/abs/2008.07267 (Accessed: October 05, 2020).
