Expanding labeled datasets through correlated features

Juan Berner
Jul 22, 2019

The challenge of obtaining labeled data

One of the biggest challenges in developing a good classification model is usually obtaining a large amount of labeled data so that it can generalise properly in production. Once you have decided to create a model to detect anything from phishing attacks to fraud attempts, the realisation that you first need enough ‘labeled’ data can become a chicken-and-egg situation: if you already had plenty of labeled data, it would mean you were already catching the attacks through some other means.

An approach many organisations take when faced with this situation is to have analysts or third parties (such as Amazon’s Mechanical Turk, with privacy-protecting measures taken if needed) review the data and add the labels manually, a process which, depending on how much data you need correctly labeled, can make the whole project infeasible.

After dealing with this problem in the past, I’ve found it is possible to go from no labeled data (or a small amount) to a rich and generalised labeled dataset with minimal analyst work, through a process of extrapolation based on features of the dataset that are related to the labels we are interested in.

For example, if you had a model that classified phishing sites based on their HTML, you might find a feature correlated with a subset of phishing sites in their WHOIS information (such as the email address of whoever registered the site). The extrapolation process would let you find other phishing sites from the same owner, learn their HTML structure, and allow the model to predict other sites which are similar.

Expanding labels

If, for example, you want to correctly classify questions on travel.stackexchange.com as visa-related, you might go through many sample questions, adding a label when they relate to a visa permit. But with datasets of millions of entries, getting enough generalisation over how people might write about the topic can be daunting. This becomes especially difficult when dealing with hundreds of thousands of entries to classify into hundreds or thousands of labels. In many cases the approach taken is to build a model on an initial subset of labeled data and then use that model to find more data related to the label.

Creating a model to label more data in its dataset by generalising from the original labeled data

While this will expand the amount of labeled data available, it will not be as effective at improving predictions on real-world data. For example, if the model learned to look at questions containing the ‘visa’ keyword, expanding the dataset with other questions that have the same keyword will not teach it different ways of asking the same question (such as asking about a passport or a stay permit).

In short, you might expand the amount of labeled data, but not the vocabulary that refers to the label you are interested in.

Using correlated features to expand labeled data

A way to solve the problem of the model not generalising better is to expand the dataset using not only the questions themselves but also features which strongly correlate with the label we are interested in. If we create a model based on the answers to questions we know are related to the label, and then use it to find other questions that might relate to it, then unlike the previous example (where the model was built from already tagged questions) we can expand the vocabulary used to ask about the label beyond what was contained in the original labeled data.

Expanding a dataset by looking at features correlated with the label allows us to expand the vocabulary of the model

For example, suppose we had labeled a dataset by just looking at the word “visa” (since we are interested in the ‘visa’ topic) and then used the answers of those questions to find other answers such as:

penalties are levied via a fine the fine ranges from around 1000 ntd (if you overstay by an hour up to 10000 ntd source (page 5 http //iff immigration gov tw/public/data/11714474471 pdf

A model using those answers to find other questions on the same topic might learn something outside our initial queries: that discussions about visas include answers about possible penalties. Building a model on those answers to find questions we did not know were related to the visa topic can then surface questions such as:

can i visit uk with italian stay permit

This question might have no relationship with our original labeled dataset; we go from a generalisation of how people might use the word ‘visa’ in a sentence (and some supporting words) to discovering that people asking about a ‘stay permit’ might relate to the same topic.

While in this example the correlated feature is quite straightforward (the answer to the question), this is not always the case. Such a feature could be composed of several different ones which together help identify the label we care about.

In this situation, you would be expanding both the amount of labeled data and the vocabulary, improving the ability to generalise to real-world examples.

Ensuring the quality of the labeled data

One of the risks of creating models based on correlated features (or even of adding labels to data using a model built from an original set of labeled data) is that prediction errors get propagated and grow with each iteration of tagging more data with the model, since each new dataset contains more and more incorrectly labeled data. One option is to label only documents that have a very high probability of belonging to the class the model tries to predict, but that in turn means the model will learn little from the new labels. The approach I have found useful is a semi-supervised take on the problem, which expands the amount of labeled data while also learning new ways to predict the label.

Sampling clusters for acceptance of new labeled data

By clustering the data that was predicted as a particular label (for example, based on the answers), we can quickly assess whether the label is correct or incorrect by sampling elements in each cluster, and based on that decision apply the label to all the documents in the cluster. This allows us to select samples close to the classification boundary (which we can set with a threshold of our choice) and then decide whether or not to add them, in a process which could be considered semi-supervised.

This process ensures that every time we use a model, either to expand on the same vocabulary (by looking at questions) or to learn new ways of asking (by looking at questions whose answers are similar to those expected for the label), we evaluate the results by looking at samples of clusters to speed up the analysis

This means that by following the previous process we can not only expand our labeled dataset, but also improve our capacity to generalise to ways of asking about the label which were previously unknown and unrelated to the original labeled data. Clustering the newly found data means only a small sample of each cluster needs to be reviewed before accepting the whole cluster, which can dramatically speed up the process.

Keep in mind this will only work if the correlated feature you are using has a strong relationship to the label of the data. In other words, you would usually expect to be able to tell what the question was for a given answer if you were shown only the answer.

Case study: travel.stackexchange.com

Let’s work through a small example of this process, using the dataset for https://travel.stackexchange.com which is shared on the Internet Archive. For that I created a Jupyter Notebook, which you can find here.

Getting the dataset

First we will download the dataset and, after performing some cleaning, keep the id, the title of the question and the accepted answer. We will add a column for the classification (which starts as unknown for all rows) and another for who classified it.

Initial pandas dataframe without any labels
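As a rough sketch of what that loading step could look like (the actual notebook may differ), assuming the standard Stack Exchange dump layout where Posts.xml rows carry PostTypeId, Title, Body and AcceptedAnswerId attributes:

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Parse every <row> element from the Stack Exchange dump
rows = [el.attrib for _, el in ET.iterparse("travel_questions/Posts.xml")
        if el.tag == "row"]

# PostTypeId 1 = question, 2 = answer
questions = [r for r in rows if r.get("PostTypeId") == "1"]
answers = {r["Id"]: r.get("Body", "") for r in rows if r.get("PostTypeId") == "2"}

df = pd.DataFrame({
    "id": [q["Id"] for q in questions],
    "title": [q.get("Title", "") for q in questions],
    "accepted_answer": [answers.get(q.get("AcceptedAnswerId"), "") for q in questions],
})
df["classification"] = "unknown"  # label column, starts as unknown for every row
df["classified_by"] = ""          # who (or what) assigned the label
```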

Once we have the dataframe, we need some labeled data. To pretend we already have a labeled dataset, I will take a very naive approach to labeling the ‘visa’ topic (is the person asking about a travel visa?): label any question whose title contains ‘visa’ with the ‘visa_question’ label.

In this image we have added a label of “visa_question” to any row with a title containing the string “visa”. The displayed rows are all titles that should belong to the visa_question topic
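A minimal sketch of that seed-labeling rule, assuming the dataframe and column names from the previous sketch (the case-insensitive match is my choice, not necessarily the notebook’s):

```python
# Naive seed labels: any title containing the string "visa"
mask = df["title"].str.contains("visa", case=False, na=False)
df.loc[mask, "classification"] = "visa_question"
df.loc[mask, "classified_by"] = "keyword_rule"
print(df["classification"].value_counts())
```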

Building a model based on question titles

Since we have our initial labeled dataset, we can create a simple classification model that learns to differentiate between questions on the visa topic and those which do not belong to it, based on the content of the question title. For that I will use scikit-learn’s SGDClassifier (which with the default hinge loss trains a linear SVM).

The results of validation for a classification model trained to discover questions related to the “Visa” topic. The score is unrealistically high since we created it by labeling all questions with the ‘visa’ keyword in them.
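One possible way to build that title model is a TF-IDF plus SGDClassifier pipeline; the hyperparameters and split below are illustrative rather than the ones used in the notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Binary target: is this a visa question according to the seed labels?
X = df["title"].fillna("")
y = (df["classification"] == "visa_question").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

title_model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", SGDClassifier(loss="hinge", random_state=42)),  # linear SVM via SGD
])
title_model.fit(X_train, y_train)
print(classification_report(y_test, title_model.predict(X_test)))
```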

By now we have a model based on the question titles. This means we could attempt to use it to find more questions which might belong to the visa topic but were not labeled as such. As mentioned above, the problem with this approach is that we are limited to the vocabulary already contained in our original model. To avoid labeling incorrectly, we would require the classifier to predict an extremely high likelihood of a question belonging to the label.

Some documents which the classifier thought might also belong to the label. For example the first one was incorrectly labeled
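A sketch of that filtering step; since the hinge-loss classifier has no predict_proba, the decision_function margin stands in for “extremely high likelihood”, and the threshold of 1.0 is an illustrative choice:

```python
# Score the still-unlabeled titles and keep only the most confident candidates.
unlabeled = df[df["classification"] == "unknown"]
scores = title_model.decision_function(unlabeled["title"].fillna(""))
candidates = unlabeled[scores > 1.0]  # illustrative margin threshold
print(candidates["title"].head(10))
```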

Using correlated features to expand our vocabulary

One feature strongly correlated with the question title is the accepted answer. In this case we can expect that answers related to the visa_question label have similarities between them, which we could use to predict other questions related to the visa topic.

Some answers for questions classified as part of the “visa_question” label

If we look a bit into the answers for the topic we might find some patterns that we hope our models will learn. To achieve that, we perform a similar process as when we created a model based on the question titles, but this time focused on the accepted answers. This is a classification model that learns what the answers for the visa topic look like, so it can predict which other answers might belong to the same topic (and, by transitivity, their questions as well).

Results of the validation from a model created by looking at the answers for questions labeled as ‘visa_question’.
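The answer model can be sketched the same way, this time trained on the accepted answers of the questions labeled so far (reusing the dataframe, imports and pipeline setup from the earlier sketches):

```python
# Train the same kind of pipeline on the accepted answers of labeled questions.
answer_df = df[df["accepted_answer"].str.len() > 0]
X_ans = answer_df["accepted_answer"]
y_ans = (answer_df["classification"] == "visa_question").astype(int)
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_ans, y_ans, test_size=0.2, random_state=42, stratify=y_ans)

answer_model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", SGDClassifier(loss="hinge", random_state=42)),
])
answer_model.fit(Xa_train, ya_train)
print(classification_report(ya_test, answer_model.predict(Xa_test)))
```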

With this model available, we can now go through the questions we have not yet been able to label and tag them with the ‘visa_question’ label if the prediction of belonging to the class is above a threshold we determine.

While the model found these entries based on their answers, we can see that some of the questions relate to a visa even though the title did not contain language that would have allowed our initial model to predict it
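A sketch of that step, applying the answer model to the accepted answers of still-unlabeled questions and labeling those above an illustrative threshold:

```python
# Label questions we could not label from the title, based on their answers.
unknown = df[(df["classification"] == "unknown") & (df["accepted_answer"].str.len() > 0)]
ans_scores = answer_model.decision_function(unknown["accepted_answer"])
found = unknown[ans_scores > 1.0]  # illustrative threshold
df.loc[found.index, "classification"] = "visa_question"
df.loc[found.index, "classified_by"] = "answer_model"
print(found["title"].head(10))     # titles discovered only through their answers
```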

We can also explore in this case some of the new vocabulary we were able to learn based on the model created using the answers to questions.

In this case, through the answers we were able to learn that the phrase “layover flight between” might relate to questions about visa requirements
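One way to surface that new vocabulary (the notebook may do this differently) is to retrain the title pipeline on the expanded labels and inspect the n-grams with the largest positive weights:

```python
import numpy as np

# Refit the title pipeline on the expanded labels and look at the strongest n-grams.
title_model.fit(df["title"].fillna(""),
                (df["classification"] == "visa_question").astype(int))
terms = np.array(title_model.named_steps["tfidf"].get_feature_names_out())
weights = title_model.named_steps["clf"].coef_[0]
top = np.argsort(weights)[-20:][::-1]
print(terms[top])  # phrases such as "stay permit" may show up alongside "visa"
```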

Reducing errors when expanding our dataset

As mentioned above, the previous process can lead to incorrectly classifying data, with errors that keep growing with each iteration. To overcome this we can take the semi-supervised approach of clustering the results before applying a label to them. We can, for example, use a hierarchical clustering algorithm such as AgglomerativeClustering to cluster titles which might relate to each other, so that by sampling only a subset of them we can approve the whole batch of data to label. This does not remove the possibility of adding incorrectly labeled data, but it can reduce it enough to allow the model to improve even if some of the data is mislabeled.

Two clusters of messages found through the model trained on answers to questions. We could take random samples of each cluster and evaluate whether they relate to the visa label; if they do, add the label to all the questions in the cluster, otherwise discard them.
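A sketch of that clustering step, assuming the `found` dataframe of candidates from the earlier sketch; the number of clusters and the sample size shown per cluster are illustrative:

```python
from sklearn.cluster import AgglomerativeClustering

# Cluster the candidate titles so an analyst only reviews a few samples per cluster.
titles = found["title"].fillna("").tolist()
dense = TfidfVectorizer(min_df=1).fit_transform(titles).toarray()  # dense input for ward linkage
cluster_ids = AgglomerativeClustering(n_clusters=10).fit_predict(dense)

for c in range(10):
    members = [t for t, k in zip(titles, cluster_ids) if k == c]
    print(f"cluster {c} ({len(members)} titles): {members[:3]}")  # sample to review
```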

While this is a small example, a cluster might have thousands of documents which are all similar in nature, allowing an analyst to quickly review samples before making a decision. After some clusters are accepted, the process can be restarted as many times as needed until performance degrades, which could be measured by the cluster acceptance rate. Another benefit of sampling clusters for acceptance is that you can also label clusters as sub-topics, which can benefit from the same process of label expansion through the dataset.

If you want to check the Jupyter Notebook I used for this analysis, you can find it here. It requires downloading the archive for https://travel.stackexchange.com and placing the Posts.xml file under travel_questions/Posts.xml.

Considerations:

  • There was no attempt at finding the optimal hyperparameters or using the best possible model.
  • The dataset used is quite small; the approach described here would give better results with large datasets where multiple different topics need to be classified.
  • I doubt I am the first to use this strategy, so if you have examples of it being used, or papers about it, I would appreciate a comment with a reference to them!
