What is the difference between labeled and unlabeled data?

Machine Learning

Machine Learning Problem Overview


In this video from Sebastian Thrum he says that supervised learning works with "labeled" data and unsupervised learning works with "unlabeled" data. What does he mean by this? Googling "labeled vs unlabeled data" returns a bunch of scholarly papers on this topic. I just want to know the basic difference.

Machine Learning Solutions


Solution 1 - Machine Learning

Typically, unlabeled data consists of samples of natural or human-created artifacts that you can obtain relatively easily from the world. Some examples of unlabeled data might include photos, audio recordings, videos, news articles, tweets, x-rays (if you were working on a medical application), etc. There is no "explanation" for each piece of unlabeled data -- it just contains the data, and nothing else.

Labeled data typically takes a set of unlabeled data and augments each piece of that unlabeled data with some sort of meaningful "tag," "label," or "class" that is somehow informative or desirable to know. For example, labels for the above types of unlabeled data might be whether this photo contains a horse or a cow, which words were uttered in this audio recording, what type of action is being performed in this video, what the topic of this news article is, what the overall sentiment of this tweet is, whether the dot in this x-ray is a tumor, etc.

Labels for data are often obtained by asking humans to make judgments about a given piece of unlabeled data (e.g., "Does this photo contain a horse or a cow?") and are significantly more expensive to obtain than the raw unlabeled data.

After obtaining a labeled dataset, machine learning models can be applied to the data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted for that piece of unlabeled data.

There are many active areas of research in machine learning that are aimed at integrating unlabeled and labeled data to build better and more accurate models of the world. Semi-supervised learning attempts to combine unlabeled and labeled data (or, more generally, sets of unlabeled data where only some data points have labels) into integrated models. Deep neural networks and feature learning are areas of research that attempt to build models of the unlabeled data alone, and then apply information from the labels to the interesting parts of the models.

Solution 2 - Machine Learning

Labeled data, used by Supervised learning add meaningful tags or labels or class to the observations (or rows). These tags can come from observations or asking people or specialists about the data.

Classification and Regression could be applied to labelled datasets for Supervised learning.

Machine learning models can be applied to the labeled data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted. enter image description here

Unlabeled data, used by Unsupervised learning however do not have any meaningful tags or labels associated with it. enter image description here Unsupervised learning has more difficult algorithms than supervised learning since we know little to no information about the data, or the outcomes that are to be expected.

Clustering is considered to be one of the most popular unsupervised machine learning techniques used for grouping data points, or objects that are somehow similar.

Unsupervised learning has fewer models, and fewer evaluation methods that can be used to ensure that the outcome of the model is accurate. As such, unsupervised learning creates a less controllable environment as the machine is creating outcomes for us.

Picture courtesy of Coursera: Machine Learning with Python

Solution 3 - Machine Learning

There are many different problems in Machine Learning so I'll pick classification as a case in point. In classification, labelled data typically consists of a bag of multidimensional feature vectors (normally called X) and for each vector a label, Y which is often just an integer corresponding to a category eg. (face=1, non-face=-1). Unlabelled data misses the Y component. There are many scenarios where unlabelled data is plentiful and easily obtained but labelled data often requires a human/expert to annotate.

Solution 4 - Machine Learning

Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of that unlabeled data with meaningful tags that are informative. For example, labels might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, whether the dot in an x-ray is a tumor, etc.

Solution 5 - Machine Learning

We can say that labeled is that data which is well defined. Eg. Emails, IP addresses,etc. Whereas unlabeled data is something which is not properly defined. Eg. Nature patterns, migration patterns of birds, etc. Unlabeled data alone does makes any sense but labeled data alone can be understood.

Solution 6 - Machine Learning

In order to better answer your question, let's first define what is training data, "Training data just means the prepared data that's used to create a model."

Now let's define what is labeled or supervised learning: "The value you want to predict is actually in the training data." It means that each record from training data contains all the necessary information (features and target value as well).

Unlabeled or unsupervised learning: "The value you want to predict is not in the training data."

Side note: Both approaches are used, but it's fair to say that the most common approach is supervised learning.

Solution 7 - Machine Learning

In unlabeled data, there is no target value (dependent variable). We use unsupervised machine learning models to generate a target/dependent variable, which is basically grouping similar data together as clusters.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionbernie2436View Question on Stackoverflow
Solution 1 - Machine Learninglmjohns3View Answer on Stackoverflow
Solution 2 - Machine LearningNava BogateeView Answer on Stackoverflow
Solution 3 - Machine LearningJohn GreenallView Answer on Stackoverflow
Solution 4 - Machine LearningSouravi SinhaView Answer on Stackoverflow
Solution 5 - Machine LearningShashwat PandeyView Answer on Stackoverflow
Solution 6 - Machine LearningMuhammad Waqas DilawarView Answer on Stackoverflow
Solution 7 - Machine LearningKrishna GannamaneniView Answer on Stackoverflow