What is the difference between labeled and unlabeled data?

Machine Learning Problem Overview

In this video, Sebastian Thrun says that supervised learning works with "labeled" data and unsupervised learning works with "unlabeled" data. What does he mean by this? Googling "labeled vs unlabeled data" returns a bunch of scholarly papers on this topic. I just want to know the basic difference.

Machine Learning Solutions

Solution 1 - Machine Learning

Typically, unlabeled data consists of samples of natural or human-created artifacts that you can obtain relatively easily from the world. Some examples of unlabeled data might include photos, audio recordings, videos, news articles, tweets, x-rays (if you were working on a medical application), etc. There is no "explanation" for each piece of unlabeled data -- it just contains the data, and nothing else.

Labeled data typically takes a set of unlabeled data and augments each piece of that unlabeled data with some sort of meaningful "tag," "label," or "class" that is somehow informative or desirable to know. For example, labels for the above types of unlabeled data might be whether this photo contains a horse or a cow, which words were uttered in this audio recording, what type of action is being performed in this video, what the topic of this news article is, what the overall sentiment of this tweet is, whether the dot in this x-ray is a tumor, etc.

Labels for data are often obtained by asking humans to make judgments about a given piece of unlabeled data (e.g., "Does this photo contain a horse or a cow?") and are significantly more expensive to obtain than the raw unlabeled data.

After obtaining a labeled dataset, machine learning models can be applied to the data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted for that piece of unlabeled data.
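As a rough illustration of that last step, here is a minimal sketch in plain Python. It assumes a tiny made-up dataset (the height and weight numbers are hypothetical) and uses a 1-nearest-neighbour rule, which is only one of many possible models, to guess a label for a new unlabeled sample.

```python
import math

# Hypothetical labeled dataset: (feature vector, label) pairs.
# The features are made-up [height_m, weight_kg] measurements.
labeled_data = [
    ([1.6, 500.0], "horse"),
    ([1.7, 550.0], "horse"),
    ([1.4, 700.0], "cow"),
    ([1.5, 750.0], "cow"),
]


def predict_label(features, training_data):
    """Return the label of the closest training example (1-nearest-neighbour)."""
    nearest = min(training_data, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]


# A new, unlabeled sample: features only, no label attached.
new_unlabeled = [1.65, 520.0]
print(predict_label(new_unlabeled, labeled_data))  # prints "horse"
```

The labeled pairs play the role of the training set; the new sample has the same kind of features but no label, and the model fills in a likely one.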

There are many active areas of research in machine learning that are aimed at integrating unlabeled and labeled data to build better and more accurate models of the world. Semi-supervised learning attempts to combine unlabeled and labeled data (or, more generally, sets of unlabeled data where only some data points have labels) into integrated models. Deep neural networks and feature learning are areas of research that attempt to build models of the unlabeled data alone, and then apply information from the labels to the interesting parts of the models.
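One simple way to combine the two kinds of data is self-training: guess labels for the unlabeled points you are most sure about, add them to the labeled set, and repeat. The sketch below is only an illustration of that idea on made-up one-dimensional data; the function names and the `max_distance` threshold are assumptions for this example, not part of any particular library.

```python
def nearest_label(x, labeled):
    """Return (label, distance) of the labeled point closest to x."""
    value, label = min(labeled, key=lambda pair: abs(pair[0] - x))
    return label, abs(value - x)


def self_train(labeled, unlabeled, max_distance=1.5):
    """Self-training: repeatedly pseudo-label the unlabeled point that lies
    closest to an already labeled point, until none is close enough."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    while remaining:
        # Pick the unlabeled point we are most confident about.
        best = min(remaining, key=lambda x: nearest_label(x, labeled)[1])
        label, distance = nearest_label(best, labeled)
        if distance > max_distance:
            break  # everything left is too far away to label with confidence
        labeled.append((best, label))
        remaining.remove(best)
    return labeled, remaining


# Two hand-labeled points and several unlabeled ones (all values are made up).
seed = [(0.0, "a"), (10.0, "b")]
pool = [1.0, 2.0, 3.0, 9.0, 8.0, 20.0]

grown, leftover = self_train(seed, pool)
print(grown)
print(leftover)
```

Here the point 20.0 stays unlabeled because it is never within `max_distance` of any labeled point, while the others inherit a label from their neighbours.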

Solution 2 - Machine Learning

Labeled data, used in supervised learning, attaches meaningful tags, labels, or classes to the observations (or rows). These tags can come from direct observation or from asking people or specialists about the data.

Classification and regression can be applied to labeled datasets in supervised learning.

Machine learning models can be applied to the labeled data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted.

Unlabeled data, used in unsupervised learning, has no meaningful tags or labels associated with it. Unsupervised learning involves more difficult algorithms than supervised learning, since we know little or nothing about the data or about the outcomes that are to be expected.

Clustering is considered one of the most popular unsupervised machine learning techniques. It groups data points or objects that are somehow similar.
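To make the idea concrete, here is a minimal k-means sketch in plain Python. The points and the choice of starting centroids are made up for illustration, and k-means is just one of several clustering algorithms. Note that the input contains no labels at all.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    centroids = [list(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if no point was assigned to it
                centroids[i] = [sum(coord) / len(members) for coord in zip(*members)]
    return assignments, centroids


# Unlabeled 2-D points (hypothetical values): features only.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Start from two of the points as initial centroids (an arbitrary choice).
assignments, centroids = kmeans(points, centroids=[points[0], points[3]])
print(assignments)  # prints [0, 0, 0, 1, 1, 1]
```

The cluster numbers 0 and 1 carry no meaning of their own; the algorithm only discovers that the points fall into two groups.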

Unsupervised learning has fewer models and fewer evaluation methods that can be used to check that the outcome of the model is accurate. As such, unsupervised learning creates a less controllable environment, because the machine is creating the outcomes for us.

Picture courtesy of Coursera: Machine Learning with Python

Solution 3 - Machine Learning

There are many different problems in machine learning, so I'll pick classification as a case in point. In classification, labelled data typically consists of a bag of multidimensional feature vectors (normally called X) and, for each vector, a label Y, which is often just an integer corresponding to a category, e.g. (face=1, non-face=-1). Unlabelled data lacks the Y component. There are many scenarios where unlabelled data is plentiful and easily obtained, but labelled data often requires a human or expert to annotate it.
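In code, the X and Y described above might look like the following sketch. The feature values are invented for illustration; only the face=1 / non-face=-1 encoding comes from the description.

```python
# Feature vectors (X): one row per sample. The numbers are hypothetical
# measurements extracted from image patches.
X = [
    [0.9, 0.1, 0.4],
    [0.2, 0.8, 0.5],
    [0.7, 0.3, 0.6],
]

# Labels (Y): one integer per row of X, with face = 1 and non-face = -1.
Y = [1, -1, 1]

# Labelled data pairs every feature vector with its label.
labeled_dataset = list(zip(X, Y))

# Unlabelled data is the same kind of feature vectors without the Y component.
unlabeled_dataset = X

for features, label in labeled_dataset:
    print(features, "->", label)
```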

Solution 4 - Machine Learning

Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of that unlabeled data with meaningful tags that are informative. For example, labels might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, whether the dot in an x-ray is a tumor, etc.

Solution 5 - Machine Learning

We can say that labeled data is data that is well defined, e.g. emails, IP addresses, etc., whereas unlabeled data is something that is not properly defined, e.g. nature patterns, migration patterns of birds, etc. Unlabeled data alone does not make any sense, but labeled data alone can be understood.

Solution 6 - Machine Learning

To better answer your question, let's first define training data: "Training data just means the prepared data that's used to create a model."

Now let's define labeled data, or supervised learning: "The value you want to predict is actually in the training data." This means that each record in the training data contains all the necessary information (the features as well as the target value).

Unlabeled data, or unsupervised learning: "The value you want to predict is not in the training data."

Side note: Both approaches are used, but it's fair to say that the most common approach is supervised learning.

Solution 7 - Machine Learning

In unlabeled data, there is no target value (dependent variable). We use unsupervised machine learning models to generate one, which basically means grouping similar data points together into clusters.

Content Type	Original Author	Original Content on Stackoverflow
Question	bernie2436	View Question on Stackoverflow
Solution 1 - Machine Learning	lmjohns3	View Answer on Stackoverflow
Solution 2 - Machine Learning	Nava Bogatee	View Answer on Stackoverflow
Solution 3 - Machine Learning	John Greenall	View Answer on Stackoverflow
Solution 4 - Machine Learning	Souravi Sinha	View Answer on Stackoverflow
Solution 5 - Machine Learning	Shashwat Pandey	View Answer on Stackoverflow
Solution 6 - Machine Learning	Muhammad Waqas Dilawar	View Answer on Stackoverflow
Solution 7 - Machine Learning	Krishna Gannamaneni	View Answer on Stackoverflow