One way to think about deep learning is in terms of representation learning (or feature learning). Compared to other AI methods, where we handcraft the representations fed to our algorithms, deep learning lets us learn representations automatically from data. But the currently popular deep learning algorithms rely on manually labeled datasets to learn those representations (supervised learning).
Now a question arises: “How can we make algorithms learn representations from unlabeled data?” Self-Supervised Learning (SSL, for short) is a way of learning representations without human supervision, i.e., from unlabeled data. Cool, now you get what self-supervised learning is.
Now another question arises: “What are we going to do with those learned representations?” We can take the model that has learned them and further train it on the downstream task we care about. People sometimes refer to the self-supervised learning task as a pretext task. This is similar to transfer learning, except that instead of pre-training the model on labeled data, we pre-train on unlabeled data via SSL.
“How can this be helpful?” Self-supervised pre-training is well suited to tasks with little labeled data. A pre-trained model can also converge faster and perform better than a model that hasn’t been pre-trained.
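To make the pre-train/fine-tune workflow concrete, here is a minimal PyTorch-style sketch. It assumes we already have an encoder pre-trained with some SSL pretext task; the encoder architecture, head, and hyperparameters are placeholders for illustration, not any specific method.

```python
import torch
import torch.nn as nn

# A toy encoder; assume it has already been pre-trained on unlabeled data
# with a self-supervised pretext task (masked prediction, contrastive, ...).
encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Downstream task: attach a small head and fine-tune on the few labels we have.
num_classes = 10
model = nn.Sequential(encoder, nn.Linear(64, num_classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    """One supervised fine-tuning step on the labeled downstream data."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```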
Approaches to SSL:
There are three main approaches to self-supervised learning.
1. Self-Prediction — We mask some part of a data sample and train the model to predict the masked part accurately (a minimal sketch follows this list).
2. Contrastive Learning — Our goal here is to learn representations such that positive sample pairs lie close to each other and negative sample pairs lie far apart in the latent space (a loss sketch also follows this list).
3. Non-Contrastive Learning — Our goal here is to learn representations such that positive sample pairs lie close to each other in the latent space, without needing any negative sample pairs to train the system.
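To make the self-prediction idea concrete, here is a minimal PyTorch-style sketch that masks random positions of a flattened input and trains a model to reconstruct them. The architecture and masking ratio are arbitrary choices for illustration, not any specific published method.

```python
import torch
import torch.nn as nn

# Toy self-prediction setup: mask part of the input, predict the masked part.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 784))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def self_prediction_step(x, mask_ratio=0.5):
    """x: batch of flattened inputs, shape (batch, 784). No labels needed."""
    mask = (torch.rand_like(x) < mask_ratio).float()  # 1 = masked position
    corrupted = x * (1 - mask)                        # zero out masked parts
    reconstruction = model(corrupted)
    # Optimize the reconstruction only on the positions that were masked.
    loss = ((reconstruction - x) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```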
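And here is a small sketch of a contrastive objective in the spirit of InfoNCE/NT-Xent: two augmented views of the same sample form a positive pair, and the other samples in the batch serve as negatives. The temperature value and the `encoder`/`augment` names in the usage comment are assumptions for illustration. Non-contrastive methods (e.g., BYOL, DINO) keep only the positive-pair term and avoid collapse with tricks such as momentum encoders and stop-gradients instead of negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss. z1, z2: embeddings of two augmented views of the
    same batch, shape (batch, dim); row i of z1 and row i of z2 form a
    positive pair, and every other row in the batch acts as a negative."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                     # pairwise cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Usage (names are placeholders): loss = contrastive_loss(encoder(augment(x)), encoder(augment(x)))
```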
Success of Self-Supervised Learning:
Self-supervised pre-training has had long-standing success in NLP, and more recently in speech recognition and computer vision.
I picked a particular method called DINO to show you how powerful SSL can be. DINO combines self-supervised pre-training with Vision Transformers (ViT) to learn visual representations. DINO can segment objects from the background without being given any segmentation-targeted objective during training.
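If you want to play with DINO yourself, the official repo exposes pretrained backbones through torch.hub; the entry-point names below (`facebookresearch/dino:main`, `dino_vits16`) reflect that repo at the time of writing and may change, so treat this as a sketch rather than a guaranteed API.

```python
import torch

# Load a ViT-S/16 backbone pre-trained with DINO (weights download on first call).
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

# Extract features for a batch of 224x224 images (ImageNet-style preprocessing assumed).
images = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    features = model(images)   # CLS-token embeddings, shape (4, 384) for ViT-S
print(features.shape)
```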
Future of SSL:
SSL methods differ across language, vision, and speech. But a recent framework called data2vec generalizes the pre-training method across those modalities, and it has been shown to perform better than its modality-specific counterparts. So we can expect further advances in generalizing self-supervised pre-training across all modalities, and there is also promise in self-supervised methods that learn representations from multi-modal data simultaneously.
In relation to Natural Intelligence:
An average human can learn to drive a car with about 20 hours of practice, yet a self-driving car still can’t drive reliably even after training on thousands of hours of data. So what is innate about humans, and animals in general, that allows us to learn so quickly? The answer, probably, is that humans have prior knowledge of how the world works. For example, if there is a cliff around the corner, we know that turning the wheel in that direction will kill us, while a self-driving system has to learn this concept of support/gravity from scratch. How can we make machines learn this prior knowledge of how our world works? SSL is one answer to this question. Humans actually acquire this background knowledge as babies. In their first few months, babies can only observe the world, yet they learn a ton about it.
Self-supervised learning is about making machines learn the way babies do.
The standard paradigm of machine learning is task-centric: we train our models just to perform a particular task. But if we look at how humans are brought up, they first learn general knowledge without supervision, and only then learn to perform specific tasks.
Further Resources:
[1] Self-supervised learning: The dark matter of intelligence
[2] NeurIPS 2021 Tutorial - Self-Supervised Learning: Self-Prediction and Contrastive Learning
[3] Energy-Based Learning Model talk by Yann LeCun
PS: Feel free to connect with me on Twitter, Medium, GitHub.