Data2vec 2.0 by Meta: the second time is faster

Meta’s Data2vec is an example of a generalist neural network that can use the exact same code to analyze sample data in different modalities—in this case, voice, text, and images—and make predictions on that data. Baevski et al.
What do you do once you've proven your point in neural networks?
One answer: make it faster.
On Tuesday, Meta Platforms, the owner of Facebook, Instagram, and WhatsApp, unveiled Data2vec 2.0, a revamp of a neural network presented earlier this year that behaves as a kind of generalist, performing tasks on text, image, and speech data with the same basic approach for all three.
The second time around, the Meta scientists have made the program faster and, in some cases, more accurate on benchmark machine learning tasks.
“Data2vec 2.0 shows that the training speed of self-supervised learning can be substantially improved without losing downstream task accuracy,” write authors Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli, four of the authors of the original Data2vec paper, in the new work, “Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language,” posted on arXiv.
The singular achievement of this second Data2vec is to cut the time it takes to train the network. Training a neural network is usually measured in “epochs,” that is, the number of complete passes the network makes through the training examples. It can also be measured in wall-clock time, the literal hours, minutes, and days counted from start to finish.
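To make the two yardsticks concrete, here is a minimal sketch in Python; `model`, `loader`, and `train_step` are hypothetical placeholders, not Data2vec code:
```python
import time

def train(model, loader, num_epochs):
    # Epochs count passes over the data; wall-clock time counts hours.
    start = time.perf_counter()
    for epoch in range(num_epochs):   # each epoch = one full pass
        for batch in loader:          # over every training example
            model.train_step(batch)   # hypothetical update step
    hours = (time.perf_counter() - start) / 3600
    print(f"{num_epochs} epochs took {hours:.1f} wall-clock hours")
```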
“Experiments show that Data2vec 2.0 can achieve the same accuracy as many existing popular algorithms at 2-16 times the training speed,” they write.
The name Data2vec is a play on the name of a language “embedding” program developed at Google in 2013, called Word2vec. That program predicted how words cluster together, so Word2vec is representative of a neural network designed for one specific type of data, in that case text.
In the case of Data2vec, however, Baevski and colleagues take a neural network called the Transformer, developed by Ashish Vaswani and colleagues at Google in 2017, and extend it to work with multiple data types. The same neural network structure can serve to train on all three (image, speech, and text) without being altered to suit the particularities of any one of them, which is what makes it a generalist program.
Baevski and colleagues extend the Transformer to what is called “self-supervised” learning. In the self-supervised setting, a neural network is trained by passing through multiple stages whose outputs are compared with one another.
First, the network compresses a sample of data, which is known as constructing a representation of the input data. Then a second version of the network has some of those input data elements “masked,” left unrevealed. It has to reconstruct the representation that the first version of the network built, forcing the second network to form a better model of how the data fits together by essentially filling in the blanks.
The two networks, the one holding the compressed representation of the complete, unmasked input data and the one with the incomplete version it is trying to complete, are called, sensibly enough, the Teacher and the Student. The Student network tries to develop its sense of the data, so to speak, by reconstructing what the Teacher has already achieved despite the masking.
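In rough outline, that training loop can be sketched in PyTorch. Everything below is a simplified illustration of the general teacher-student scheme rather than Meta's actual implementation; the encoder, masking function, loss choice, and EMA decay value are all assumptions:
```python
import torch

def ema_update(teacher, student, decay=0.999):
    # The teacher's weights track an exponential moving average of the
    # student's, a common choice in teacher-student self-supervision.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1.0 - decay)

def train_step(student, teacher, decoder, batch, mask_fn, optimizer):
    # 1. The teacher sees the full, unmasked input and builds the
    #    target representation; no gradients flow into the teacher.
    with torch.no_grad():
        target = teacher(batch)

    # 2. The student sees a masked copy of the same input and, through
    #    the decoder, tries to reconstruct the teacher's representation.
    prediction = decoder(student(mask_fn(batch)))

    # 3. The loss is the gap between prediction and target: filling in
    #    the blanks at the level of representations, not raw inputs.
    loss = torch.nn.functional.mse_loss(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
    return loss.item()
```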
This time around, the authors made two key changes to Data2vec to make it faster: using “convolutions” and “amortizing” the compressed representations of the teacher network.
On the first score, the student network, which has to predict the teacher's representations, no longer uses the part of the Transformer called the decoder to do so.
That is the standard approach for uncompressing, in a sense, the teacher network's compressed representations. Instead, the authors use what are called convolutional neural networks, a foundational tool for representing data samples in compressed form, and one much older than the Transformer. It's a good example of how older technology can stick around in a field.
“Instead of using a transformer-based decoder, we used a smaller convolutional decoder, which we found to be easier and faster to train,” they write.
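As an illustration of that design choice, a decoder of this kind can be as simple as a stack of one-dimensional convolutions. The layer widths, kernel size, and depth below are made up for the sketch and are not the paper's configuration:
```python
import torch.nn as nn

class ConvDecoder(nn.Module):
    # A small convolutional decoder: it maps the student's sequence of
    # hidden vectors back to the dimensionality of the teacher's target
    # representations. All dimensions here are illustrative only.
    def __init__(self, dim=768, hidden=384, kernel=3, layers=2):
        super().__init__()
        blocks, in_dim = [], dim
        for _ in range(layers):
            blocks += [
                nn.Conv1d(in_dim, hidden, kernel, padding=kernel // 2),
                nn.GELU(),
            ]
            in_dim = hidden
        self.net = nn.Sequential(*blocks)
        self.proj = nn.Conv1d(hidden, dim, 1)  # back to target dimension

    def forward(self, x):
        # x: (batch, sequence, dim); Conv1d expects (batch, dim, sequence)
        x = x.transpose(1, 2)
        x = self.proj(self.net(x))
        return x.transpose(1, 2)
```
A convolutional stack like this is cheaper per step than a Transformer decoder because it attends only to a local neighborhood rather than the whole sequence, which is consistent with the authors' claim that it is easier and faster to train.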
For the second change, instead of repeatedly computing a compressed representation in the teacher network, the new Data2vec computes the representation only once. It then reuses it as the target, the thing to be guessed, for each of the masked versions of the data.
As the authors say, “To amortize the cost of calculating the teacher model, we reuse the teacher representation for multiple masked versions of the training sample.
“Specifically, we consider M different masked versions of the training sample and calculate the loss with respect to the same target representation.”
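The saving is easy to see in code: the expensive teacher forward pass runs once, and the loss is accumulated over M cheap student passes. This is a sketch following the quoted description, with hypothetical names, not Meta's code:
```python
import torch

def amortized_step(student, teacher, decoder, sample, mask_fn, M=8):
    # One teacher pass produces the target representation...
    with torch.no_grad():
        target = teacher(sample)

    # ...which is then reused as the prediction target for M different
    # masked versions of the same training sample.
    loss = 0.0
    for _ in range(M):
        masked = mask_fn(sample)        # a fresh random mask each time
        pred = decoder(student(masked))
        loss = loss + torch.nn.functional.mse_loss(pred, target)
    return loss / M
```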
The architecture of Data2vec 2.0. This time, Meta has replaced the second part of the program, which had been a Transformer-based decoder, with a decoder based on convolutional neural networks, an older technology. It also reuses the teacher network's compressed representations as a single target for multiple masked versions of the data fed to the student network. Baevski et al. 2022
In the results section of the paper, Baevski and his team report how they reduced training time and improved accuracy in the three domains of image recognition, speech recognition, and natural language processing.
For image processing, the authors use Data2vec to pre-train what is called ViT, the “Vision Transformer,” a neural network designed specifically for vision tasks that was introduced last year (PDF) by Alexey Dosovitskiy and colleagues at Google. In the terms of the literature, Data2vec provides the pre-training upon which ViT is then fine-tuned.
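Schematically, that pre-train-then-fine-tune relationship looks like the following; the class name, feature width, and pooled-output shape are hypothetical stand-ins:
```python
import torch.nn as nn

class FineTunedViT(nn.Module):
    # Hypothetical sketch: take a backbone whose weights came from
    # Data2vec-style pre-training, add a fresh classification head for
    # ImageNet's 1,000 classes, and train the whole thing on labels.
    def __init__(self, pretrained_backbone, num_classes=1000):
        super().__init__()
        self.backbone = pretrained_backbone       # weights from pre-training
        self.head = nn.Linear(768, num_classes)   # new, randomly initialized

    def forward(self, images):
        features = self.backbone(images)  # assumed (batch, 768) pooled output
        return self.head(features)
```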
As with the January results, the Data2vec-based ViT once again outperformed the other neural networks used as the foundation for ViT in terms of accuracy on ImageNet, the classic test of assigning labels to images, and it also outperformed the previous version of Data2vec.
But beyond accuracy, the new Data2vec took far fewer epochs to train. The previous Data2vec required 800 epochs; this time, that was cut to 150. And pitted against a competing self-supervised network, Masked Autoencoders, or MAE, another Meta creation (PDF), training dropped from 1,600 epochs to 100, even as the new Data2vec's accuracy surpassed MAE's. The faster training regimen translates into a large reduction in absolute training time: just 66 hours for Data2vec 2.0 versus 113.6 hours for MAE.
In speech recognition, the task is to fill in the missing parts of a snippet of audio from a spoken phrase. The new Data2vec went up against multiple competing neural networks for speech, including the original Data2vec and programs called Wav2vec, HuBERT, and WavLM. Data2vec 2.0 didn't beat those networks' best scores outright, but it “achieves higher accuracy than other models at a faster training time.” For example, in 43 hours of training, Data2vec 2.0 reaches an accuracy that took the original Data2vec 57 hours.
In the third area, natural language processing, Data2vec 2.0 was tested across the spectrum of challenges that make up the General Language Understanding Evaluation benchmark, known as GLUE, developed at the Courant Institute of Mathematical Sciences at New York University in 2019.
In one test, the network has to predict whether one sentence follows logically from another (textual entailment), while another representative task challenges the network to label a sentence as grammatically acceptable or not.
Up against the original Data2vec, plus two Transformer-based programs, Google's BERT and a revised version called RoBERTa, introduced in 2019 by the Paul G. Allen School of Computer Science at the University of Washington and Meta, Data2vec 2.0 posts strong GLUE scores while being faster to train.
The average accuracy score across all GLUE tasks for the new version is 82.6, just a fraction below the original Data2vec's 82.7, but higher than BERT's 81.2 and RoBERTa's 82.5. And Data2vec 2.0 takes just 28.2 hours to reach that level, less than half the 69 hours the original Data2vec needed and much less than RoBERTa's 50.5 hours.
Baevski and his team write that they will extend Data2vec in the future to forms of data beyond speech, image, and text, raising the possibility that it could become even more of a generalist.
One limitation is likely to remain in place. As with the original Data2vec, version 2.0 still handles each type of data differently when it is first fed into the network during training. That means Data2vec hasn’t yet developed a completely generic way of handling data types.
Image, speech, and text are each prepared by pre-processing the data. In that way, the multimodal aspect of the network still relies on clues about the data type, what the team refers to as “small modality-specific input encoders.”
In addition, the teacher network's compressed encodings are created separately for each of the three data types. There is not yet an ability to create a kind of “supercoding” that combines all the data types at once into a single representation.
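That limitation can be pictured in code: one shared backbone, but a separate, hand-built encoder in front of it for each modality, and no single representation spanning all three. This is again a hypothetical sketch; the encoder choices and sizes are assumptions, not Meta's architecture:
```python
import torch.nn as nn

class MultiModalSketch(nn.Module):
    # The Transformer backbone is shared, but each modality still needs
    # its own small input encoder, and each produces its own separate
    # representation. All layer choices here are illustrative.
    def __init__(self, backbone, dim=768):
        super().__init__()
        self.backbone = backbone  # one shared Transformer for all three
        self.encoders = nn.ModuleDict({
            "image": nn.Conv2d(3, dim, kernel_size=16, stride=16),   # patches
            "speech": nn.Conv1d(1, dim, kernel_size=10, stride=5),   # waveform
            "text": nn.Embedding(50000, dim),                        # tokens
        })

    def forward(self, data, modality):
        tokens = self.encoders[modality](data)            # modality-specific
        if tokens.dim() == 4:                             # image: (B, D, H, W)
            tokens = tokens.flatten(2).transpose(1, 2)
        elif modality == "speech":                        # (B, D, T)
            tokens = tokens.transpose(1, 2)
        # Text embeddings are already (B, T, D). There is no shared
        # "supercoding" fusing the modalities into one representation.
        return self.backbone(tokens)
```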
And so, just like with Data2vec 1.0, a neural network that really could be one network to rule them all is still the technology of the future.
As with the original Data2vec, Meta has posted the code for Data2vec 2.0 on GitHub.