Researchers at MIT, the MIT-IBM Watson AI Lab, IBM Research, and elsewhere have developed a new technique for analyzing unlabeled audio and visual data that could improve the performance of machine-learning models used in applications such as speech recognition and object recognition. The work, for the first time, combines two self-supervised learning architectures, contrastive learning and masked data modeling, in an effort to scale machine-learning tasks such as event classification in unimodal and multimodal data without the need for annotation, thereby mimicking how humans understand and perceive our world.
“Most human knowledge is learned in a self-supervised way, because we don’t always get supervision signals, and we want to enable a machine-learning model to have the same ability,” says Yuan Gong, an MIT postdoc at the Computer Science and Artificial Intelligence Laboratory (CSAIL).
“Put another way, self-supervised learning often forms the basis of an initial model, because it can learn on huge amounts of unlabeled data. And then you can use classical, supervised learning or reinforcement learning to fine-tune the model to something particular if you want to,” says Jim Glass, a senior research scientist at MIT and a member of the MIT-IBM Watson AI Lab.
The technique, called the contrastive audio-visual masked autoencoder (CAV-MAE), is a type of neural network that can learn to extract and map meaningful latent representations from acoustic and visual data into a high-dimensional space by training on large datasets of 10-second YouTube audio and video clips. The researchers say the technique is more effective than previous approaches because it explicitly models the relationships between audio and visual data in a way that other methods do not.
Gong and Glass are joined on the research by MIT graduate students Andrew Rouditchenko and Alexander H. Liu, David Harwath PhD ’18 of the University of Texas at Austin, and MIT-IBM Watson AI Lab members Leonid Karlinsky and Hilde Kuehne. Kuehne is also affiliated with Goethe University Frankfurt. The method was recently presented at the International Conference on Learning Representations.
A joint and coordinated approach
CAV-MAE works by “learning by prediction” and “learning by comparison,” Gong says. The masked data modeling, or prediction, method takes a video along with its corresponding audio waveform, converts the audio to a spectrogram, and masks 75 percent of both. The unmasked data is tokenized, then fed into separate audio and visual encoders before entering a joint encoder/decoder, where the model is asked to recover the missing data. The difference between the resulting reconstructed prediction and the original audio-visual combination (the reconstruction loss) is then used to train the model for better performance. An example of this would be covering part of a video of a piano and part of a spectrogram of piano music, and then asking the model to determine the masked inputs. Unfortunately, this method on its own cannot capture the association between a video and audio pair, whereas contrastive learning can, though contrastive learning may discard some modality-unique information, such as the background in a video.
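The masking-and-reconstruction step described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors’ code: the real model uses learned transformer encoders and decoders, while here a trivial placeholder stands in, so only the masking and the reconstruction-loss bookkeeping are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(x, mask_ratio=0.75):
    """Mark a random 75 percent of patches (here, spectrogram time frames) as masked."""
    n = x.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep = rng.choice(n, size=n_keep, replace=False)
    mask = np.ones(n, dtype=bool)
    mask[keep] = False          # False = visible to the encoder, True = masked out
    return mask

# Toy "spectrogram": 16 time frames x 8 frequency bins
spectrogram = rng.normal(size=(16, 8))
mask = random_mask(spectrogram, mask_ratio=0.75)

# Placeholder "encoder/decoder": copies visible frames, predicts zeros for masked ones.
# A trained model would instead predict the masked content from the visible context.
reconstruction = np.where(mask[:, None], 0.0, spectrogram)

# Reconstruction loss: mean squared error, computed on the masked frames only
recon_loss = np.mean((reconstruction[mask] - spectrogram[mask]) ** 2)
print(f"masked {mask.sum()}/{len(mask)} frames, reconstruction loss = {recon_loss:.3f}")
```

Training lowers this loss by making the decoder’s predictions for the masked 75 percent match the original data.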
The goal of contrastive learning is to map representations that are similar close to each other. For example, the model will try to place video and audio data of different parrots close to each other, and farther away from video and audio pairs of guitars being played. As in masked autoencoding, the audio-visual pairs are passed into separate modality-specific encoders; however, the audio and visual components are kept separate within the joint encoder before the model performs pooling and computes the contrastive loss. In this way, contrastive learning tries to identify the parts of each audio clip or video that are most relevant to the other. For example, if a video shows someone speaking and the corresponding audio clip contains speech, the model will learn to associate the speaker’s mouth movements with the spoken words, and then adjust its parameters so that these inputs are represented close to each other. Finally, the CAV-MAE method combines the two techniques with multiple forward data streams, with masking as a first step, modality-specific encoders, and layer normalization so that the representation strengths are similar.
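The pull-together/push-apart behavior of the contrastive objective can be illustrated with a small InfoNCE-style loss over a batch of paired embeddings. This is a hedged sketch, not the paper’s implementation: the embeddings below are synthetic toy vectors, and the temperature value is just a common default.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(audio_emb, video_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss: matched audio-video pairs are
    pulled together; mismatched pairs within the batch are pushed apart."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature          # pairwise cosine similarities, scaled
    idx = np.arange(len(a))                 # i-th audio clip matches i-th video

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)           # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()       # negative log-prob of the true match

    # Averaged over both directions: audio->video and video->audio
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy batch: 4 audio-video pairs with 16-dim embeddings; matched pairs are similar
audio = rng.normal(size=(4, 16))
video = audio + 0.1 * rng.normal(size=(4, 16))

loss_matched = info_nce(audio, video)        # correct pairing: low loss
loss_shuffled = info_nce(audio, video[::-1]) # broken pairing: high loss
print(loss_matched, loss_shuffled)
```

In CAV-MAE this contrastive loss is combined with the reconstruction loss from the masked-autoencoder branch, so the model is trained on a weighted sum of the two objectives.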
“We [then] wanted to compare the proposed CAV-MAE with a model trained only with a masked autoencoder and a model trained only with contrastive learning, because we want to show that by combining masked autoencoder and contrastive learning, we can get some performance improvement,” says Gong, “and the results support our hypothesis that there’s obvious improvement.”
The researchers tested CAV-MAE, as well as their method without contrastive loss or a masked autoencoder, against state-of-the-art methods on audio-visual retrieval and audio-visual event classification tasks, using the standard AudioSet (20K and 2M) and VGGSound datasets — labeled, realistic short clips that can include multiple sounds. Audio-visual retrieval means the model is given the audio or visual component of a query pair and searches for the missing one; event classification involves identifying actions or sounds within data, such as a person singing or a car driving.
Overall, they found that contrastive learning and masked data modeling are complementary techniques. CAV-MAE was able to outperform previous techniques (with fully self-supervised pre-training) by about 2 percent on event classification performance against models with comparable computation and, more impressively, kept pace with or outperformed models with industry-level computational resources. The team’s model also ranked similarly to models trained with only contrastive loss. And surprisingly, the team says, incorporating multimodal data into CAV-MAE pre-training greatly improves the fine-tuning of single-modality representations via supervised learning (with some labeled data) and performance on audio-only event classification tasks. This suggests that, as with humans, multimodal information provides an additional “soft label” boost even for audio-only or visual-only tasks; for instance, it helps the model understand whether it is looking at an electric or an acoustic guitar — a richer supervision signal.
“I think people like the elegance of this model for combining information across the different audio and visual streams. It has the contrastive and the reconstruction loss, and compared to models that have been evaluated with similar data, it clearly does very well across a range of these tasks,” says Glass.
Building on this, “one special thing is that our model can do both classification and retrieval, which is not common,” adds Gong. “Before this work, these methods were used separately, but after this work, I see most audio-visual learning frameworks using contrastive loss and masked autoencoders together, either implicitly or explicitly.”
Bringing self-supervised audio-visual learning into our world
The researchers see their contribution of the contrastive audio-visual masked autoencoder (CAV-MAE) as an important milestone and a step forward for applications, which are increasingly moving from single-modality to multi-modality and which require or leverage audio-visual fusion. They hypothesize that one day it could be used for action recognition in realms such as sports, education, entertainment, motor vehicles, and public safety. It could also, one day, be extended to other modalities. At this time, the fact that “this only applies to audio-visual data may be a limitation, but we are targeting multimodal learning, which is a trend in machine learning,” says Gong. “As humans, we have multiple senses — we have smell, touch, many more things than just audio-visual. So when we try to build AI, we try to mimic humans in some way, not necessarily from the biological perspective, and this method could [potentially be] generalized to other unexplored modalities.”
As machine learning models continue to play an increasingly important role in our lives, such techniques will become increasingly valuable.
This research was supported by the MIT-IBM Watson AI Lab.