This AI model, DenseAV, learns language just by watching videos

AI model DenseAV learns the meanings of words and the locations of sounds without any human input
A representational image. — Unsplash

A new artificial intelligence (AI) model, DenseAV, has learned the meanings of different words and the locations of sounds without any human input, simply by watching videos, according to recent reports.

Researchers from Microsoft, Oxford, and Google explain in a paper how DenseAV achieves this through self-supervision on videos. The model uses audio-video contrastive learning: it compares pairs of audio and visual signals, learning to associate sounds with the visual content that appears alongside them while pushing apart signals that come from different clips.

DenseAV then scores these signals against each other to find matches, which lets it predict visual elements from audio cues. In this way it learns to understand language and recognise sounds without any labels.
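To make the idea concrete, here is a minimal sketch of that kind of audio-video contrastive objective, written in Python with PyTorch. It is an illustration under assumptions, not code from the DenseAV paper: the feature dimensions, the temperature value, and the simple symmetric cross-entropy loss are all stand-ins.

    # Minimal sketch of audio-video contrastive learning (hypothetical, not DenseAV's code).
    import torch
    import torch.nn.functional as F

    def contrastive_loss(audio_feats, video_feats, temperature=0.07):
        # audio_feats, video_feats: (batch, dim) clip-level embeddings;
        # index i of each tensor comes from the same video clip (a positive pair).
        a = F.normalize(audio_feats, dim=-1)
        v = F.normalize(video_feats, dim=-1)
        logits = a @ v.t() / temperature        # (batch, batch) similarity matrix
        targets = torch.arange(a.size(0))       # matched pairs sit on the diagonal
        # Pull matched audio-video pairs together and push mismatched pairs apart,
        # in both the audio-to-video and video-to-audio directions.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy usage with random "features" standing in for encoder outputs.
    audio = torch.randn(8, 256)   # e.g. from an audio backbone
    video = torch.randn(8, 256)   # e.g. pooled visual features from the same 8 clips
    print(contrastive_loss(audio, video))

The key point the sketch captures is that no labels appear anywhere: the only supervision is which audio track came with which video clip.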

How does it work?

The researchers focused on which pixels the model attends to when it hears a sound: if the model hears the sound of the word "cat", it should search for cats in the video. DenseAV handles both speech and natural sounds, so someone saying "cat" and a cat's meow both lead it to the animal on screen, letting the AI pick out the image of the cat within a single shot.
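Once such a model is trained, "finding the cat" amounts to comparing an audio embedding against the visual features at every location of a frame and reading off the strongest match. The sketch below illustrates that lookup; the feature shapes and the sound_to_heatmap helper are hypothetical, not the paper's implementation.

    import torch
    import torch.nn.functional as F

    def sound_to_heatmap(audio_feat, visual_feats):
        # audio_feat: (dim,) embedding of a sound, e.g. someone saying "cat" or a meow.
        # visual_feats: (dim, H, W) dense per-location visual features for one frame.
        # Returns an (H, W) map of how strongly each location matches the sound.
        a = F.normalize(audio_feat, dim=0)
        v = F.normalize(visual_feats, dim=0)
        return torch.einsum("d,dhw->hw", a, v)   # cosine similarity at every location

    # Toy usage: the brightest grid cell is where the model "sees" the sound's source.
    audio_feat = torch.randn(256)
    visual_feats = torch.randn(256, 14, 14)
    heat = sound_to_heatmap(audio_feat, visual_feats)
    y, x = divmod(int(heat.argmax()), heat.shape[1])
    print(f"strongest match at grid cell ({y}, {x})")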

Researchers describe DenseAV as having a “two-sided brain”: one side is highly attentive to language, while the other focuses on sounds. This dual focus allows DenseAV to learn what words mean and where sounds come from without human input.
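One way to picture the "two-sided brain" is as two parallel heads on top of a shared audio representation, each free to specialize, with their match scores combined. The sketch below is a loose, hypothetical rendering of that idea; the module name, the dimensions, and the summing of the two scores are assumptions for illustration, not DenseAV's actual architecture.

    import torch
    import torch.nn as nn

    class TwoHeadedAudioProjector(nn.Module):
        # Two projection heads over one audio embedding. Neither head is told
        # which is which during training; the model is free to route spoken
        # language through one head and other sounds (barks, meows, music)
        # through the other.
        def __init__(self, dim=256):
            super().__init__()
            self.language_head = nn.Linear(dim, dim)  # tends to specialize in speech
            self.sound_head = nn.Linear(dim, dim)     # tends to specialize in natural sounds

        def forward(self, audio_feat, visual_feat):
            # Each head scores the audio against the visual feature separately;
            # the overall match is the sum, so whichever head explains the
            # pairing better carries the signal.
            s_lang = (self.language_head(audio_feat) * visual_feat).sum(-1)
            s_sound = (self.sound_head(audio_feat) * visual_feat).sum(-1)
            return s_lang + s_sound

    model = TwoHeadedAudioProjector()
    score = model(torch.randn(4, 256), torch.randn(4, 256))
    print(score.shape)   # one match score per clip in the toy batch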

Why is this useful?

DenseAV is an entirely unsupervised algorithm: it learns the meanings of words and the locations of sounds solely by watching videos. Its potential is vast, because it needs no labels and could be trained on the huge supply of instructional videos, eventually helping machines carry out everyday tasks on their own. This technology, which in some respects surpasses human capabilities, holds significant promise for the future.