Understanding speech and scene with ears and eyes

Grant description

One of the biggest challenges in AI is to develop computational abilities to understand speech and video scenes as effectively as humans do. This project aims to develop multimodal techniques for understanding and interpreting aural and visual inputs. These novel machine-learning-based techniques will first learn representations of visual stimuli and human speech at various abstraction levels, and then learn cross-modal correlations between those representations. This can be achieved by devising new network structures and using diverse uni- and multimodal datasets to train the various parts of the model, first separately and then jointly. As a result, we believe the accuracy of speech recognition, visual description, and interpretation will improve.

Start year

2022

End year

2024

Granted funding

Jorma Laaksonen

Aalto University

326 920 €

Role in the Academy of Finland consortium

Leader

Other consortium parties

Aalto University (345790)

329 586 €

Funder

Academy of Finland

Type of funding

Targeted Academy project

Call

ICT 2023: Frontier AI Technologies 2021

Other information

Funding decision number

345791

Fields of science

Computer and information sciences

Research fields

Computational data analysis

Identified themes

languages, linguistics, speech