Unseen Languages in Language Identification

Description of the grant

One of the most challenging problems in language identification is handling text written in unseen languages, that is, languages absent from a system's training data. Language identifiers base their predictions on training corpora covering a finite number of languages, and almost without exception they label any text they encounter as one of the languages in their repertoire. When the text is written in an unknown language, they assign the language they deem closest. The assigned language may be a close relative of the true one, or it may be a seemingly random choice. The handling of unseen languages was identified as an outstanding issue as early as 2006, yet it remains without any real solution. We have gathered a highly skilled group of collaborators with whom we will examine several case studies in which unseen languages pose practical problems for researchers or for the users of the language resources they create. We aim to significantly improve the understanding of the phenomenon and the methods used to handle it.

Start year

2025

End year

2029

Granted funding

Tommi Jauhiainen
695 479 €

Funder

Finlands Akademi

Type of funding

Academy Research Fellow

Decision maker

Research Council for Culture and Society
17.06.2025

Other information

Funding decision number

370756

Fields of science

Linguistics

Research fields

Linguistics