Unseen Languages in Language Identification
Description of the grant
One of the most challenging issues in language identification is the handling of texts written in unseen languages. Systems base their predictions on training corpora covering a finite number of languages. Almost without exception, the methods label whatever text they encounter as one of the languages in their repertoire. When they encounter text written in an unknown language, they label it with the language they deem closest. The indicated language can be anything from a close relative to a seemingly random choice. The handling of unseen languages was identified as an outstanding issue as early as 2006, yet it remains without any real solution. We have gathered a highly skilled group of collaborators with whom we will examine several case studies in which unseen languages pose practical problems for researchers or for the users of the language resources they create. We aim to significantly improve the understanding of the phenomenon and the methods used to handle it.
Start year
2025
End year
2029
Granted funding
Funder
Finlands Akademi
Type of funding
Academy Research Fellow
Decision maker
Forskningsrådet för kultur och samhälle
17.06.2025
Other information
Funding decision number
370756
Fields of science
Linguistics
Research fields
Linguistics