Massively multilingual modeling of registers in web-scale corpora

Description of the grant

This project combines the long traditions of corpus linguistics with the latest innovations in natural language processing (NLP) to explore web registers (situationally defined Internet text varieties such as news, blogs or how-to pages) on a massively multilingual scale. Specifically, the project 1) analyzes language-specific differences between registers and creates a data-driven description of the full range of web registers in six languages, 2) develops machine learning methods for the large-scale modeling of registers and their identification in massively cross-lingual and multilingual settings, and 3) automatically identifies registers in Universal Parsebanks, a language resource spanning 100 billion words and 64 languages. The project thereby provides critical knowledge about online communication, as well as methods for developing web data from masses of raw, unstructured text into organized resources with rich contextual information.

Start year

2020

End year

2024

Granted funding

Veronika Laippala

University of Turku

480 000 €

Funder

Academy of Finland

Type of funding

Academy Project

Call

Academy Project 2019

Other information

Funding decision number

331297

Fields of science

Linguistics

Research fields

Applied linguistics

Theme areas

Young generation of researchers 2019

Identified themes

languages, speech, linguistics