Massively multilingual modeling of registers in web-scale corpora
Grant description
This project combines the long tradition of corpus linguistics with the latest innovations in natural language processing (NLP) to explore web registers (situationally defined Internet text varieties such as news, blogs, or how-to pages) on a massively multilingual scale. Specifically, the project 1) analyzes language-specific differences in registers and creates a data-driven description of the full range of web registers in six languages, 2) develops machine learning methods for the large-scale modeling of registers and their identification in massively cross-lingual and multilingual settings, and 3) automatically identifies registers in Universal Parsebanks, a language resource spanning 100 billion words and 64 languages. The project thereby provides critical knowledge about online communication, along with methods for developing web data from masses of raw, unstructured text into organized resources with rich contextual information.
Start year
2020
End year
2024
Funding granted
Other information
Funding decision number
331297
Fields of science
Linguistics
Research fields
Applied linguistics
Theme areas
Young generation of researchers 2019
Identified themes
languages, speech, linguistics