Parallel Sentence Aligned Corpus of Finnish and Easy-to-read Finnish from the Yle News Archive 2014-2020, source

Description

This resource is available for download in Kielipankki – the Language Bank of Finland. It is a parallel corpus created from Yle news articles published between September 2014 and December 2020 by aligning the standard Finnish versions with their easy-language versions. The dataset, created by Anna Dmitrieva and available in CSV format, is aligned at the sentence level. It is based on the two parallel document-level datasets of Yle News articles available in Kielipankki (http://urn.fi/urn:nbn:fi:lb-2022111625 and http://urn.fi/urn:nbn:fi:lb-2024011701).

The dataset consists of the following parts:

1) Sentence alignments: parallel documents from regular and Easy Finnish Yle news articles aligned sentence by sentence. Only the "positive" documents were taken from the 2019-2020 dataset (http://urn.fi/urn:nbn:fi:lb-2022111625). All but 50 documents were aligned automatically with Vecalign (https://github.com/thompsonb/vecalign) using LASER embeddings (https://github.com/facebookresearch/LASER). Each document has the following columns:

1.1) pair_id: an id made up of three parts separated by a double underscore: the id of the regular document, the id of the Easy Finnish document (which contains a single underscore), and the sentence pair number.

1.2) regular_string: a sentence from the regular Finnish article.

1.3) selko_string: the corresponding sentence from the Easy Finnish article.

1.4) score: the confidence score given by Vecalign. The lower the score, the more similar the sentences. "Good" pairs are estimated to have a score of 0.65 or below; however, the score is not definitive proof that the sentences in a pair truly match in meaning. A score of zero is assigned when a sentence has no pair. In manually aligned documents, the score of every sentence pair with a non-zero score is set to 0.(3) (that is, 0.333... recurring).

2) Golden sentence alignments: 50 documents aligned manually by a human assessor (text). Also available in the ladder format (indexes).
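The column layout described above can be illustrated with a minimal Python sketch. It assumes a comma-delimited CSV with a header row naming the four columns, and it assumes the sentence pair number is an integer; neither detail is confirmed by the description. The sample rows, document ids, and the helper names `parse_pair_id`, `is_good_pair`, and `load_good_pairs` are invented for illustration and are not part of the dataset or of any tool shipped with it.

```python
import csv
import io
from typing import NamedTuple

# Threshold quoted in the description: "good" pairs score 0.65 or below.
GOOD_SCORE_MAX = 0.65


class PairId(NamedTuple):
    regular_doc_id: str
    selko_doc_id: str
    pair_number: int


def parse_pair_id(pair_id: str) -> PairId:
    """Split a pair_id on double underscores into its three parts.

    The Easy Finnish id keeps its single underscore because only the
    double underscore is used as the separator.
    """
    parts = pair_id.split("__")
    if len(parts) != 3:
        raise ValueError(f"expected three '__'-separated parts: {pair_id!r}")
    regular, selko, number = parts
    # Assumption: the sentence pair number is an integer.
    return PairId(regular, selko, int(number))


def is_good_pair(row: dict) -> bool:
    """Keep rows that have a partner sentence and a score within the threshold."""
    score = float(row["score"])
    # A score of zero marks a sentence with no pair, so it is excluded.
    return (
        score != 0.0
        and score <= GOOD_SCORE_MAX
        and bool(row["regular_string"])
        and bool(row["selko_string"])
    )


def load_good_pairs(handle) -> list:
    """Read aligned rows from an open CSV handle and keep the good pairs."""
    reader = csv.DictReader(handle)  # assumption: comma-delimited with a header
    return [
        (parse_pair_id(row["pair_id"]), row["regular_string"],
         row["selko_string"], float(row["score"]))
        for row in reader
        if is_good_pair(row)
    ]


# Invented sample rows; the ids and sentences are not taken from the corpus.
SAMPLE_CSV = """pair_id,regular_string,selko_string,score
docA__docB_1__0,Hallitus päätti asiasta tiistaina.,Hallitus päätti asiasta.,0.41
docA__docB_1__1,Päätös herätti laajaa keskustelua.,,0.0
docA__docB_1__2,Sää jatkuu lämpimänä.,Presidentti matkustaa Ruotsiin.,0.92
"""

good_pairs = load_good_pairs(io.StringIO(SAMPLE_CSV))
```

Because the score is only an estimate of similarity, a filter like this is best treated as a first pass; the manually aligned golden documents remain the more reliable reference.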

Publication year

2024

Type of data

Creators

Finnish Broadcasting Company (Yle) - Creator

University of Helsinki

Anna Dmitrieva - Curator

Project

Other information

Fields of science

Linguistics

Language

Finnish

Open access

Restricted access

License

CLARIN ACA+NC (Academic, Non Commercial) End User License 1.0

Keywords

Subject headings

Temporal coverage


Related to this research data