Turku Children's Book Corpus

Description

A corpus of Finnish children's books intended to be read by children in Finnish basic education. We have named it the "Turku Children's Book Corpus", or TCBC for short, as it was created by the TurkuNLP research group at the University of Turku.

Version 1.1 (released 2.10.2025): 525 books in total, 175 per age group. A total of 360 fiction, 45 non-fiction, and 120 textbooks. Size ~20.2 million words.

Version 1.0 (released 13.6.2025): 300 books in total, 100 per age group. A total of 210 fiction, 45 non-fiction, and 30 textbooks. Size ~11.6 million words.

The dataset contains the following files:
- physically-scanned-imgs : Photographs of each page of each physical book
- elibrary-book-imgs : Images of each page of each e-book
- google-doc-ai-layouts : OCR output from Google Document AI's Layout processor for the images of each book
- corrected-raw-texts : Raw texts extracted from the Google Document AI layouts, manually checked to fix most OCR mistakes
- trankit-jsons : Trankit output for each raw text file, used in creating the CoNLL-U files
- conllus : Full texts of the books, morphosyntactically parsed with Universal Dependencies annotations and stored in the CoNLL-U file format
- metadata : Various metadata files containing information on the books included in the dataset

TCBC exists in several versions, each designed to contain an equal number of books per age group. The age groups are 7-8 (which also includes some books for younger children), 9-12, and 13+. Three genres are also taken into account (novel, textbook, and non-fiction), with an equal number of books for each genre in each age group. Some additional books have been partially processed but, to adhere to this rule, are not part of the most recent version of the corpus. The specific books in each version are listed in the metadata files.
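As an illustration of the CoNLL-U format used by the files in conllus, here is a minimal Python sketch that reads CoNLL-U text and counts sentences, words, and part-of-speech tags. The sample sentences are invented for this example and are not taken from TCBC. The function name read_conllu is hypothetical, and excluding punctuation from the word count is an assumption, not necessarily how the corpus sizes above were computed.

```python
from collections import Counter

# Invented two-sentence sample in CoNLL-U format (NOT taken from TCBC).
# Each token line has 10 tab-separated columns:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = (
    "# sent_id = 1\n"
    "# text = Kissa nukkuu.\n"
    "1\tKissa\tkissa\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tnukkuu\tnukkua\tVERB\t_\tMood=Ind|Number=Sing|Person=3\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "# text = Koira juoksee pihalla.\n"
    "1\tKoira\tkoira\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tjuoksee\tjuosta\tVERB\t_\tMood=Ind|Number=Sing|Person=3\t0\troot\t_\t_\n"
    "3\tpihalla\tpiha\tNOUN\t_\tCase=Ade|Number=Sing\t2\tobl\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def read_conllu(text):
    """Yield each sentence as a list of (form, lemma, upos) tuples."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
            continue
        sentence.append((cols[1], cols[2], cols[3]))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


sentences = list(read_conllu(SAMPLE))
upos_counts = Counter(upos for sent in sentences for _, _, upos in sent)
# Assumption: punctuation tokens are not counted as words.
word_count = sum(n for tag, n in upos_counts.items() if tag != "PUNCT")

print(len(sentences), word_count, dict(upos_counts))
```

To apply the same function to a corpus file, the file contents could be read with open(path, encoding="utf-8").read(), where path stands for a file under conllus. The exact file names are listed in the dataset itself and are not assumed here.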

Publication year

2025

Type of data

Creators

Tapio Nojonen - Curator, Publisher, Creator

Filip Ginter - Contributor

Jenna Kanerva - Contributor

Kiia Korsu - Contributor

Mikko-Jussi Laakso - Contributor

Veronika Laippala - Contributor

Project

Other information

Fields of science

Computer and information sciences

Language

Finnish

Open access

Restricted access

License

Other (Not Open)

Keywords

finnish, children's literature, corpus

Subject headings

Temporal coverage


Related to this research data