Developing named-entity recognition for state authority archives

Year of publication

2025

Authors

Toivanen, Ida; Poso, Venla; Lipsanen, Mikko; Välisalo, Tanja

Abstract

Named entity recognition (NER) is one of the more common natural language processing tasks; it usually entails detecting entities such as persons, locations and dates in textual data. Due to the bureaucratic language present in the data from state authority archives, existing NER models may not perform as well as the researchers utilising them would wish. The diversity of the archival data, which contains texts from different domains, as well as noise due to imperfect optical character recognition (OCR), creates challenges for NER. This gave us an incentive to train our own NER model, FinArcNER, and see whether it would produce better classification results in an archival setting. The aim of our study was to answer the following research questions: 1) Does training with noisy archival data bring the needed improvement to the model performance? 2) Does training with noisy archival data skew the results with non-archival data? The FinArcNER model shows consistent performance when tested with modern and archival data (F1 scores 0.9200 and 0.8710, respectively). We can deduce from this that the increased diversity of the training data improved the model performance: even though we included archival data with OCR noise, the model still learned to detect named entities correctly from noise-free, non-archival data.

Organisations and authors

University of Jyväskylä

Toivanen Ida

Poso Venla

Publication type

Publication format

Article

Parent publication type

Conference

Article type

Other article

Audience

Scientific

Peer review

Peer-reviewed

Ministry of Education and Culture publication type

A4 Article in conference proceedings

Open access

Open access in the publisher's service

Yes

Open access of the publication channel

Fully open publication channel

Self-archived

Yes

Other information

Fields of science

Computer and information sciences; History and archaeology; Other humanities

Country of publication

Norway

Internationality of the publisher

International

Language

English

International co-publication

No

Co-publication with a company

No

DOI

10.5617/dhnbpub.12262

The publication is included in the Ministry of Education and Culture's data collection

Yes