Developing named-entity recognition for state authority archives
Publiceringsår
2025
Upphovspersoner
Toivanen, Ida; Poso, Venla; Lipsanen, Mikko; Välisalo, Tanja
Abstrakt
Named entity recognition (NER) is one of the more common natural language processing tasks, that usually entails the detection of entities like person, location and date from textual data. Due to the bureaucratic language present in the data from state authority archives, existing NER models may not perform as well as researchers utilising them would wish. The diversity of the archival data, containing texts from different domains, as well as noise due to imperfect optical character recognition (OCR), creates challenges for NER. This gave us an incentive to train our own NER model, FinArcNER, and see if our attempts would produce better classification results in an archival setting. The aim of our study was to answer the following research questions: 1) Does training with noisy archival data bring the needed improvement to the model performance? 2) Does the training with noisy archival data skew the results with non-archival data? The FinArcNER model shows consistent performance when tested with modern and archival data (F1 scores 0.9200 and 0.8710, respectively). We can deduce from this that the increased diversity of the training data improved the model performance – that is, even though we included archival data with OCR noise, the model still learned to detect named entities correctly from noise-free, non-archival data.
Visa merOrganisationer och upphovspersoner
Publikationstyp
Publikationsform
Artikel
Moderpublikationens typ
Konferens
Artikelstyp
Annan artikel
Målgrupp
VetenskapligKollegialt utvärderad
Kollegialt utvärderadUKM:s publikationstyp
A4 Artikel i en konferenspublikationPublikationskanalens uppgifter
Moderpublikationens namn
Förläggare
Volym
7
Nummer
3
ISSN
Publikationsforum
Publikationsforumsnivå
1
Öppen tillgång
Öppen tillgänglighet i förläggarens tjänst
Ja
Öppen tillgång till publikationskanalen
Helt öppen publikationskanal
Parallellsparad
Ja
Övriga uppgifter
Vetenskapsområden
Data- och informationsvetenskap; Historia och arkeologi; Övriga humanistiska vetenskaper
Nyckelord
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Publiceringsland
Norge
Förlagets internationalitet
Internationell
Språk
engelska
Internationell sampublikation
Nej
Sampublikation med ett företag
Nej
DOI
10.5617/dhnbpub.12262
Publikationen ingår i undervisnings- och kulturministeriets datainsamling
Ja