Manually annotated images for handwritten text recognition of Finnish migration records, including annotations for page de-skew, table structure, cell type classification, text recognition, and year recognition.
Source images: The source images for the annotations (JPG format) can be downloaded from https://zenodo.org/records/15836012.
Annotations: Annotations are in PageXML format (version 2013-07-15). Each annotation file can be paired with its corresponding source image by file name.
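The file-name pairing described above can be sketched in Python as follows. This is a minimal illustration, not part of the dataset tooling. It assumes that annotation files end in .xml, that images end in .jpg, and that the two share the same file stem. The function names pair_by_stem and read_transcriptions are hypothetical, and the directories are whatever local paths the files were downloaded to. The namespace URI is the one defined by the PAGE 2013-07-15 schema.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Namespace of the PAGE 2013-07-15 schema.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}


def pair_by_stem(annotation_dir, image_dir):
    """Pair each annotation with the image that has the same file stem.

    Assumes *.xml annotations and *.jpg images (an assumption, not
    something the dataset description states explicitly).
    Returns (pairs, unmatched): a dict mapping XML path to image path,
    and a list of XML paths for which no image was found.
    """
    images = {p.stem: p for p in Path(image_dir).rglob("*.jpg")}
    pairs, unmatched = {}, []
    for xml_path in sorted(Path(annotation_dir).rglob("*.xml")):
        image = images.get(xml_path.stem)
        if image is None:
            unmatched.append(xml_path)
        else:
            pairs[xml_path] = image
    return pairs, unmatched


def read_transcriptions(xml_path):
    """Return the text of every TextEquiv/Unicode element, in document order."""
    root = ET.parse(xml_path).getroot()
    return [u.text or "" for u in root.iterfind(".//pc:TextEquiv/pc:Unicode", PAGE_NS)]
```

Checking the unmatched list after pairing is a cheap way to catch an incomplete image download before training starts.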
Train/Dev/Test split: The data is divided into training, development and test directories. In addition, the pielavesi directory contains further annotations. Note that the pielavesi annotations are not randomly sampled: all of them come from the Pielavesi parish. We suggest using them as additional training data.
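One way to follow this suggestion is to add the pielavesi annotations to the training set only, leaving the development and test sets untouched. The sketch below assumes that the training directory is named "train" and that annotations are *.xml files. Both are assumptions, so adjust the names to match the downloaded archive. The function name training_annotations is hypothetical.

```python
from pathlib import Path


def training_annotations(root, train_dir="train", extra_dir="pielavesi", use_extra=True):
    """List annotation files to train on.

    The default train_dir name is an assumption, not taken from the dataset.
    Files from extra_dir are appended only when use_extra is True, so the
    effect of the non-random Pielavesi sample can be switched on and off.
    """
    root = Path(root)
    files = sorted((root / train_dir).rglob("*.xml"))
    if use_extra:
        files += sorted((root / extra_dir).rglob("*.xml"))
    return files
```

Because the Pielavesi pages come from a single parish, comparing development-set results with use_extra=True and use_extra=False shows whether the extra data helps or skews the model.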
See https://github.com/TurkuNLP/finnish-migration-data for more information about the project.
Annotations: CC-BY
Source images: See the license information in https://zenodo.org/records/15836012.