1 Introduction

In the last decade, deep neural networks have achieved state-of-the-art performance in challenging domains [105]. Key to this success is the availability of data with ground truth for training and evaluating such methods. Although such data exist for Document Image Analysis [109], especially for modern documents [71] and scene text detection and recognition [81, 82], performance on historical documents remains low compared to other Computer Vision problems that benefit from very large databases [154].

As deep learning methods require large amounts of data, large annotated datasets are essential. In recent years, a variety of datasets have appeared in journals, conference proceedings, and the competitions they host, for tasks such as document classification, word spotting, layout analysis, graphical object detection, and handwriting recognition on both modern and historical document images. Despite the complexity of historical documents, caused by inevitable degradation through centuries of usage and by their variability, their analysis has attracted much interest. The growth of digital libraries has contributed to this research by providing researchers with high-quality digitized images to process and analyze. As a result, researchers have introduced several datasets by curating and annotating the collections provided by these libraries.

1.1 Purpose and contributions

The main goal of this work is to provide researchers with an overview of publicly available historical document image datasets and the machine learning tasks they support. In particular, we summarize each dataset and report its benchmark tasks and results, including results from recent competitions based on these datasets. In doing so, we point to the appropriate method for every task and dataset and identify gaps and challenges in the field.

The major contributions of this work can be summarized as follows:

  • A systematic literature review of historical document image datasets is presented.

  • A summary of 65 historical document image studies, grouped into tasks related to document classification, document structure, and content analysis.

  • A tabular overview of the statistics, classes, tasks, languages, type of document, input visual aspects, ground truth information, and benchmarks for every dataset.

  • A discussion on the challenges, gaps and future research directions is provided.

Fig. 1: a A document page and examples of b metadata information related to document classification, c document structure, and d content analysis

1.2 Selection methodology

For the systematic literature review, we searched for the datasets presented at the International Conference on Document Analysis and Recognition (ICDAR) and the International Conference on Frontiers in Handwriting Recognition (ICFHR) over a time span of six years (2016–2021). For dataset identification, we further used the keyword queries “historical,” “document,” “image,” “analysis,” and “dataset” in the Google Scholar academic research database. We limited our search to the first 98 pages, or 980 of the roughly 188K total results, as the database does not return results beyond that point. We filtered the suitable papers by reading the abstracts and excluding those unrelated to historical document image datasets. Furthermore, through backward snowballing of related works, we included a few datasets that were not retrieved by the database. The final number of studies presented in this paper is 65.

1.3 Scope

This paper mainly focuses on technical aspects that could help machine learning researchers make full use of the available datasets. Other aspects that may be more relevant to experts from the humanities, such as historical and paleographic analyses that situate a historical document in its context and usage, are beyond the scope of this paper. For every dataset, we summarize the size, language, targeted tasks, type of annotations, and benchmark results. We further provide detailed tables that include dataset statistics, visual aspects of the input images, ground truth format, and the provided benchmarks.

The study is organized as follows. In Sect. 3, we report the existing datasets for historical document image analysis, organizing them according to the following categories:

  1. Document classification

  2. Document structure

  3. Content analysis

Sections 3.1, 3.2, and 3.3 include studies related to the aforementioned categories. Furthermore, Table 1 lists the datasets with their corresponding section, release year, writing type (handwritten or printed), and checkmarks according to the information they provide and the tasks they are, or could be, used for. Finally, we present a large table (see Table 2) with information on the reviewed datasets covering the statistics, classes, tasks, languages, document type, input image aspects, ground truth, and benchmarks. We note that to fully understand the table, the reader should consult the corresponding section for every dataset. An illustration of this table can be seen in Fig. 2. In Sect. 4, we discuss observations, challenges, and future directions for the domains of interest. Finally, we present our conclusions in Sect. 5.

2 Related work

Several surveys have addressed document image analysis. The work presented in [178] focused on automatic document processing, splitting it into document analysis and document understanding, related to the layout structure and the logical structure of a document, respectively. Lombardi and Marinai [105] surveyed papers that use deep learning methods for historical document image analysis, showing the connections between input and output for all methods and according to task; they further presented some historical datasets. Other surveys are task-specific. [78] provided a survey focusing on databases and benchmarks for handwriting recognition, while [131] covered both on-line and off-line handwriting recognition. Likforman-Sulem et al. [102] focused on automatic text-line segmentation methods for historical documents, and [121] concentrated on evaluation metrics and tools for Optical Character Recognition (OCR). Finally, other works focus on layout analysis [120] or word spotting techniques [65]. The work presented in [13] surveyed layout analysis techniques and listed several historical and modern datasets related to the task. To the best of our knowledge, we present the first literature review that systematically focuses on historical datasets and benchmarks.

3 Datasets

Various datasets exist for historical document image analysis, covering tasks such as layout analysis, baseline detection, handwriting recognition, binarization, and writer identification. In this review, we group the tasks discovered through our methodology as subtasks of document classification, document structure, and content analysis; correspondingly, the main components of a document are its metadata, structure, and content, as shown in Fig. 1. Notably, actual document analysis pipelines can become much more complex in real scenarios (see footnote 1), but for a general grouping of the tasks, these three categories are sufficient. We present the existing historical document image datasets considering these components and order them by release date (earliest to latest) within every subsection. Furthermore, a tabular overview of these datasets, with checkmarks on the subtasks they include according to their ground truth, is presented in Table 1. Since several datasets can belong to more than one task category, we place each in the first section where a benchmark task exists; however, in Table 1 we checkmark all possible tasks.

Table 1 Listed datasets and their release year, summary section, document type, whether the dataset is part of a competition, and information about the different tasks, divided according to document classification, structure, or content analysis

3.1 Document classification datasets

To process and understand a document, it is vital to categorize it. When and where was this document written? What font or script was used in this document (or in parts of it, if it was revisited later)? This information is helpful for subsequent document understanding steps, as it provides the researcher with context about the era, the writer, and more. Document classification refers to categorization according to a document’s geographical and chronological origin and its written script or font. In the following subsections, we summarize six datasets covering date/age, font/script, and location classification. Three studies refer to competitions highly related to each other, and the other three introduce datasets with font classification as the main task.

3.1.1 ICFHR 2016 competition on the classification of medieval handwritings in Latin script (ICFHR16 CLaMM)

The ICFHR16 Competition [37] provided a collection of grayscale Latin manuscript images for script type classification. Two tasks were proposed: Task 1, which uses a single label for each image, and Task 2, which uses multiple weighted labels for each image. The dataset comprises 12 script classes and 3 sets: the training set contains 2K images, and the test sets for Tasks 1 and 2 contain 1K and 2K images, respectively. The average accuracy (Acc) and the average intraclass distance were used to evaluate the proposed systems. The system that achieved the highest accuracy for Task 1 used I-vector extraction [43] on image patches and classified the extracted vectors using latent Dirichlet analysis (LDA). For Task 2, the best performing system in terms of Final Score utilized a neural network architecture named DeepScript (see footnote 2) and pre-processing that fed random image crops and various perturbations of them to the network classifier as input. In terms of the average intraclass distance, the highest-ranked system in both tasks, FRDC-OCR, consisted of a Convolutional Neural Network (CNN) classifier applied on patches, where for every image the result is the average of the recognition confidences and feature vectors across its patches.
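
To make the patch-based protocol concrete, the sketch below (our illustration, not the competition systems described in [37]) averages per-patch classifier confidences into a page-level script prediction; `patch_classifier` is a placeholder for any model that returns a softmax vector per patch.

```python
import numpy as np

def extract_patches(img, size=227, stride=227):
    """Slide a fixed-size window over a grayscale page image."""
    h, w = img.shape
    patches = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            patches.append(img[y:y + size, x:x + size])
    return patches

def classify_page(img, patch_classifier, n_classes=12):
    """Average per-patch class confidences into a page-level prediction."""
    patches = extract_patches(img)
    probs = np.zeros(n_classes)
    for p in patches:
        probs += patch_classifier(p)  # placeholder: softmax vector per patch
    probs /= max(len(patches), 1)
    return probs.argmax(), probs
```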

3.1.2 ICDAR 2017 competition on the classification of medieval handwritings in Latin script (ICDAR17 CLaMM)

Similar to the ICFHR16 CLaMM competition presented in Sect. 3.1.1, several tasks were proposed for the ICDAR17 CLaMM competition [38]. Task 1 was script type classification, and Task 2 was script type classification on heterogeneously encoded data. The available training dataset for these tasks contains 3,500 manuscript images from the previous year’s competition. Three thousand of these training images were used, with extended labels, for Task 3, manuscript date classification, and Task 4, manuscript date classification on heterogeneously encoded data. The date classification data were distributed across 15 classes ranging from 500 C.E. to 1600 C.E. Tasks 1 and 3 were evaluated on a 2K-image test set, while Tasks 2 and 4 were evaluated on a 1K-image test set. For Tasks 1 and 2, the evaluation criterion was the accuracy per script type, while for Tasks 3 and 4, it was the accuracy per date. The winners of Tasks 1 and 3 applied T-DeepCNN, a CNN with residual connections [72] and batch normalization [79], on image patches of \(227\times 227\) pixels, averaging over the patches of every image for a final prediction. They further enhanced the performance of their model by using an ensemble of 5 CNN classifiers. For Tasks 2 and 4, the winning approach (CK2) was a linear Support Vector Machine (SVM) classifier with a squared hinge loss on vectors derived from combining PCA-whitened RootSIFT local descriptors with the global vector of locally aggregated descriptors (VLAD). This system was based on [25, 26].
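
As an illustration of the descriptor-encoding side of such an approach, here is a generic VLAD aggregation sketch in numpy; it assumes a pre-computed k-means codebook and local descriptors (e.g., RootSIFT) and omits the PCA whitening step of the actual CK2 system.

```python
import numpy as np

def vlad(descriptors, codebook):
    """Aggregate local descriptors into a VLAD vector.

    descriptors: (n, d) local features; codebook: (k, d) k-means centroids.
    Returns a power- and L2-normalized (k*d,) vector.
    """
    # Assign each descriptor to its nearest centroid
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assignment = dists.argmin(1)
    k, d = codebook.shape
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assignment == i]
        if len(members):
            v[i] = (members - codebook[i]).sum(0)  # residuals to centroid
    v = np.sign(v) * np.sqrt(np.abs(v))            # power normalization
    norm = np.linalg.norm(v)
    return (v / norm if norm else v).ravel()       # L2 normalization
```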

3.1.3 KERTAS

The KERTAS dataset [1] consists of handwritten Arabic manuscripts from the Qatar National Library intended for age and writer detection. It contains 2,502 high-resolution document images and their corresponding date annotations according to the Islamic century in which they were written. The dataset additionally provides the source, manuscript name, description, writer name, and ID information in XML format for every manuscript. Furthermore, an age detection algorithm based on sparse representations was introduced in this work, and results on the dataset using different image size inputs were shown. This method was also compared with three writing style feature algorithms, Run Length [19], Edge Direction [17], and Edge Hinge [45], using 3-NN as the classifier. Both sets of experiments were conducted using pre-defined and random splits, with the pre-defined splits performing better in all cases. The proposed method on the \(50 \times 50\) image size achieved the best performance.

3.1.4 Dataset of pages from early printed books with multiple font groups

This dataset [161] contains 35,623 images for font classification distributed over 12 classes: Antiqua, Italic, Textura, Rotunda, Gotico-Antiqua, Bastarda, Schwabacher, Fraktur, Greek, Hebrew, “Other Fonts,” and “Not a Font.” Several of the data samples have multiple labels; thus, this dataset is appropriate for testing multilabel classification methods. The dataset is also highly imbalanced. Baseline results for ResNet with 50 and 18 layers [72], VGG with 16 layers [170], and DenseNet with 121 layers [75] were presented, reporting the mean intersection over union (mIoU) metric. This dataset was further used in the historical document classification competition [163] held at ICDAR 2021; further information about this competition is presented in Sect. 3.1.5.
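
A minimal sketch of how multilabel font classification differs from the single-label case: each of the 12 classes gets an independent sigmoid trained with binary cross-entropy (a hypothetical setup for illustration, not the paper’s exact baselines).

```python
import torch
import torch.nn as nn
from torchvision import models

# 12 font-group labels; a page may carry several at once.
model = models.resnet18(num_classes=12)   # raw logits, no softmax
criterion = nn.BCEWithLogitsLoss()        # independent sigmoid per class

def training_step(images, targets):
    """images: (B, 3, H, W); targets: (B, 12) multi-hot float tensor."""
    logits = model(images)
    return criterion(logits, targets)

def predict(images, threshold=0.5):
    """Return a multi-hot prediction per image."""
    with torch.no_grad():
        return (torch.sigmoid(model(images)) > threshold).int()
```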

3.1.5 ICDAR 2021 competition on historical document classification (HDC 2021)

The competition on historical document classification [163], hosted at ICDAR 2021, included three single- or multilabel document classification tasks: font/script, location, and date. For the first task, the organizers provided the multiple font group dataset presented in Sect. 3.1.4 as the training set for font classification, and the ICDAR17 (Sect. 3.1.2) and ICFHR16 (Sect. 3.1.1) CLaMM datasets for script classification; two new test sets were introduced for each task. For the date classification task, new training and test sets containing 11,294 and 2,516 images, respectively, were introduced, with the ICDAR17 CLaMM dataset suggested as an additional training set. New training, validation, and test sets of French manuscript images with 13 location labels were introduced for the location task. The competition results were evaluated using top-1 accuracy on the test sets for the font/script and location tasks and the mean average error (MAE) for the date classification task. For all tasks, the winning team used CNNs operating either on non-overlapping patches of four different scales or on text lines acquired through segmentation [91]. For the date classification task, instead of the cross-entropy loss used in the other tasks, an interval regression loss was used, treating the task as a regression problem.
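
A minimal sketch of the regression view of date classification, assuming a generic feature extractor; the winning team’s interval regression loss is more elaborate, but the idea of predicting a scalar date and scoring with MAE is the same.

```python
import torch
import torch.nn as nn

class DateRegressor(nn.Module):
    """Treat date classification as regression on a single scalar (year)."""
    def __init__(self, backbone, feat_dim):
        super().__init__()
        self.backbone = backbone  # any feature extractor: x -> (B, feat_dim)
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, x):
        return self.head(self.backbone(x)).squeeze(-1)  # predicted year

def mae(pred_years, true_years):
    """Mean average error, the HDC 2021 metric for the date task."""
    return (pred_years - true_years).abs().mean()

# Training would minimize e.g. nn.L1Loss() between predicted and true
# years, rather than a cross-entropy over discrete date classes.
```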

3.1.6 The BIR database

The Bold-Italic-Regular (BIR) database [8] consists of printed historical document pages with word bounding boxes and three font classes (bold, italic, and regular), intended for word detection and font style classification. It includes 285 scanned pages from various catalogues of the nineteenth and twentieth centuries written in French, Latin, and other languages. Baseline results were presented using random 50% training, 25% validation, and 25% test splits (TVT) and fivefold cross-validation (CV5). For word detection, YOLOv5m [80] was utilized, and for style classification, MobileNetV2 [157] and Xception [24] were utilized. All models were evaluated according to the F1 score. The MobileNetV2 style classification results were also compared with those of a human expert on 1K randomly chosen words, showing similar performance between the model and the expert.

Fig. 2: Illustration of Table 2 and its attributes. A document image dataset can be characterized by different statistics, languages, and types of documents. The dataset contains input images for different tasks, with specific visual aspects (mode, resolution, and format) and their corresponding annotation type and format. Benchmarks with different models, metrics, and performances are then created for the different tasks

3.2 Document structure datasets

The structure of a document refers to the organization of every element within it. After detecting the different objects present in a document, the aim is to classify them. A document may consist of several blocks of text, such as titles, paragraphs, main body text, and text lines, as well as graphics, tables, and more. The arrangement of these elements in specific places of the document constitutes its layout. Detecting and extracting these elements is essential for capturing the geometry of a document. Many datasets are publicly available to promote research on the structure of documents. In this subsection, we present the datasets aimed at tasks related to the geometric structure of documents, such as layout analysis (detection and segmentation), baseline and text-line detection, table detection, and graphic recognition. We also include binarization, although it could be considered a pre-processing step. The number of summarized studies in this subsection is 24.

3.2.1 Persian handwritten text dataset (PHTD)

PHTD [2] is a 140-page dataset of handwritten documents in the Persian language. The dataset includes 1,787 text lines and 27,073 words for text recognition and word and line segmentation tasks. For the former task, a Unicode text file is included for every page, and for the latter, a pixel-labeled file is provided as the ground truth. Two algorithms were evaluated for text-line segmentation: the potential piece-wise separation line (PPSL) method [3] obtained a segmentation accuracy of 89.43%, while the method proposed by Alaei et al. [4] outperformed it with an accuracy of 94%. In both methods, each document is split vertically into stripes.
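
Both methods rely on stripe-wise projection profiles; the following simplified sketch (our illustration, not the authors’ code) finds candidate line separators per vertical stripe of a binarized page, which piece-wise methods then join across stripes.

```python
import numpy as np

def stripe_line_boundaries(binary_page, n_stripes=8, min_gap=5):
    """Find text-line separators per vertical stripe of a binarized page.

    binary_page: 2-D array with 1 for ink, 0 for background. Returns, for
    each stripe, row indices of sufficiently wide ink-free gaps.
    """
    h, w = binary_page.shape
    stripe_w = w // n_stripes
    boundaries = []
    for s in range(n_stripes):
        stripe = binary_page[:, s * stripe_w:(s + 1) * stripe_w]
        profile = stripe.sum(axis=1)          # horizontal projection
        gaps = np.flatnonzero(profile == 0)   # ink-free rows
        seps, run = [], [gaps[0]] if len(gaps) else []
        for r in gaps[1:]:                    # group consecutive empty rows
            if r == run[-1] + 1:
                run.append(r)
            else:
                if len(run) >= min_gap:
                    seps.append(run[len(run) // 2])
                run = [r]
        if len(run) >= min_gap:
            seps.append(run[len(run) // 2])
        boundaries.append(seps)
    return boundaries
```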

3.2.2 Persian, Bangla, Oriya and Kannada (PBOK)

The PBOK dataset [5] includes images of 707 handwritten pages in four languages by numerous writers: 140 pages written in Persian, 228 in Kannada, 199 in Bangla, and 140 in Oriya. The dataset contains both pixel- and content-level annotations. PBOK is considered quite complex, as it contains handwriting in both directions (left to right and right to left) and overlapping text. The authors conducted line segmentation experiments for each language part and for the whole dataset using two algorithms, the potential piece-wise separation line (PPSL) method [3] and the method proposed by Alaei et al. [4]. The latter achieved a detection rate (DR) of 91.33%, a recognition accuracy (RA) of 90.41%, and an overall segmentation result (TLDM) of 90.87%, outperforming PPSL, which achieved 88.07%, 86.69%, and 87.38%, respectively.

3.2.3 IMPACT

IMPACT [129] is a large-scale dataset of over 600K images derived from different European libraries. The provided PAGE XML files [132] contain layout, reading order, and text transcription annotations for over 45K samples. The collection also offers metadata, including bibliographic information, digitization information, physical characteristics, copyright information, administrative information, and comments. The dataset includes documents written in 18 different languages (Bulgarian, Catalan, Czech, Dutch, English, French, German, Greek, Hebrew, Latin, Norwegian, Old Church Slavonic, Polish, Portuguese, Russian, Slovenian, and Spanish) and 10 scripts (Bohoričica, Cyrillic, French, Gaj, Greek, Hebrew, Latin, Latin/Gothic, Old Cyrillic, and Serif). A web interface provides access to the image samples and annotations, giving users various options to browse and search. The paper did not present any benchmark results for this dataset.
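
Many datasets in this review ship PAGE XML ground truth; below is a minimal parsing sketch, assuming the common 2013-07-15 PAGE namespace (the exact schema version may differ per dataset).

```python
import xml.etree.ElementTree as ET

# Namespace used by PAGE XML files (version may differ per dataset).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def read_page_xml(path):
    """Extract region polygons and line transcriptions from a PAGE XML file."""
    root = ET.parse(path).getroot()
    regions = []
    for region in root.iterfind(".//pc:TextRegion", NS):
        coords = region.find("pc:Coords", NS).get("points")
        lines = [
            (line.find("pc:Coords", NS).get("points"),
             line.findtext("pc:TextEquiv/pc:Unicode", default="",
                           namespaces=NS))
            for line in region.iterfind("pc:TextLine", NS)
        ]
        regions.append({"coords": coords, "lines": lines})
    return regions
```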

3.2.4 Europeana newspapers project (ENP)

The Europeana Newspapers Project (ENP) dataset [32] consists of European cultural heritage documents in 13 different languages from 12 European libraries, published in newspapers from the seventeenth to the twentieth century. All page images in the dataset are either 300 or 400 dpi, with a broad distribution of grayscale, bitonal, and color pages. The ground truth contains region outlines and their labels (text lines, text regions, tables, images/graphics, and blocks/zones), Unicode text transcriptions, and reading order. In addition to conventional downloading, the dataset can be accessed through a web interface with several services, such as document and attachment retrieval and a search function to determine whether a document exists in the database. A performance evaluation for OCR and layout analysis was conducted using a commercial system (ABBYY FineReader 11) and an OCR system (Tesseract 3.03; see footnote 3).

3.2.5 GRPOLY-DB

The Greek polytonic database (GRPOLY-DB) [62] contains images of printed and handwritten documents from different sources, distributed across four subsets: GRPOLY-DB-Handwritten, GRPOLY-DB-MachinePrinted-A, GRPOLY-DB-MachinePrinted-B, and GRPOLY-DB-MachinePrinted-C. The documents were written or printed in the old polytonic system from 1838 to 1977. The overall dataset includes 399 pages, 15,084 text lines, 102,596 words, and 171,511 characters with ground truth. The provided annotations include text-line and word-level segmentation information as well as transcriptions for text recognition and isolated character recognition. Layout- and content-related experimental results were reported on the dataset using various methods. For text-line segmentation, a shredding-based system [123], which achieved an average F-measure (FM) of 94.58% on the four subsets, outperformed a Hough transform method. For word segmentation, sequential clustering [89] and Gaussian mixture-based [106] methods were applied, with the former achieving the highest average FM of 94.85%. GRPOLY-DB-MachinePrinted-B was the only subset used for isolated character recognition, in two scenarios: one with all character instances and another with 30 random samples per class, i.e., 143,051 and 3,750 characters, respectively. Two algorithms were evaluated in both scenarios, HoG features [41] with an SVM classifier and adaptive window features [61] with a k-NN classifier. For both systems, the first scenario obtained the highest RA, and between the two methods, HoG features with the SVM classifier obtained the highest values in both scenarios. In addition, OCR experiments were performed at the character and word levels using Tesseract (see footnote 3) and ABBYY FineReader, with the latter performing best. Finally, the mean average precision (mAP) was presented for query-by-example word spotting, where profile features with dynamic time warping (DTW) [145] for feature vector comparison obtained the best results on the whole dataset.
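
For intuition, a plain DTW distance between two variable-length feature sequences (e.g., per-column profile features of word crops) can be computed as follows; this is a generic sketch, not the exact matching of [145].

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # length-normalized

# Word spotting then ranks candidate word images by DTW distance
# to the query image's feature sequence.
```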

3.2.6 DIVA-HisDB

DIVA-HisDB [168] is a database of 150 images derived from three medieval manuscripts of the eleventh and fourteenth centuries with complex layouts. For every manuscript, it provides 20 training, 10 validation, 10 test, and 10 left-out images, annotated at pixel level in the PAGE format for the following classes: main text body, decorations, and comments. Benchmark results applying convolutional autoencoders (N-light-N) [160] showed an average accuracy of approximately 95% for pixel classification, along with per-class accuracies. Additionally, the challenges the dataset poses for text-line segmentation were demonstrated using the Seam Carving [10] and OCRopus [16] methods. The HisDoc-Layout-Comp competition at ICDAR 2017 [169] used DIVA-HisDB to evaluate systems on layout analysis (Task 1), baseline detection (Task 2), and text-line segmentation (Task 3). For layout analysis, the best overall performance in terms of mIoU was achieved by a fully convolutional network (FCN) that segmented every image at pixel level. The best performing system for Tasks 2 and 3 deployed adaptive run-length smoothing (ARLS) [124] to propose text lines and then processed them using the seam carving algorithm.
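
The mIoU metric used in this evaluation can be computed per page as in this short sketch.

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Mean intersection over union for pixel labels (2-D int arrays)."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:                  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```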

3.2.7 HBA 1.0

HBA 1.0 [113] is a collection of 11 books comprising 4,436 pages of manuscripts and printed documents from the Gallica digital library, dating from the thirteenth to the nineteenth century and written in different scripts and languages. The ground truth comes in two forms: images in which every foreground pixel is colored according to its class among six pre-defined classes (graphics, main text body, capitalized text, handwritten text, italic text, and footnote text), or text files containing the label of every pixel. As a baseline, the paper reported the pixel-level classification accuracy (CA) of a texture-based layout segmentation method [112], which averaged 75.9% over 4 books. The ICDAR 2017 and ICDAR 2019 Competitions on Historical Book Analysis [114] used this dataset for two tasks: textual and graphical content discrimination at pixel level, and pixel-level annotation of textual content. The highest overall performance for both tasks was achieved by an FCN operating on \(512\times 512\) patches with a weighted cross-entropy loss function.

3.2.8 READ-BAD

The READ-BAD dataset [68] contains 2,035 images of simple and complex documents with 132,123 baselines for baseline detection. The images were derived from 9 European archives and written from 1470 to 1930; the PAGE XML annotation format [132] was used as ground truth. The ICDAR 2017 Competition on Baseline Detection (cBAD) [44] used READ-BAD for Track A, text-line segmentation on simple documents, and Track B, text-line detection on complex documents with noise and various layout elements, where only the page is provided. The dataset provides 216 training images for the simple layouts and 270 for the complex ones. An evaluation scheme for baseline detection was introduced using the R-value, P-value, and F-value. The R-value and P-value are similar to the recall and precision metrics, respectively, while the F-value is their harmonic mean. The best performing system for both tracks used a U-Net-based architecture (DMRZ submission) that extracted baselines and text regions of interest, performing a layout classification according to the provided classes as a pre-processing step. Post-processing further improved the predicted baselines through detection error pruning and baseline fragment merging.
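
For reference, with P and R denoting the P-value and R-value, the F-value takes the usual harmonic-mean form:

\(F = \dfrac{2 \cdot P \cdot R}{P + R}\)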

3.2.9 Warped Arabic

In [47], a dataset of 200 historical Arabic document images from the sixteenth to the nineteenth century, collected from four different libraries, was introduced. The images were derived from books, newspapers, legal documents, and journals and contain a variety of layouts and states of degradation that pose challenges for baseline detection and text-line segmentation. The PAGE XML ground truth [132] contains text line-level information and metadata covering bibliographic knowledge (author, title, date, location, document type, and page number), physical properties (language, script, font, and number of columns), and copyright data. Results were shown for text-line segmentation using four methods: Voronoi diagrams, a smearing method, a hybrid approach, and a projection profile-based method, each evaluated at three warping percentages: 0%, 25%, and 50%. The results suggested that the Voronoi diagrams achieved the highest success rate, and that the more warped the text lines are, the more challenging the task becomes: performance decreases as the curvature increases.

3.2.10 Oficio de Hipotecas de Girona (OHG)

The OHG dataset (see footnote 4) is a set of 596 pages of Spanish deeds written by a single writer in the eighteenth century, with a complex layout of six different region types, all containing only text: page number, notarial typology, paragraph of text that begins next to a notarial typology, paragraph that begins on a previous page, marginal note, and marginal note added a posteriori to the document. The dataset includes more than 23,700 lines, a 2,400-word vocabulary, and PAGE XML ground truth files for both layout analysis and handwritten text recognition.

3.2.11 Pinkas

The Pinkas dataset [93] is a collection of 30 handwritten medieval Hebrew pages intended for page, line, and word segmentation tasks. The training and test sets contain 10,397 and 3,278 images, respectively, derived from records of European Jewish communities from 1500 to 1800. The mAP of different word spotting methods, including CNN variations, was reported. Three methods provide a baseline for the dataset: a Siamese CNN [18] and PHOCNet [171] were compared as segmentation-based methods, and an SVM with HOG descriptors [6] was used as a segmentation-free method. The Siamese CNN, which achieved a mAP of 61.5%, outperformed the other methods. PHOCNet achieved a mAP of 53.3% with one-hot encoding and 56.6% without it. The SVM did not perform well, achieving a mAP of 1.5%.

3.2.12 BADAM

BADAM [88] is a baseline detection dataset containing 320 training and 80 test pages of Arabic and Persian handwritten text. The pages are derived from different sources and contain medical tracts as well as religious, legal, poetic, and other content. The dataset provides 107,700 lines in two annotation formats: PAGE XML [132] and bitmasks. A convolutional baseline layout analysis (C-BLLA) system, which classified every baseline pixel using a U-Net model [152] and then extracted the baselines, was evaluated using the READ-BAD scheme [68] (Sect. 3.2.8), reporting the P-value, R-value, and F-value on the BADAM and Latin cBAD [44] test sets. The results suggested that baseline detection in Arabic script is more challenging than in Latin script.

3.2.13 HORAE

The HORAE dataset [15] contains 557 images derived from books of hours, with their corresponding layout and text-related annotations. The images originate from the full HORAE corpus, which consists of 500 manuscripts and 107,227 pages. To create the final annotated dataset, a selection pipeline first classified pages into the following classes: binding, white page, calendar, miniature, miniature-and-text, text-with-miniature, and full-page text, excluding binding and white pages and keeping two images per class. The filtered pages were then clustered to retain one representative per cluster and to detect pages with rare layouts, considered outliers; the centroids of the most frequent layouts and the strongest outliers formed the final 557-image set that was annotated. A PAGE XML [132] file accompanies every image, with annotations for page, miniature, border elements, initials, and other decorations found in the text body, such as line fillers, music notations, and ornaments. Benchmark results were presented for line detection and layout analysis using the dhSegment segmentation network [126], evaluated according to the IoU at different thresholds and with post-processing.

3.2.14 ICDAR 2019 competition on table detection and recognition (cTDaR)

The cTDaR competition of 2019 [59] held two tracks, Track A for table region detection and Track B for table recognition, over two datasets, one modern and one historical. The historical dataset includes civil records containing various handwritten tables sourced from 23 different institutions. For the table detection track, it provides 600 training and 199 test images. The table recognition track provides 600 training and 150 test samples for two subtracks: B1, which provides the table regions, and B2, which provides no a priori knowledge, hence requiring both region and structure detection. Results from 11 teams for Track A and two teams for Track B were compared. The winning team for Track A achieved a weighted average (WA) F1 score of 0.94 on the historical documents by using a classifier to separate modern and archival samples and Faster-RCNN [147] for table detection, merging overlapping regions that exceeded a given threshold. For the second track, the best submission achieved a WA F1 of 0.48 for B1 and 0.47 for B2; it used an FCN to obtain the tables’ guiding lines and junction points for broken line repair, extracted the cells through connected component analysis, and handled the row and column ranges through a neighbor graph.
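
Assuming the cTDaR convention of computing F1 at several IoU thresholds and weighting stricter thresholds more heavily (here, weight equal to the threshold itself, which is our reading of the protocol), the WA F1 can be sketched as:

```python
def weighted_average_f1(f1_at_iou):
    """cTDaR-style weighted average F1.

    f1_at_iou: dict mapping IoU threshold -> F1 at that threshold,
    e.g. {0.6: 0.97, 0.7: 0.96, 0.8: 0.95, 0.9: 0.90}.
    """
    num = sum(t * f1 for t, f1 in f1_at_iou.items())
    return num / sum(f1_at_iou)  # sum of the thresholds (the weights)
```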

3.2.15 ICDAR 2019 competition on digitized magazine article segmentation (DMAS2019)

This competition (see footnote 5) aimed to recognize and classify parts of articles present in digitized historical magazines. It provided 50–100 annotated images from magazines dating from 1800 to 1938, held by the National Library of the Netherlands, along with their layout and OCR ground truth. The annotations include cover, table of contents, content, and index as page classes, and article, illustration with caption, advertisement, index, and colophon as article classes. The competition does not appear to provide information about the submitted systems.

3.2.16 ICDAR 2019 competition on document image binarization (DIBCO 2019)

The latest of the DIBCO competition series, held in 2019 [139], aimed at evaluating systems for the task of document image binarization. The series started in 2009 [60] and had several rounds for printed and handwritten document images [135,136,137]. The 2019 competition included two categories: CATEGORY I, which provided 10 historical handwritten and printed test images from the nineteenth century, and CATEGORY II, which provided 10 test images derived from papyri from various places in Egypt. For CATEGORY I, the best performing method used noise reduction followed by an ensemble of three clustering algorithms (Fuzzy C-Means, K-Medoids, and K-Means++) to group the foreground and background of the input images. The best performing system for CATEGORY II used the LadderNet architecture [194] on \(48\times 48\) image patches. All systems were evaluated using the F-measure (FM), pseudo-FM (\(F_{ps}\)), PSNR, and the distance reciprocal distortion metric (DRD).
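
Two of these metrics are easy to state precisely; the sketch below computes the pixel-level F-measure and PSNR for a predicted binarization (pseudo-FM and DRD require skeletonized ground truth and weight matrices and are omitted).

```python
import numpy as np

def binarization_scores(pred, gt):
    """F-measure and PSNR for binarized images (1 = ink, 0 = background)."""
    pred, gt = pred.astype(float), gt.astype(float)
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fm = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    mse = np.mean((pred - gt) ** 2)
    psnr = 10 * np.log10(1.0 / mse) if mse else float("inf")
    return fm, psnr
```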

3.2.17 ABP & NAF

The work presented in [133] introduced a method for layout analysis and page sub-division that groups text lines into semantic objects. To evaluate the proposed method, the authors used the ABP dataset (ABP small) [36] and further introduced an extension of it, ABP large, as well as the National Archive Finland (NAF) dataset. ABP small, ABP large, and NAF contain 180, 1,098, and 488 pages, respectively, and were used for table row, column, and cell segmentation, with F1 measures reported; cell partitioning showed the most promising results.

3.2.18 Finnish court records-sub500 (FCR)

The FCR dataset [143] includes 500 pages from the Renovated District Court Records of Finland from the nineteenth century. The images contain both single and double pages, which makes the dataset quite complex, and the corresponding ground truth contains baseline- and layout-level annotations. The layout regions included are: page number, marginalia, paragraph, paragraph2, table, and table2. The ground truth further includes line-level transcriptions in the Swedish language.

3.2.19 IlluHisDoc

In [118], a test set of Gallica images named IlluHisDoc was presented to study segmentation generalizability. The set is split into four types of documents: (a) printed documents with drawings, photos, ornaments, and paintings, and manuscripts that contain (b) scientific graphs, (c) illuminations, and (d) drawings. Moreover, a segmentation method based on a ResNet-18 [72] backbone encoder-decoder architecture was proposed. Its performance was compared with that of Tesseract 4 (see footnote 3) and Mask-RCNN [74], using pre-training either on the synthetic dataset PubLayNet [193] or on SynDoc, a 10K-image synthetic corpus created for this study. The proposed method pre-trained on SynDoc outperformed the other methods according to the mIoU results.

3.2.20 Newspaper navigator

The Newspaper Navigator [99] is a dataset extracted from the Chronicling America historical newspaper collection. It was created by running an object detection pipeline, which extracts visual and headline content, over the 16.3 million collected pages. The dataset provides 3,559 images with 48,409 COCO format annotations [103] for detection across 7 classes: headline, photograph, illustration, comic, map, editorial cartoon, and advertisement. It also provides the OCR text within each predicted bounding box, as well as ResNet-18 and ResNet-50 embeddings [72] for the crops of the different visual categories. Additional metadata in CSV format contain information such as file path, image URL, page URL, publication date, page sequence number, edition sequence number, batch name, LCCN, bounding box coordinates, prediction score, OCR, place of publication, geographic coverage, newspaper name, and newspaper publisher. A fine-tuned Faster-RCNN model [147] with an R50-FPN backbone achieved a mAP of 63.4% on the validation set, with per-class AP also reported. The authors further chose 500 random pages from each of the periods 1850–1875 and 1875–1900 as test sets and presented the mAP on the most frequently appearing classes: headline, advertisement, illustration, and a single class combining all visual content. These results were slightly worse than those on the validation set, especially for the 1850–1875 test set.

3.2.21 HJDataset

HJDataset [166] was introduced at the Text and Documents in the Deep Learning Era Workshop hosted by CVPR 2020. It contains 2,271 pages from Japanese biography scans for layout analysis, with COCO annotations [103] derived by a semi-rule-based method. The ground truth further includes reading order and dependency structure information. Benchmark results were shown for popular object detection models implemented in Detectron2 [186], such as Faster-RCNN [147], Mask-RCNN [74], and RetinaNet [104]. Moreover, few-shot and zero-shot learning results using COCO weights were presented.

3.2.22 GloSAT

GloSAT [195] is a table structure recognition dataset of 500 archival images of meteorological records, printed, handwritten, or mixed. The dataset offers two types of ground truth: individual cell and coarse segmentation cell annotations. In addition to the conventional cTDaR19 XML format, it provides the widely used Pascal Visual Object Classes (VOC) format [50] and extends both formats with cell information such as headers, page type, and table style. A benchmark evaluation of GloSAT (individual cell and coarse segmentation cell separately), cTDaR19, and their combination (+cTDaR19) was presented using CascadeTabNet [134], with and without an additional post-processing step proposed by the authors. This post-processing uses a 1-D DBSCAN clustering algorithm [158] to infer the vertical and horizontal lines of a table, assuming that, for a rectangular table, only a subset of cells is needed to place the rest. The WA F1 scores showed that post-processing helps performance in all experimental cases.
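
The idea behind this post-processing can be illustrated with a toy version: cluster detected cell edges along one axis with a 1-D DBSCAN and take the cluster means as inferred table lines (a simplified sketch with a hypothetical `eps`, not the authors’ implementation).

```python
import numpy as np
from sklearn.cluster import DBSCAN

def infer_column_lines(cell_boxes, eps=15):
    """Infer vertical table lines from detected cell boxes via 1-D DBSCAN.

    cell_boxes: list of (x1, y1, x2, y2). Left edges of cells in the same
    column should cluster together even if some cells were missed.
    """
    xs = np.array([[b[0]] for b in cell_boxes], dtype=float)  # 1-D feature
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(xs)
    lines = [xs[labels == k].mean() for k in set(labels) if k != -1]
    return sorted(float(x) for x in lines)
```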

3.2.23 BiblIA

BiblIA [42] is a publicly available dataset of medieval manuscripts written in Hebrew and Aramaic covering 6 different scripts: Ashkenazi, Byzantine, Italian, Oriental, Sephardi, and Yemenite. It contains more than 200 images with corresponding baseline- and transcription-level annotations, both focusing on the main text. Furthermore, a segmentation and recognition model based on kraken OCR is used for evaluation. The work presents accuracy (Acc) results on specific scripts (Ashkenazi, Italian, and Sephardi) and on all scripts, together with training details. Further experiments report the character error rate (CER) and word error rate (WER) on images not included in the test set.

3.2.24 HisClima

HisClima [150] is a database of handwritten weather ship logbook pages from 1880 to 1881 that contains both layout annotations for blocks, columns, rows, and lines, and transcription annotations with relevant information such as the number of cells in the tables. The database comprises 208 pages with tables and 211 pages with descriptive text. Baseline experiments were performed for text recognition, line segmentation, and information extraction. A CRNN with CTC loss operating on line images was used for the recognition task, with and without a language model (LM), and evaluated according to the WER and CER. The neural network architecture presented in [142], which performs geometric and logical layout analysis, was used for line segmentation. Finally, for extracting cell position and line geometry information, a segmentation-free information retrieval method for tables based on [97] was used. The two latter tasks were evaluated according to precision, recall, and F1 scores.
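
A minimal CRNN-with-CTC sketch for line-level text recognition of the kind used in such baselines (dimensions and architecture are illustrative, not those of [150]):

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Minimal CRNN for text-line recognition trained with CTC loss."""
    def __init__(self, n_chars, height=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(128 * (height // 4), 256, bidirectional=True,
                           batch_first=True)
        self.fc = nn.Linear(512, n_chars + 1)       # +1 for the CTC blank

    def forward(self, x):                           # x: (B, 1, H, W)
        f = self.cnn(x)                             # (B, C, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # width as time axis
        logits, _ = self.rnn(f)
        return self.fc(logits).log_softmax(-1)      # (B, W/4, n_chars + 1)

# CTC training step (targets are integer-encoded transcriptions):
# log_probs = model(lines).permute(1, 0, 2)         # (T, B, C) for CTCLoss
# loss = nn.CTCLoss(blank=n_chars)(log_probs, targets, in_lens, tgt_lens)
```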

3.3 Content analysis datasets

Content is a fundamental part of a document, as it carries the semantics that make a document perceivable to humans. After categorizing a document and detecting its geometric structure, document understanding follows: the mapping of the layout structure into a logical structure, which is in turn followed by content analysis. This section includes 35 studies related to OCR tasks, whether they target isolated characters, words, lines, digits, or whole document transcriptions, as well as writer identification, reading order, and any type of retrieval, be it writer, image, or word spotting. Two datasets related to handwritten music recognition were also identified [70, 144] but are not covered in detail in this work.

3.3.1 GERMANA database

GERMANA [141] is a database of 764 scanned pages from an 1891 manuscript written in Spanish. The pages contain 21K text lines and 217K words; Catalan, French, Latin, German, and Italian also appear in some parts of the text. The ground truth comprises bounding box annotations for the text blocks, straight baselines for every text line, and line-by-line transcriptions. Although the annotations contain both layout and text information, the baseline experiments were limited to handwriting recognition. The transcription WER per block was reported for a system that combines Hidden Markov Models for text recognition with n-grams for language modeling [179]. The experiments covered the first 180 pages of the database, separated into blocks of 20 pages that were added consecutively. A 37% WER was achieved for the last two blocks, while the error is higher in the first blocks, where more out-of-vocabulary words appear.
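
WER, the metric used throughout these transcription baselines, is the word-level Levenshtein distance normalized by the reference length:

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over word tokens."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```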

3.3.2 RODRIGO database

The RODRIGO database [159] contains data derived from a manuscript written in 1545 in old Castilian by a single writer. The database follows a strategy similar to GERMANA in creation, ground truth, and experimental baseline. It includes 853 page images of single-column text blocks, where each block is annotated with a bounding rectangle and each line within it with the corresponding baseline. The annotations further include transcriptions for every line, resulting in a total of 20,357 text lines and 231K words as ground truth. Baseline results were provided for handwriting recognition using the same model and process as in Sect. 3.3.1, with 20 blocks of 1K lines, achieving a WER of 36.5% on the last block.

3.3.3 IAM-HistDB

IAM-HistDB [54] is a widely used database of handwritten historical manuscript images that contains three datasets: Saint Gall, Parzival, and George Washington (GW). We present these datasets in the following paragraphs.

The Saint Gall database [55] is a set of 60 page images and 1,410 binarized and normalized text-line images from manuscripts written in the ninth century, in the Latin language and Carolingian script, by a single writer. The text edition for every page image is provided. The pages comprise 11,597 words, 4,890 word labels, 5,436 word spellings, and 49 letters. The ground truth includes the line-level text transcriptions and the word and line locations. An HMM-based transcription alignment system was proposed in the paper and compared with three reference systems.

The Parzival database [56] provides handwritten documents from the thirteenth century, written by three writers in Old German and Gothic script. It contains 47 pages, 4,477 text lines, 23,478 words, 4,934 word categories, and 93 letters. As in Saint Gall, the line and word images are binarized and normalized, and the ground truth includes line- and word-level transcriptions. The work presented in [53] applied an HMM-based system similar to [111] and the BLSTM recognizer introduced in [67] to automatic handwriting recognition on Parzival, achieving word accuracies of 88.69% and 93.32%, respectively. Furthermore, in [56], a lexicon-free word spotting method based on character HMMs was proposed and evaluated on the Parzival and GW datasets.

The GW database [56] comprises eighteenth century documents from the George Washington Papers and contains 656 text-line and 4,894 word images, binarized and normalized, along with their transcription annotations. The pages are written in English longhand script by two writers. The dataset statistics also include 1,471 word classes and 82 letters. This dataset is widely used to evaluate word spotting algorithms: [58] compared a proposed word spotting method using a BLSTM and a modified CTC algorithm against an HMM [149] method and a DTW [145] method, reporting average precisions of 0.84 on GW and 0.94 on Parzival for the proposed method.

3.3.4 ESPOSALLES

The ESPOSALLES database [151] is a collection of ancient marriage license documents separated into the LICENSES and INDEX subsets. A single-writer book written in old Catalan forms the main content of the LICENSES set, comprising 173 pages and 1,747 licenses. For every page, the subset includes the main text block bounding box, the coordinates of each text line within the text block, the license label, and the transcription of every line, word, and character in the main block. The INDEX subset includes 29 pages from the initial indexes of two volumes, created by a single writer between 1491 and 1495. Similar to the LICENSES part, INDEX contains text and line layout annotations as well as transcriptions. Both subsets provide dataset splits for cross-validation. Baseline results using Hidden Markov Models (HMM) [180] and a BLSTM [67] with two feature sets, PRHLT [180] and IAM [111], showed the advantage of neural networks as dataset size grows for handwriting recognition. The database was further used in the ICDAR17 Competition on Information Extraction in Historical Handwritten Records [57], where the aim was to detect named entities and assign them to semantic categories (name, surname, occupation, etc.) in two tracks: Basic and Complete, the latter additionally requiring the person category (husband, wife, etc.). The team that obtained the highest average score with word-level segmentation used a ResNet-based unigram system for character recognition and named entity recognition, while with line-level segmentation, the best method used an RNN-LSTM with connectionist temporal classification (CTC).

3.3.5 BH2M

The Barcelona Historical Handwritten Marriages Database (BH2M) [51] consists of 174 handwritten marriage record pages, with 100 pages for training, 34 for validation, and 40 for testing. The pages were written in Old Catalan by a single writer between 1617 and 1619 and are preserved in the Barcelona central archives. The database provides ground truth for layout analysis, text transcription, and semantic analysis. The XML annotation files are organized hierarchically into text blocks, segmented lines, and words for layout analysis. The additional word transcriptions and semantics about the license, appearance order, date, and information about the wife and husband may enable handwritten text recognition, word spotting, information extraction and understanding, and context-aware algorithms. Moreover, baseline results were presented for line segmentation [119] using the DR, RA, and FM metrics, and for segmentation-free [183] and segmentation-based [6] word spotting using the mAP.

3.3.6 HADARA80P

The HADARA80P dataset [128] contains 80 handwritten Arabic pages originating from a single-author book about the taaun (plague) disease and its connections to religion. The XML ground truth files provide the pages, text blocks, word coordinates, and the transcription of every word; in some cases, tag values accompany the words. The total number of labeled words is 16,720. Experiments using a publicly available word spotting application (see footnote 6) are presented, and an extension of the methods used in the application [100, 101], the HADARA word spotter, is proposed. The original methods locate the zones of interest through gradients, while the proposed method employs curvature according to a threshold instead. The resulting mAP, based on the precision measures \(p_{IR}\) and \(\overline{\gamma _{LA}}\) presented in [127], showed that the proposed system outperformed the existing application on the HADARA80P and George Washington datasets.

3.3.7 DocExplore

DocExplore [48] is a pattern spotting dataset that contains 1.5K images with more than 1.4K queries. The images originate from 6 different manuscripts written between the tenth and sixteenth centuries. The annotation process resulted in 1,464 labeled objects belonging to 35 graphical object categories, where one sample per category constitutes the query image and the remaining objects serve as retrieval targets. The dataset was proposed for two tasks: image retrieval and pattern localization. As a baseline for the latter task, a system consisting of offline, online, and post-processing steps, initially presented in [49], was used. In the offline step, the background is removed, a descriptor is used to find the object regions of interest, and a Vector of Locally Aggregated Descriptors (VLAD) is created. In the online step, a similarity distance is computed between the extracted regions and the query image, and ranking is achieved through template matching. The system achieved a mAP of 0.613 for retrieval and 0.111 for localization, with further per-category results reported.

3.3.8 AMADI_LontarSet

AMADI_LontarSet [86] is a collection of palm leaf manuscripts from Bali. It formed part of the ICFHR 2016 Competition on the Analysis of Handwritten Text in Images of Balinese Palm Leaf Manuscripts [20], providing binarized, word-annotated, and isolated-character-annotated ground truth images for the following challenges, respectively: Binarization of Palm Leaf Manuscript Images, Query-by-Example Word Spotting on Palm Leaf Manuscript Images, and Isolated Character Recognition of Balinese Script in Palm Leaf Manuscript Images. For Challenge 1, binarization, the dataset includes 50 training images, 100 binarized ground truth images from two different sources (50 and 50), and 50 test images. The winning team used an FCN pre-trained on handwritten documents, as presented in the work of Wolf et al. [185], fine-tuned first on the DIBCO [60] and H-DIBCO [138] images and then on the competition images. The results were evaluated according to the F-measure (FM), PSNR, and Negative Rate Metric (NRM) between the ground truth and the predicted binarized images. For Challenge 2, word spotting, a split of 130 training and 100 test images was provided along with 15,022 word-annotated patches for training, and 36 word-annotated patches were given as test queries. The goal was to use a query word image patch to retrieve similar word image patches in palm leaf manuscripts; however, there were no submissions for this challenge. Finally, Challenge 3 aimed to recognize isolated Balinese characters distributed over 130 character classes, with 11,710 labeled patch images for training and 7,673 for testing. The method with the highest RA (VMQDF) first pre-processed the input images by resizing, binarizing with the Otsu method, and correcting grayscale variation. Synthetic samples were then generated from the pre-processed samples using the method in [165], gradient features were extracted for all images, and a classifier was trained on the set containing both original and generated images. At test time, 97 synthetic images were generated for every sample and treated in the same manner.

3.3.9 SleukRith

SleukRith [181] is a dataset of 657 images of palm leaf manuscripts written in Khmer, collected from 4 different sources. The dataset includes annotations for isolated character recognition and word and line segmentation. Its most valuable asset is the character-level annotation, which is the foundation on which the other two elements are built: individual character images were constructed by cutting patches around every character and removing the noise of nearby characters using inpainting, and the characters were then combined to determine the words and lines. To evaluate the set for character recognition, the CER of a CNN, 6.04%, was presented. The dataset was further used in the ICFHR 2018 Competition on Document Image Analysis Tasks for Southeast Asian Palm Leaf Manuscripts presented in Sect. 3.3.19, which contained binarization, text-line segmentation, character recognition, and word transliteration tasks; however, this dataset was not part of the binarization task. The winning systems presented in the competition section also performed best on this dataset alone.

3.3.10 VML-HD

VML-HD [85] is a database of Arabic handwritten documents comprising 680 pages from 5 books by different writers, usable for handwriting recognition and word spotting. The annotations, in Hadara XML format, include the book and page number, segment ID, bounding box coordinates for 121,636 sub-words and 244,553 characters, the length of each sub-word, and Arabic and Latin symbol transcriptions. Word spotting results using a radial descriptor [83] and a radial descriptor graph [84] were presented on a subset of every book and on the 5 books combined. The Top-1 to Top-5 DR showed that the radial descriptor graph performs better on the combined set than the radial descriptor.

3.3.11 CFRAMUZ

The CFRAMUZ dataset [11] includes grayscale page images of handwritten novels written in French by Charles Ferdinand Ramuz between 1910 and 1946. Text and XML annotation files contain the unique word ID, coordinates, width and height of the word bounding boxes, the line number of each word, the word number within the line, and the word transcription, for segmentation-free word spotting. The following methods were evaluated according to precision-recall curves: Word Spotting and Recognition with Embedded Attributes (EAWS) [7], Efficient Exemplar Word Spotting (EEWS) [6], Bag-of-Visual-Words Word Spotting (BoVWWS) [153], and Fisher Kernels Word Spotting (FKWS) [130]. The mAP of these algorithms on the introduced dataset was compared with their performance on the George Washington (GW) and Lord Byron (LB) datasets. Although this is a single-writer dataset, some variation in writing style occurs due to the year range; therefore, additional experiments using splits according to style were conducted.

3.3.12 Lontar Sunda

The Lontar Sunda dataset [172] is a collection of fifteenth century Sundanese palm leaf manuscripts from Garut, West Java, Indonesia. It includes 66 pages with corresponding binarization, word-level, and character-level annotations. Lontar Sunda was one of the datasets used in the ICFHR 2018 Competition on Document Image Analysis Tasks for Southeast Asian Palm Leaf Manuscripts [87], which hosted 4 challenges: A. binarization, B. text-line segmentation, C. isolated character/glyph recognition, and D. word transliteration. As the original dataset paper did not include benchmark results, the competition results are considered here. For Challenge A, systems were evaluated according to the FM, peak SNR (PSNR), and Negative Rate Metric (NRM); the best performing system on the Sundanese data used Gaussian operators and a nonlinear function to enhance the images, which were finally segmented with a threshold of 0.9. In Challenge B, evaluated using the DR, RA, and FM, the best (and only) submission used the binarized images from Challenge A and horizontal projection profiles to perform line segmentation. The character recognition challenge (C) was evaluated according to the recognition rate, with the highest value obtained by a dense 100-layer CNN that classified similar characters. Finally, in Challenge D, the best performing system achieved an 8.81% CER on the Sundanese set using a CNN-RNN encoder-decoder architecture with an attention mechanism.

3.3.13 ICDAR 2017 competition on recognition of early Indian printed documents (REID2017)

The REID2017 competition [33], held at ICDAR 2017, includes 26 evaluation images written in Bengali between 1785 and 1909 and an example set of 5 images for training. The competition originally hosted two tasks, Bengali text recognition and the Quarterly Lists challenge (table recognition in English and Bengali); however, there were no submissions for the latter. The organizers provided the image annotations in PAGE XML format [132] created using Aletheia [31]. These annotations included layout region polygons, metadata such as heading, paragraph, caption, and footer labels, and reading order information. The Google Multilingual OCR, which uses the Google Cloud Vision API, achieved the highest flex Character Accuracy (CA) among the submissions; however, the CA of 75.4% suggests that there is plenty of room for improvement. The same system, with a success rate of 78.4%, also outperformed the other systems on the text region segmentation task.
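Character Accuracy is typically derived from the Levenshtein edit distance between predicted and ground-truth text; the "flex" variant additionally tolerates flexible line ordering, which we omit here. A plain version:

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def character_accuracy(gt, pred):
    """CA = (#GT characters - edit distance) / #GT characters."""
    return (len(gt) - levenshtein(gt, pred)) / len(gt)

print(character_accuracy("historical", "histarical"))  # 0.9
```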

3.3.14 ICDAR2017 competition on historical document writer identification (Historical-WI)

The Historical-WI competition [52] focused on image retrieval based on writer identification. The competition offered an evaluation set of 3,600 handwritten document page images ranging from the thirteenth to the twentieth century. The test set originated from the Universitätsbibliothek Basel and included 720 different writers. For training, 1,182 images in color and binary format from 394 writers, disjoint from the test-set writers, were provided. The submitted systems were evaluated using the mAP metric. The system that achieved the highest mAP used feature vectors derived from binarized samples by concatenating histograms of oriented Basic Image Feature (oBIF) columns [63, 122].

3.3.15 Kuzushiji

The full Kuzushiji dataset [30] consists of three parts: Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji. The dataset is derived from eighteenth-century printed books written in cursive Japanese, or Kuzushiji. The K-MNIST subset includes 70K 28×28 grayscale images of 10 Kuzushiji character classes, designed to resemble the MNIST and Fashion-MNIST datasets while being even more challenging. Kuzushiji-49 contains 270,912 images of the same resolution and mode as K-MNIST, covering 49 character classes. Finally, Kuzushiji-Kanji is a subset of 140,426 64×64 grayscale images of 3832 Kanji characters. The latter two subsets are considered quite imbalanced. Benchmark results on K-MNIST and Kuzushiji-49 were presented using a 4-nearest neighbor classifier, a 2-layer CNN, ResNet-18 [73], ResNet-18 with input mixup [191], and ResNet-18 with a manifold mixup regularizer [182]. The performance of these models was compared with their performance on MNIST. All models achieved their highest test accuracy on the MNIST test set, followed by K-MNIST, and finally Kuzushiji-49. The best performing model for the K-MNIST and Kuzushiji-49 test sets was ResNet-18 with manifold mixup, while for MNIST it was the plain ResNet-18. Domain transfer was further explored from Kuzushiji-Kanji to modern Kanji (stroke format). Two variational autoencoders [90, 148] were used to create the old (Kuzushiji) and new (modern) latent space embeddings. A Mixture Density Network [14] then predicted the probability of the new embedding given the old embedding, and finally a Sketch-RNN [69] conditioned on the new latent space generated modern Kanji stroke versions of Kuzushiji characters.
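As an illustration of the mixup family of regularizers used in these baselines, the following PyTorch sketch implements input mixup [191]; manifold mixup applies the same interpolation to hidden representations instead. Batch shapes and the beta parameter are illustrative assumptions.

```python
import torch
import numpy as np

def mixup_batch(x, y, num_classes, alpha=1.0):
    """Input mixup: convex combinations of image pairs and their
    one-hot labels. x: (B, 1, 28, 28) images, y: (B,) class indices."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    y_onehot = torch.nn.functional.one_hot(y, num_classes).float()
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

# Usage on a dummy K-MNIST-shaped batch; training then uses a
# soft-label cross-entropy on (x_mix, y_mix).
x = torch.randn(32, 1, 28, 28)
y = torch.randint(0, 10, (32,))
x_mix, y_mix = mixup_batch(x, y, num_classes=10)
```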

3.3.16 MHDID

MHDID is the Multi-distortion Historical Document Image Database [164] for document quality assessment and distortion classification. It contains 335 images with four degradation types: wormholes, stains, reader annotations, and paper translucency. The document images come from 130 books from the Qatar University Library and are written in Arabic. Users compared pairs of images and selected among three options: "The left image is better," "The images are similar," or "The right image is better." With six outliers removed, the user study results were normalized between 0, the lowest perceptual quality value, and 9, the highest. Finally, the MOS value was computed for every image as the sum of the outcomes of its pair comparisons divided by the number of pairs. A dataset analysis in terms of color and spatial information was further presented to demonstrate the heterogeneity of the dataset. As this database is an outlier among the surveyed datasets, we categorize it as retrieval in Table 1 with a (✓), since it compares pairs of images.
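As an example of the described MOS computation, the sketch below averages the outcomes of an image's pairwise comparisons; the numeric encoding of the three answer options is our assumption, as the paper's exact mapping is not reproduced here:

```python
def mean_opinion_score(outcomes):
    """MOS for one image: sum of its pairwise-comparison outcomes divided
    by the number of pairs it appears in. The encoding below (win / tie /
    loss) is an illustrative assumption."""
    scores = {"better": 1.0, "similar": 0.5, "worse": 0.0}
    return sum(scores[o] for o in outcomes) / len(outcomes)

print(mean_opinion_score(["better", "similar", "worse"]))  # 0.5
```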

3.3.17 Tripitaka Koreana in Han (TKH) and Multiple Tripitaka in Han (MTH)

The work presented in [189] introduces two datasets for Chinese character detection and recognition, the Tripitaka Koreana in Han (TKH) and the Multiple Tripitaka in Han (MTH), created using the publicly available TKH images from the Tripitaka Koreana Institute. For every character bounding box, a character label is also provided. TKH consists of 1K pages, 23,471 lines, 323,491 characters, and 1471 character classes, while MTH contains 500 images, 17,178 lines, 197,886 characters, and 3664 character classes. The two datasets differ in difficulty, as the character sizes in MTH are less uniform, which makes bounding box annotation harder. A three-part pipeline called Recognition Guided Detector (RGD) was proposed: segmentation is first performed for every line, a Recognition Guided Proposal Network (RGPN) generates context information, and a detector then uses that information to find the characters in every line. Several experiments were performed on the two datasets using the proposed system with and without a VGG-16 [170] backbone, and its performance was compared with well-known object detection frameworks using either the whole image or text lines as input. The proposed method performs comparably to other methods while using fewer parameters.

3.3.18 ICFHR 2018 competition on recognition of historical Arabic scientific manuscripts (RASM2018)

The RASM2018 competition [34] was part of ICFHR 2018 and targeted the recognition of historical Arabic scientific manuscripts through three tasks: page segmentation, text-line detection, and OCR. An example set of 15 single-column page images with ground truth in PAGE XML format [132] was provided for training, and 85 images were used for evaluation. As in similar competitions, the ground truth included polygon regions, text transcriptions, and metadata for each region, such as headings, paragraphs, captions, footers, and reading order. For evaluation, the competition used the success rate and error counts for the page and text-line segmentation predictions and the CA for the OCR. For page segmentation, the winning system used an FCN applied to extracted patches. The predicted page layout regions were then cropped for text-line segmentation, which was performed at pixel level using anisotropic Gaussian smoothing. For the remaining tasks, the highest performance was achieved by a Historical Arabic Handwritten/Typewritten OCR framework. This system can handle various fonts and layouts; for the competition, instance-based segmentation was performed on extracted lines.

3.3.19 ICFHR 2018 competition on document image analysis tasks for Southeast Asian palm leaf manuscripts

The Southeast Asian palm leaf manuscripts competition [87] offered four tasks: binarization, text-line segmentation, isolated character/glyph recognition, and word transliteration, over manuscripts in three languages: Balinese, Khmer, and Sundanese. For these language sets, the Amadi_Lontarset dataset presented in Sect. 3.3.8, the SleukRith dataset presented in Sect. 3.3.9, and the Lontar Sunda dataset presented in Sect. 3.3.12, respectively, were utilized. The results for each separate language set are given in the corresponding dataset sections; here, we present the overall results of the competition. The system that obtained the highest FM and PSNR values for binarization used Gaussian operators and a nonlinear function to enhance the images, which were then segmented with a threshold of 0.9. The same system achieved an NRM of 0.17, only 0.01 behind the best value. In Challenge B, text-line segmentation, the best and only DR, RA, and FM values were achieved by a system that used the binarized images from Challenge A and a horizontal projection profile to perform line segmentation. Challenge C, character recognition, was evaluated according to the recognition rate, and the highest value was obtained by a dense 100-layer CNN architecture that classifies similar characters. Finally, in Challenge D, the best performing system achieved a 5.62% CER on the mixed sets using a CNN-RNN encoder–decoder architecture with an attention mechanism.

3.3.20 ARDIS

Arkiv Digital Sweden (ARDIS) [94] is a handwritten digit dataset collection derived from historical church records. ARDIS contains four different datasets: Dataset I, which contains 10K 4-digit string images representing years; Dataset II, which contains single digits of classes 0–9; Dataset III, which is the same as Dataset II but cleansed of noise; and Dataset IV, which is the same as Dataset III but in grayscale, similar to the widely used MNIST database [98]. Datasets II–IV contain 7,600 digit images each. Several experiments using CNN, SVM, HOG+SVM, k-NN, random forest, and RNN classifiers were presented. Training on the MNIST and USPS [77] datasets and testing on ARDIS reveals the diversity of ARDIS: the highest RA obtained reaches 58.80%, whereas training and testing on ARDIS yields 98.6%. In all experimental cases, the best performance was achieved by the CNN digit classifier.

3.3.21 OBC306

OBC306 [76] is a dataset of 309,551 images for oracle bone character recognition distributed across 306 character classes. It consists of patch samples extracted from full-image publications of oracle bones from different sources. For patch extraction, an oracle bone character list and dictionary were used to retrieve all characters from the source images and assign them to a class and a specific encoding. The main challenges of the dataset are class imbalance and the numerous variants of each character. Evaluation results were presented for widely used CNN architectures [72, 92, 170, 173] and a classical method of HOG descriptors with an SVM [41]; Inception-v4 [173] achieved the best performance. Although the characters are hand-carved, we characterize the dataset as handwritten in Table 1 for homogeneity reasons.
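As an illustration of the classical baseline, the following sketch pairs HOG descriptors with a linear SVM using scikit-image and scikit-learn; the patch size, HOG parameters, and stand-in data are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(images):
    """images: array of (H, W) grayscale character patches of equal size."""
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8),
            cells_per_block=(2, 2), block_norm="L2-Hys")
        for img in images
    ])

# Stand-in data in place of OBC306 patches and labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((40, 64, 64)), rng.integers(0, 4, 40)
X_test, y_test = rng.random((10, 64, 64)), rng.integers(0, 4, 10)

clf = LinearSVC(C=1.0)
clf.fit(hog_features(X_train), y_train)
accuracy = clf.score(hog_features(X_test), y_test)
```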

3.3.22 GRK-Papyri & PapyRow

The GRK-Papyri dataset [117] provides 50 handwritten Greek papyrus images from the sixth century A.D. for writer identification. It includes color and grayscale documents with 4–7 samples each from 10 different writers. The dataset offers a leave-one-out option that contains all images without any split, as well as a train-test split with a balanced training set of 20 samples, with the remaining images, in varying numbers per writer, used for testing. The dataset is highly complex, as the images are heavily degraded and of low quality, making pre- and post-processing inevitable. Due to the small dataset size, the method used for evaluation was a normalized local Naïve Bayes nearest-neighbor (NBNN) classifier with FAST keypoints [116]. The authors suggest that the dataset can further be used for image processing tasks or line/word segmentation. An extension of GRK-Papyri is presented in [29], where enhancement techniques, such as background smoothing, line resizing, and image rotation, were used to obtain images with less degradation. In this extended version, named PapyRow, 6,498 images were obtained using a row segmentation method and released with their corresponding XML ground truth.

3.3.23 CASIA-AHCDB

CASIA-AHCDB [188] is a database of 11,937 handwritten Chinese document pages. For the task of character recognition, it provides 2.2M handwritten characters belonging to 10,350 different classes, distributed across two datasets: (a) the Complete Library in Four Sections (AHCDB-style1) and (b) the Ancient Buddhist Scriptures (AHCDB-style2). Each dataset is split into a Basic Category Set (BC) for basic character recognition, an Enhanced Category Set (EC) for open-set character recognition, and a Reserved Category Set (RC) for other recognition purposes. To benchmark the database, a CNN [192] and a Convolutional Prototype Network (CPN) [190] were used, with experiments performed on the Basic Category Set alone and on the combination of the Basic and Enhanced Category Sets for every dataset. Moreover, transferring information from the style1 to the style2 dataset was attempted with direct train-test transfer and with fine-tuning, of which fine-tuning performed better.

3.3.24 Amharic text image recognition database

The Amharic database [12] is a collection of 40,929 printed text-line images originating from pages of different documents written in Amharic, plus 296,403 synthetic text-line images created using OCRopus [16]. The synthetic images were generated with the Power Geez and Visual Geez fonts. For text-line recognition, a bidirectional Long Short-Term Memory (LSTM) network was proposed, followed by a softmax layer producing 281 probability values, corresponding to the number of unique characters in the database, and a CTC output layer. This method achieved an 8.54% CER on the printed Power Geez documents, a 2.28% CER on the Visual Geez synthetic images, and a 4.24% CER on the Power Geez synthetic images.
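A minimal PyTorch sketch of such a BLSTM-CTC recognizer is shown below; the feature dimensionality, number of layers, and the handling of the CTC blank are illustrative assumptions rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

class BLSTMCTC(nn.Module):
    """Sketch of a BLSTM text-line recognizer with a CTC head."""
    def __init__(self, feat_dim=48, hidden=128, num_chars=281):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_chars + 1)  # +1 for the CTC blank

    def forward(self, x):                    # x: (B, T, feat_dim) columns
        h, _ = self.blstm(x)
        return self.fc(h).log_softmax(-1)    # (B, T, num_chars + 1)

model = BLSTMCTC()
ctc = nn.CTCLoss(blank=281)                  # blank is the last class index
x = torch.randn(2, 100, 48)                  # dummy line features
targets = torch.randint(0, 281, (2, 20))     # dummy character labels
loss = ctc(model(x).permute(1, 0, 2),        # CTCLoss expects (T, B, C)
           targets,
           input_lengths=torch.full((2,), 100, dtype=torch.long),
           target_lengths=torch.full((2,), 20, dtype=torch.long))
```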

3.3.25 ICDAR 2019 historical document reading challenge on large structured Chinese family records (ICDAR19 HDRC-Chinese)

This competition [155] presented a database of approximately 10K historical Chinese family record pages to evaluate systems for the tasks of (1) text recognition on extracted lines, (2) pixel-level layout analysis, and (3) text-line detection and recognition. More specifically, the training set includes 11,715 pages derived from 37 different books along with their PAGE XML [132] and pixel-wise annotations, while the test set includes 1,135 images from 12 books. To evaluate the submitted systems, Task 1 uses the edit distance (editDistance), Task 2 uses the mIoU, and Task 3 uses the total counted errors (totalErrors) of the output XML file. The team that achieved the best results for all tasks used a Convolutional Recurrent Neural Network (CRNN) [167] to recognize Chinese text, a Cascade R-CNN [21] to detect text lines, and a U-Net-shaped network for the pixel-wise classification. For Task 2, the system that outperformed the others achieved a 99.96% IoU for the background class and 99.24% for the text class.
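For reference, the mIoU used in Task 2 can be computed from a pixel-level confusion matrix; a generic two-class (background/text) version follows:

```python
import numpy as np

def mean_iou(pred, gt, num_classes=2):
    """mIoU from a pixel-level confusion matrix, e.g. background vs text."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)   # rows: GT, cols: pred
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    ious = inter / np.maximum(union, 1)
    return ious.mean(), ious

gt = np.array([[0, 0, 1], [0, 1, 1]])
pred = np.array([[0, 1, 1], [0, 1, 1]])
miou, per_class = mean_iou(pred, gt)   # per-class IoUs and their mean
```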

3.3.26 ICDAR 2019 competition on recognition of early Indian printed documents (REID2019)

The REID2019 competition [35] is an extension of the REID2017 competition (Sect. 3.3.13). It provided 25 labeled images with the same annotation format and content as the previous competition and a balanced test set of 56 images written in English and Bengali. The competition hosted two tasks, layout analysis and text recognition, with a primary focus on Bengali text recognition. Again, the Google Multilingual OCR achieved the highest flex CA in text recognition and the highest success rate in text region segmentation. The results of this competition were slightly better than those of 2017 but remain quite low, and the organizers suggested focusing on pre-processing for better performance.

3.3.27 DIDA

The Digit Dataset DIDA [95] is an extension of the previously mentioned ARDIS digit dataset. DIDA is composed of three datasets: Dataset I, with 250K single-digit color images of 10 classes (0–9); Dataset II, with 200K multi-digit year string samples; and Dataset III, with 25K digits with bounding boxes intended for object detection. A digit detection and recognition system named DIGITNET was proposed, which first detects handwritten digits and then passes the output to a recognition network for classification. The system was evaluated on DIDA against classical methods [23, 64, 115] and network architectures such as YOLOv3 and YOLOv3-tiny [146]. Similar to ARDIS, several experiments with different combinations of datasets were performed, and state-of-the-art results in digit detection were achieved.

3.3.28 ScribbleLens

In [46], a corpus for automatic manuscript transcription was presented. This dataset contains 1K pages from early modern Dutch manuscripts spanning over 150 years, with line, character, year, and writer ground truth. It further provides a set of unlabeled images for investigating unsupervised or weakly supervised learning. As a baseline, a network combining convolutional layers and bidirectional LSTMs with a CTC loss (CNN/BLSTM/CTC) [125, 184] was used, and it was shown that the CER decreases as additional annotated data become available.

3.3.29 ICDAR 2019 competition on recognition of historical Arabic scientific manuscripts (RASM2019)

The next RASM competition after the one presented in Sect. 3.3.18 is ICDAR RASM2019, which focused on the recognition of archival Arabic scientific manuscripts. This edition offered 20 training images with PAGE XML annotations [132] and 100 test images for evaluation. The ground truth has the same format and content as in the previous competition and covers three tasks: text block detection, text-line detection/segmentation, and text recognition. Although no competition paper was published, a graph with the results of the submitted systems was provided for every task. A Google submission shows the highest success rate for the first task, and an RDI system for the second. For text recognition on normalized text, a 77.58% flex CA was again achieved by the RDI system. We suspect that the winning RDI systems are the same as those in the previous round of the competition; however, this is not clear from the competition website.

3.3.30 ICDAR 2019 competition on image retrieval for historical handwritten documents (ICDAR19-HDRC-IR)

This competition [28] followed the competition described in Sect. 3.3.14 and handled the task of image retrieval by writer style, providing a larger test set of 20K images from over 10K different writers. For training, the competition proposed the dataset of the previous competition, enlarged with images from Letters A and Manuscripts. As before, the mAP constituted the evaluation metric. The winning system obtained a 92.5% mAP using SIFT [107] and Pathlet [96] features projected into a lower-dimensional space with an SVD computed on the ICDAR17 Historical-WI feature matrices, then concatenated and normalized into global descriptors compared using the Euclidean distance.

3.3.31 Handwritten text recognition (HTR) benchmarks

The work published in [177] presents four benchmarks for historical document HTR and achieves state-of-the-art results for four different competitions: ICFHR-2014 [174], ICDAR-2015 [175], ICFHR-2016 [156], and ICDAR-2017 [176]. The ICFHR-2014 dataset is a subset of the Bentham Papers [22] containing 433 images with line detection and recognition ground truth in PAGE XML format. The ICDAR-2015 competition also contains Bentham page images, but with a more difficult layout than ICFHR-2014; its subsets include line images aligned with their line transcriptions, as well as images with page-level transcriptions but no alignment. The ICFHR-2016 dataset includes 450 single-block page images from the German Ratsprotokolle collection, containing approximately 10K lines and 43K running words, with ground truth at line level. These three competitions include a Restricted and an Unrestricted track. Finally, the ICDAR-2017 competition provides 10,172 images distributed across two training and two test subsets; the data come from the Alfred Escher Letter Collection (AEC) and other German collections and present heterogeneous writing styles. This competition includes a Traditional challenge for simple transcription and an Advanced challenge that adds a line detection pre-step. For benchmarking, a CRNN with four convolutional and three recurrent layers is used for character optical modeling, enhanced with N-gram language models applied to the output character probabilities. With this enhancement, the work achieves the lowest CER and WER in all cases.

3.3.32 ICFHR 2020 competition on image retrieval for historical handwritten fragments (HisFragIR20)

Another competition similar to ICDAR17 Historical-WI (Sect. 3.3.14) and ICDAR2019-HDRC-IR (Sect. 3.3.30) is HisFragIR20 [162]. This competition further increased the dataset size by generating 120K fragments, randomly shaped and rectangular, from 20K documents by 9.8K writers. The test data come from European Middle Age books (ninth to fifteenth century CE), while the training set consists of fragments extracted from the ICDAR2019-HDRC-IR test set. The competition evaluated two tasks, retrieval per writer and retrieval per image, using mAP, accuracy, Pr@10, and Pr@100 as metrics. For the writer task, the best system in terms of mAP used a 20-layer ResNet [72] trained on SIFT keypoints with multi-VLAD encoding, PCA for descriptor dimensionality reduction, k-means clustering on the descriptors, and cosine similarity for the final ranking; the whole process was based on the work presented in [27]. The system that achieved the highest accuracy, Pr@10, and Pr@100 values used a ResNet50 [72] feature extractor taking whole fragment images as input and the χ² distance; this system also obtained the highest values in all retrieval-per-image metrics.
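As an illustration of the retrieval step of the per-image winner, the sketch below ranks gallery fragments by the χ² distance between nonnegative feature vectors; the feature dimensionality and the stand-in data are assumptions:

```python
import numpy as np

def chi2_distance(a, b, eps=1e-10):
    """Symmetric chi-square distance between nonnegative feature vectors."""
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps), axis=-1)

def rank_fragments(query, gallery):
    """Return gallery indices sorted from most to least similar."""
    dists = chi2_distance(query[None, :], gallery)
    return np.argsort(dists)

# Stand-in descriptors in place of ResNet50 fragment features.
rng = np.random.default_rng(0)
gallery = np.abs(rng.standard_normal((100, 2048)))
query = np.abs(rng.standard_normal(2048))
ranking = rank_fragments(query, gallery)
```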

3.3.33 Digital Peter

Digital Peter [110] is a dataset for handwriting recognition comprising 9694 images and their corresponding texts from manuscripts written by Peter the Great between 1709 and 1713. It provides a split of 6237 training, 1930 validation, and 1527 test images, usable for either line segmentation or line recognition. A competition on text-line recognition was launched using this dataset. As a baseline, a 7-layer CNN was used for image feature extraction, followed by a bidirectional GRU network with a CTC loss [66] to predict the image text. The model was further optimized using different hyperparameter values and beam search. The task was evaluated according to the CER, the WER, and the string accuracy.

3.3.34 Hugin-Munin

The Hugin-Munin dataset [108] is the first handwritten text recognition dataset for Norwegian. It contains images from diaries and private correspondence written between 1820 and 1950 by 12 different writers. The ground truth includes transcriptions of 164,922 words, or 23,732 lines, in PAGE XML format. The authors provide an 80% training, 10% validation, 10% test random split, as well as a second split with 3 unseen writers in the test set. They further present a survey of open-source handwritten text recognition libraries used since 2019 and compare their performance on the random split using the CER and WER. The lowest CER was obtained using PyLaia [140], while the lowest WER was obtained using Kaldi [9]. When deployed on the writer split, these best methods achieved considerably lower performance than on the random split.

3.3.35 POPP

The POPP dataset [39] contains lines extracted from the 1926 Paris census tables and consists of three sub-datasets: the "Generic dataset," the "Belleville," and the "Chaussée d'Antin." The Generic dataset contains 80 double-page images, one for every Paris district, each from a different writer, and 4800 lines divided into 3840 training, 480 validation, and 480 test lines. The Belleville dataset contains 49 pages and 1470 lines from the Belleville district, written by a single writer. The Chaussée d'Antin is a 10-writer set of 780 lines and 26 pages from the Chaussée d'Antin census. POPP includes grayscale images, their corresponding line bounding boxes in XML files, and the line labels in JSON format. The work presents line recognition results, in terms of CER and WER, for each of the three datasets using an end-to-end hybrid attention network [40]. Finally, the paper presents a complete pipeline comprising pre-processing, handwriting recognition, and domain knowledge integration, which extracted a large amount of information from the Paris census to be used as an annotated dataset and improved the CER through self-training.

4 Observations and discussion

Several datasets exist for the tasks in the three categories presented in this paper: document classification, document structure, and content analysis. They vary in languages, tasks, and sizes; however, large-scale datasets that can address various tasks and be used by the community for pre-training or transfer learning have yet to appear. A variety of evaluation methods are also used for benchmarking. As a result, it is difficult to compare datasets and techniques directly, as there is no universal evaluation scheme for measuring the performance of systems across datasets.

Classification of objects at the page level is highly represented in document structure tasks; nevertheless, a document could also be considered as a whole manuscript collection. We found six studies related to document classification, which shows that this task is rarely addressed. Only one dataset offers more than 35K pages, and the overall amount of data is relatively low. The main focus of the datasets is on Latin scripts, while others, such as Arabic, are also represented; some scripts, such as Hebrew or Greek, are rarely represented. We also noted additional metadata in several datasets that we categorized under document structure and content analysis, but this metadata is not used or included in the benchmarks.

Considering the document structure studies, we found only two datasets containing more than 10K images. Again, there is a significant focus on Latin scripts; however, more languages are covered than in document classification, as this task is much more represented. A noticeable issue here is comparison across databases, as a variety of evaluation measures and benchmarks are used. We propose harmonizing the evaluation metrics using mIoU and mAP (at IoU thresholds of 50%, 60%, 70%, and 80%). In terms of annotation format, most datasets use PAGE XML, three datasets use the COCO format, and only one uses the VOC format. It would be beneficial to establish conversions between annotation formats, as sketched below, to promote computer vision models for historical document analysis.
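As a starting point, the hedged sketch below converts PAGE XML TextRegion polygons into COCO-style bounding-box annotations; it reads only the Coords points and, for brevity, ignores region types and reading order:

```python
import xml.etree.ElementTree as ET

def page_to_coco(page_xml_path, image_id=0, category_id=1):
    """Sketch: PAGE XML TextRegion polygons -> COCO-style annotations.
    Assumes integer point pairs in Coords/@points, per the PAGE spec."""
    annotations = []
    root = ET.parse(page_xml_path).getroot()
    for region in root.iter():
        if not region.tag.endswith("TextRegion"):   # namespace-agnostic match
            continue
        coords = next((c for c in region if c.tag.endswith("Coords")), None)
        if coords is None:
            continue
        pts = [tuple(map(int, p.split(",")))
               for p in coords.attrib["points"].split()]
        xs, ys = zip(*pts)
        x, y = min(xs), min(ys)
        annotations.append({
            "id": len(annotations),
            "image_id": image_id,
            "category_id": category_id,
            "bbox": [x, y, max(xs) - x, max(ys) - y],   # COCO [x, y, w, h]
            "segmentation": [[v for p in pts for v in p]],
        })
    return annotations
```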

Content analysis has the most prominent representation among the datasets. Approximately 30% of the semantics-related studies include more than 1K images; the majority of these concern isolated character recognition, arguably the easiest type of sample to obtain and manage in a database. In general, there is an emphasis on OCR, although the level of detail differs (character, word, or line). Retrieval focuses on text at the word, image, or writer level. Likewise, there is a focus on Latin scripts, but Asian scripts are also well represented, with Arabic scripts represented the least. Finally, there is growing interest in paleography, but digits and tables remain underrepresented as content.

5 Conclusion

We presented a survey of historical document image datasets following a systematic literature review methodology. We summarized 65 studies and clustered them according to the general tasks we defined. We list the datasets in a table, connect them to their corresponding sections, and mark the tasks they support. For every study, we tabulate detailed information about the statistics, tasks, document type, languages, input image visual aspects, annotations, and benchmark and quantitative performance information. In this way, we help researchers find the most suitable datasets and facilitate historical document image analysis.

Table 2 Historical document image datasets with information about statistics, classes, tasks, language, document type, input visual aspects, ground truth, and benchmarks present in their original papers or competitions

Our findings reveal a focus on Latin scripts and a multitude of evaluation methods, but limited alignment with deep learning trends. A clear limitation in dataset size is also evident. As future directions, we stress the need for large-scale datasets that enable state-of-the-art deep learning methods, the inclusion of more classification tasks using metadata information, and the harmonization of evaluation schemes for direct comparison across datasets.