Here is an example of the dataset's file structure for the discipline `math.GM`:
math.GM
├── 0906.1099
│ ├── layout_annotation.json
│ ├── order_annotation.json
│ ├── page_xxxx.jpg
│ ├── quality_report.json
│ └── reading_annotation.json
└── 2103.02443
  ├── layout_annotation.json
  ├── order_annotation.json
  ├── page_xxxx.jpg
  ├── quality_report.json
  └── reading_annotation.json

Each paper folder, for example `math.GM/2103.02443`, contains five parts:

- `page_xxxx.jpg`: an image of one page of the paper, with the page index contained in the filename. Note that this index may differ from the page numbering of the original paper.
- `layout_annotation.json`: the layout annotation of each page, in COCO format.
- `reading_annotation.json`: the LaTeX source code for each block (except `Figure`). Note that the LaTeX source code may contain macros.
- `order_annotation.json`: the relationships between different blocks, in triple format.
- `quality_report.json`: the quality metrics computed for each page and for the whole paper, for further use.
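As a rough illustration of the folder layout described above, the following sketch collects the page images and parses the four JSON files of one paper folder. The function name `load_paper` and the returned dictionary layout are hypothetical, not part of the dataset tooling, and the sketch assumes that `xxxx` in `page_xxxx.jpg` is a zero-padded index so that sorting filenames gives page order.

```python
import json
from pathlib import Path

# File names taken from the folder listing; load_paper itself is a hypothetical helper.
ANNOTATION_FILES = (
    "layout_annotation.json",
    "order_annotation.json",
    "quality_report.json",
    "reading_annotation.json",
)


def load_paper(paper_dir):
    """Collect the page images and parsed JSON files of one paper folder."""
    paper_dir = Path(paper_dir)
    # Assumption: zero-padded page indices, so lexical order equals page order.
    paper = {"pages": sorted(paper_dir.glob("page_*.jpg"))}
    for name in ANNOTATION_FILES:
        path = paper_dir / name
        # Tolerate a missing file instead of failing on an incomplete folder.
        if path.exists():
            paper[name] = json.loads(path.read_text(encoding="utf-8"))
        else:
            paper[name] = None
    return paper
```

A call such as `load_paper("math.GM/2103.02443")` would then return the sorted page paths under `"pages"` and one entry per JSON file.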
| Index | Category | Notes |
|---|---|---|
| 0 | Algorithm | |
| 1 | Caption | Titles of Images, Tables, and Algorithms |
| 2 | Equation | |
| 3 | Figure | |
| 4 | Footnote | |
| 5 | List | |
| 7 | Table | |
| 8 | Text | |
| 9 | Text-EQ | Text block with inline equations |
| 10 | Title | Section titles |
| 12 | PaperTitle | |
| 13 | Code | |
| 14 | Abstract | |
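The category table can be mirrored as a lookup when reading the layout annotations. The sketch below assumes that each COCO-style annotation carries a `category_id` equal to the index in the table; that correspondence is an assumption, and the helper names are hypothetical.

```python
from collections import Counter

# Index-to-category mapping copied from the table; unlisted indices are left out.
CATEGORY_NAMES = {
    0: "Algorithm", 1: "Caption", 2: "Equation", 3: "Figure", 4: "Footnote",
    5: "List", 7: "Table", 8: "Text", 9: "Text-EQ", 10: "Title",
    12: "PaperTitle", 13: "Code", 14: "Abstract",
}


def category_name(index):
    """Return the category name for an index, or a placeholder if unlisted."""
    return CATEGORY_NAMES.get(index, f"unknown({index})")


def count_categories(annotations):
    """Count blocks per category name.

    Assumption: each annotation is a dict with a COCO-style "category_id".
    """
    return dict(Counter(category_name(a["category_id"]) for a in annotations))
```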
- The IoU of the bounding boxes is too large. This happens when the paper template is too complex.
- The category of a bounding box is incorrect. This happens when user-defined macros are used. For example, some authors define `\newcommand{\beq}{\begin{equation}}` and `\newcommand{\eeq}{\end{equation}}`; in this case, the equation may be detected as the `Text` class.
- A bounding box is missing. This happens when rare packages are used, since some rare packages may not be identified by our rule-based methods.
- The bounding boxes are correct but overlap slightly with adjacent bounding boxes. This happens because of layout adjustments, for example `\vspace` and `\input` commands.
| Category | Description | Example |
|---|---|---|
| identical | two blocks corresponding to the same latex code chunk | paragraphs that cross columns or pages |
| peer | both blocks belong to Title | \section{introduction}, \section{method} |
| sub | one block is a child of another block logically | \section{introduction} and the first paragraph in Introduction section |
| adj | two adjacent Text blocks | Paragraph1 and Paragraph2 |
| explicit-cite | one block cites another block with ref | As shown in \ref{Fig: 5}. |
| implicit-cite | The caption block and the corresponding float environment | \begin{table}\caption{A}\begin{tabular}B\end{tabular}\end{table}, then A implicit-cite B |
Each `reading_annotation.json` contains two fields:

- `annotations`: the block information for each block. The `block_id` of each block is used to represent the relationships.
- `orders`: a list of triples. Each triple has the following fields:
  - `type`: the category of the relationship; see the table above for details.
  - `from`: the `block_id` of the starting block of the relationship.
  - `to`: the `block_id` of the ending block of the relationship.

Known issues:

- The `reading_annotation.json` file of some papers may not contain the `annotations` field, for unknown reasons.
- `reading_annotation.json` does not contain the `implicit-cite` relationship; the `implicit-cite` relationship is used in the test dataset, for efficiency considerations.
- `explicit-cite` only supports `Equation`; support for `Table` and `Figure` was developed after the training dataset was completed.
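To make the triple format concrete, here is a minimal sketch that indexes the `orders` triples by relationship type. It assumes each triple is stored as a dictionary with the keys `type`, `from`, and `to`; the helper `group_orders` and the sample data are hypothetical and not taken from the dataset.

```python
from collections import defaultdict


def group_orders(orders):
    """Index relationship triples as {type: {from_id: [to_id, ...]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for triple in orders:
        grouped[triple["type"]][triple["from"]].append(triple["to"])
    # Convert to plain dicts so lookups of absent keys do not create entries.
    return {rel: dict(edges) for rel, edges in grouped.items()}


# Made-up sample in the assumed shape; real files hold many more triples.
sample = {
    "orders": [
        {"type": "adj", "from": 3, "to": 4},
        {"type": "adj", "from": 4, "to": 5},
        {"type": "sub", "from": 1, "to": 3},
    ],
}

# .get() guards against a missing field, as noted in the known issues.
grouped = group_orders(sample.get("orders", []))
print(grouped["adj"])  # {3: [4], 4: [5]}
```

Grouping by type first keeps later traversals simple, for example following `adj` edges to recover the reading order of consecutive text blocks.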
`quality_report.json` contains the results of the rule-based quality check, for further use. Its fields are as follows:

- `num_pages`: the number of pages of the paper.
- `num_columns`: 1 (single column) or 2 (two columns), determined from the last page of the paper.
- `category_quality`: for each category, we record the number of rendered LaTeX code chunks, `reading_count`, and the number of detected bounding boxes, `geometry_count`; `missing_rate` is then computed as `(reading_count - geometry_count)/reading_count`. Finally, the `Total` category summarizes all other categories.
- `page_quality`: the IoU information of each page and of the whole paper:
  - `page`: the page index.
  - `num_blocks`: the number of bounding boxes on the page.
  - `area`: the sum of the areas of all blocks, $\sum_i \text{area}(\text{bbox}_i)$.
  - `overlap`: the sum of the pairwise intersection areas of all blocks, $\sum_i\sum_{j>i} \text{area}(\text{bbox}_i\cap \text{bbox}_j)$.
  - `ratio`: the ratio between `overlap` and `area`. Note that this ratio may be very large if there is a template issue.
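The quantities above can be reproduced with a few lines of code. The following is a sketch under stated assumptions, not the script used to produce the reports: boxes are assumed to be COCO-style `[x, y, width, height]` lists, the function names are hypothetical, and returning `0.0` when a denominator is zero is a choice made here for illustration.

```python
from itertools import combinations


def missing_rate(reading_count, geometry_count):
    """(reading_count - geometry_count) / reading_count, as defined above."""
    if reading_count == 0:
        return 0.0  # Assumption: treat an empty category as having nothing missing.
    return (reading_count - geometry_count) / reading_count


def box_area(box):
    """Area of an [x, y, width, height] box."""
    _, _, w, h = box
    return max(w, 0) * max(h, 0)


def intersection_area(a, b):
    """Intersection area of two [x, y, width, height] boxes (0 if disjoint)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(w, 0) * max(h, 0)


def page_overlap(boxes):
    """Compute num_blocks, area, overlap, and ratio for one page."""
    area = sum(box_area(b) for b in boxes)
    # combinations(boxes, 2) visits each unordered pair once, matching j > i.
    overlap = sum(intersection_area(a, b) for a, b in combinations(boxes, 2))
    ratio = overlap / area if area else 0.0
    return {"num_blocks": len(boxes), "area": area, "overlap": overlap, "ratio": ratio}
```

For instance, two 10 by 10 boxes offset by 5 in both directions share a 5 by 5 region, so `page_overlap([[0, 0, 10, 10], [5, 5, 10, 10]])` gives an `area` of 200, an `overlap` of 25, and a `ratio` of 0.125.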