@JaredforReal I investigated this further from the GLM-OCR SDK side and found that the core model/layout support is not really the main question anymore. The practical maintainer decision now seems to be the output contract.
I was able to get the SDK to preserve `number` regions from PP-DocLayoutV3, OCR those regions, and derive structured printed page metadata such as:
- `page_number_candidates`
- `document_page_numbering`
- `page_metadata`
For example, the extra metadata layer can look like:
{
  "page_number_candidates": [
    {
      "page_index": 1,
      "label": "number",
      "content": "22",
      "layout_index": 0,
      "bbox_2d": [93, 26, 120, 41],
      "layout_score": 0.77,
      "numeric_like": true,
      "roman_like": false
    }
  ],
  "document_page_numbering": {
    "strategy": "visual_sequence",
    "confidence": 1.0,
    "sequence_type": "arabic",
    "page_offset": 21,
    "candidate_pages": 4
  },
  "page_metadata": [
    {
      "page_index": 1,
      "printed_page_label": "22",
      "printed_page_block_index": 0,
      "printed_page_bbox_2d": [93, 26, 120, 41],
      "printed_page_confidence": 0.77
    }
  ]
}
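To make the `document_page_numbering` fields concrete, here is a minimal sketch of how a `visual_sequence` style inference could work. It is not the actual implementation: the function name `infer_page_numbering`, the majority-vote rule, and the sample candidates are assumptions for illustration only. Only the field names come from the example above.

```python
from collections import Counter


def infer_page_numbering(candidates):
    """Vote on (printed number - page_index) across Arabic-numeral candidates.

    Hypothetical helper; the voting rule is an assumption, not the SDK's logic.
    """
    votes = Counter()
    pages = set()
    for cand in candidates:
        text = cand["content"].strip()
        # Skip anything that is not a plain decimal number (e.g. roman numerals).
        if not cand.get("numeric_like") or not text.isdecimal():
            continue
        votes[int(text) - cand["page_index"]] += 1
        pages.add(cand["page_index"])
    if not votes:
        return None
    offset, count = votes.most_common(1)[0]
    return {
        "strategy": "visual_sequence",
        # Share of usable candidates that agree with the winning offset.
        "confidence": count / sum(votes.values()),
        "sequence_type": "arabic",
        "page_offset": offset,
        "candidate_pages": len(pages),
    }


# Hypothetical candidates: file pages 1-4 carrying printed numbers 22-25.
sample = [
    {"page_index": i, "content": str(i + 21), "numeric_like": True}
    for i in range(1, 5)
]
numbering = infer_page_numbering(sample)
```

With these made-up candidates, every page agrees on an offset of 21 (printed "22" on `page_index` 1), which is the relationship the example metadata expresses.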
The main open design question is now:
Would it be acceptable for the saved `paper.json` output to become a top-level wrapped object so that it can carry this metadata, or would that be considered too breaking for downstream SDK users?
Concretely, the current output shape is effectively:
[
  [...page 0 blocks...],
  [...page 1 blocks...]
]
A metadata-friendly wrapped shape would be:
{
  "json_result": [...],              // the existing OCR/layout block output, grouped by file page index
  "page_number_candidates": [...],   // raw `number` region evidence found on pages, with OCRed content and bbox/score info
  "document_page_numbering": {...},  // document-level inference, e.g. sequence type and inferred page offset
  "page_metadata": [...]             // selected per-page printed page labels derived from the candidates
}
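To gauge how breaking the wrapped shape would be, here is a rough sketch of a downstream reader that tolerates both shapes. The helper name `split_paper_json` is hypothetical and not part of the SDK; it only assumes the legacy top-level list and the wrapped object with a `json_result` key as described above. (The `//` comments in the shape above are annotations; a real file would omit them, since JSON has no comment syntax.)

```python
import json

# Optional metadata keys from the proposed wrapped shape.
METADATA_KEYS = (
    "page_number_candidates",
    "document_page_numbering",
    "page_metadata",
)


def split_paper_json(text):
    """Return (pages, metadata) for either the legacy list or the wrapped object."""
    data = json.loads(text)
    if isinstance(data, list):
        # Legacy shape: a list of per-page block lists, no metadata.
        return data, {}
    if isinstance(data, dict) and "json_result" in data:
        metadata = {key: data[key] for key in METADATA_KEYS if key in data}
        return data["json_result"], metadata
    raise ValueError("unrecognized paper.json shape")


# Made-up inputs for illustration.
legacy_text = '[[{"label": "text"}], []]'
wrapped_text = json.dumps({
    "json_result": [[{"label": "text"}], []],
    "page_metadata": [{"page_index": 1, "printed_page_label": "22"}],
})

legacy_pages, legacy_meta = split_paper_json(legacy_text)
wrapped_pages, wrapped_meta = split_paper_json(wrapped_text)
```

A consumer written this way would keep working across the change, but existing consumers that index the top-level list directly would not, which is the compatibility concern here.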
So I would like maintainer guidance on which direction is acceptable:
- Always-wrapped `paper.json`
  - cleanest place to store additional metadata
  - but changes the saved output contract for downstream users
- Keep the current `paper.json` shape unchanged and save printed-page metadata separately
  - avoids breaking downstream consumers
  - but adds another artifact / sidecar JSON
- Keep the feature disabled by default
  - and only emit the wrapped structure when printed-page detection is explicitly enabled
  - still a contract change, but opt-in
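For comparison, the sidecar and opt-in wrapped variants could be sketched roughly as below. Everything here is an assumption for discussion: the function `save_outputs`, the `wrap` flag, and the sidecar filename `paper.page_metadata.json` are hypothetical and do not exist in the SDK today.

```python
import json
import tempfile
from pathlib import Path


def save_outputs(out_dir, pages, metadata=None, wrap=False):
    """Write paper.json and, when not wrapping, an optional metadata sidecar.

    Hypothetical sketch: names and flags are illustrative only.
    Returns the list of file names written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []

    if metadata is not None and wrap:
        # Opt-in wrapped shape: existing blocks plus metadata in one object.
        payload = {"json_result": pages, **metadata}
    else:
        # Legacy shape: top-level list of per-page block lists.
        payload = pages
    (out / "paper.json").write_text(json.dumps(payload, indent=2), encoding="utf-8")
    written.append("paper.json")

    if metadata is not None and not wrap:
        # Sidecar variant: paper.json stays untouched, metadata goes alongside.
        sidecar = out / "paper.page_metadata.json"  # hypothetical filename
        sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
        written.append(sidecar.name)
    return written


# Made-up data for illustration.
pages = [[{"label": "text"}], []]
metadata = {"page_metadata": [{"page_index": 1, "printed_page_label": "22"}]}

with tempfile.TemporaryDirectory() as tmp:
    sidecar_files = save_outputs(tmp, pages, metadata)
    sidecar_payload = json.loads((Path(tmp) / "paper.json").read_text(encoding="utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    wrapped_files = save_outputs(tmp, pages, metadata, wrap=True)
    wrapped_payload = json.loads((Path(tmp) / "paper.json").read_text(encoding="utf-8"))
```

In this sketch the default path leaves `paper.json` byte-for-byte in its current shape, so the contract only changes for users who pass `wrap=True`.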
My current feeling is that many users of scientific PDFs would benefit from having real printed page numbers available for citation-oriented workflows, especially in RAG systems. But I do not want to move forward with an output-structure change unless maintainers are comfortable with it.
So the most useful feedback for me now would be:
- Is a wrapped JSON output acceptable for GLM-OCR SDK?
- If not, would a separate metadata file be preferred?
- If neither is acceptable, what output format would you prefer for exposing printed page number metadata?