Multiple doc types by kpsherva · Pull Request #732 · CERNDocumentServer/cds-rdm

kpsherva · 2026-03-11T09:54:11Z

Closes Inspire Harvester: Explore and implement a solution to map multiple document types as resource type (#567)

kpsherva · 2026-03-20T16:22:37Z

-            del new_version_entry["pids"]
-
-        # Preserve existing programmes in new versions
-        existing_programmes = (


Check whether the custom fields are preserved between versions.


TahaKhan998 · 2026-05-13T14:24:17Z

-            # error_message = f"Unexpected error while processing entry: {str(e)}."
+            import traceback
+            traceback.print_exc()
+        except Exception as e:


I think the generic except Exception case is a bit risky here. For WriterError and ValidationError we actually build an error_message and add it to stream_entry.errors, so the failed entry is tracked properly. But here we only print the traceback and then continue. So if _route() fails because of some unexpected bug, nothing gets added to stream_entry.errors, op_type can stay unset, and the function still returns the entry. That could leave the entry in a half-failed state without the failure being tracked clearly. Should we add an explicit error there too?
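A minimal sketch of what recording the unexpected failure could look like. The `StreamEntry` and `WriterError` classes and the `write_entry` and `route` names below are hypothetical stand-ins, not the harvester's real classes; only the idea of appending to `stream_entry.errors` comes from the comment above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


class WriterError(Exception):
    """Hypothetical stand-in for the harvester's WriterError."""


@dataclass
class StreamEntry:
    """Hypothetical stand-in for the stream entry handled by the writer."""

    entry: Any
    errors: list = field(default_factory=list)
    op_type: Optional[str] = None


def write_entry(stream_entry, route, logger=None):
    """Route one entry and track every failure on the entry itself."""
    try:
        stream_entry.op_type = route(stream_entry)
    except WriterError as e:
        # Expected failure: tracked with a specific message.
        stream_entry.errors.append(f"Error while processing entry: {e}.")
    except Exception as e:
        # Unexpected failure: keep the traceback in the logs, but also
        # record the error so the entry is not returned as if it succeeded.
        if logger is not None:
            logger.exception("Unexpected error while processing entry")
        stream_entry.errors.append(
            f"Unexpected error while processing entry: {e}."
        )
    return stream_entry
```

With this shape, a caller can treat any entry with a non-empty `errors` list as failed, regardless of which exception caused it.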

TahaKhan998 · 2026-05-13T14:44:30Z

+        for doc_type in doc_types:
+            self.logger.debug(f"Mapping {doc_type} to version.")
+            resource_type = INSPIRE_DOCUMENT_TYPE_MAPPING[doc_type]
+            self.logger.info(f"Mapped {doc_type} to {resource_type}.")
+            if resource_type is not self.main_res_type:
+                version_ctx = MetadataSerializationContext(
+                    resource_type=resource_type,
+                    inspire_id=self.inspire_id,
+                    cds_rdm_id=self.cds_id,
+                )
+                mappers = self.policy.build_for(resource_type)
+                assert_unique_ids(mappers)
+                patches = [
+                    m.apply(self.inspire_record, version_ctx, self.logger)
+                    for m in mappers
+                ]
+


I might be misunderstanding this, but is there a risk of generating duplicate versions for the same target resource type here? split() loops over every raw INSPIRE document_type, and if two different doc types map to the same CDS resource_type, it looks like we would append two versions with the same target type. Then later the writer would process that same resource type more than once. I’m also wondering if the unused _PREPRINT_DOC_TYPES constant was meant to group some of these doc types together instead.
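One possible way to avoid duplicate versions is to collect the target resource types first and skip any that were already seen. The sketch below is an assumption, not the PR's code: `unique_target_types` is a hypothetical helper, and `EXAMPLE_MAPPING` holds made-up values rather than the real `INSPIRE_DOCUMENT_TYPE_MAPPING`. It also compares by equality instead of `is not`, which assumes resource types are plain strings.

```python
def unique_target_types(doc_types, mapping, main_res_type):
    """Return target resource types in first-seen order.

    The main resource type and any type already produced by an earlier
    document type are skipped, so each target appears at most once.
    """
    seen = {main_res_type}
    targets = []
    for doc_type in doc_types:
        resource_type = mapping[doc_type]
        if resource_type in seen:
            continue
        seen.add(resource_type)
        targets.append(resource_type)
    return targets


# Hypothetical mapping used only for illustration.
EXAMPLE_MAPPING = {
    "article": "publication-article",
    "conference paper": "publication-conferencepaper",
    "proceedings": "publication-conferencepaper",
    "thesis": "publication-thesis",
}
```

The version loop could then iterate over the returned list, building one serialization context per distinct target type.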

TahaKhan998 · 2026-05-13T14:57:03Z

+            source = abstract.get("source", "").lower()
+            if source and source not in ["arxiv", "cds"]:
+                return abstract["value"]
+            return abstracts[0]["value"]


I think this may return too early. Because the fallback return abstracts[0]["value"] is inside the loop, it looks like we only really inspect the first abstract. So if a preferred non-arxiv/non-cds abstract appears later in the list, we would never reach it. Was the fallback meant to be outside the loop?
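A sketch of the selection with the fallback moved outside the loop, so every abstract is inspected before falling back to the first one. The `pick_abstract` function name, the `excluded_sources` parameter, and the handling of an empty list are assumptions made for illustration.

```python
def pick_abstract(abstracts, excluded_sources=("arxiv", "cds")):
    """Prefer the first abstract whose source is not excluded.

    Falls back to the first abstract only after the whole list has
    been checked. Returns None for an empty list (an assumption).
    """
    if not abstracts:
        return None
    for abstract in abstracts:
        source = (abstract.get("source") or "").lower()
        if source and source not in excluded_sources:
            return abstract["value"]
    # No preferred abstract found anywhere in the list.
    return abstracts[0]["value"]
```

The same shape would apply to the inverted check, where the arxiv/cds sources are the preferred ones, by swapping the membership test.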

TahaKhan998 · 2026-05-13T15:04:23Z

+            source = abstract.get("source", "").lower()
+            if source and source in ["arxiv", "cds"]:
+                return abstract["value"]
+            return abstracts[0]["value"]


I think this may have the same early return issue as article.py.

kpsherva moved this to In progress in Sprint Q2 2026 ☀️ Mar 11, 2026

kpsherva added this to Sprint Q2 2026 ☀️ Mar 11, 2026

kpsherva self-assigned this Mar 11, 2026

kpsherva removed this from Sprint Q2 2026 ☀️ Mar 11, 2026

kpsherva force-pushed the multiple-doc-types branch from 241a5e4 to ad92c3a Compare March 12, 2026 08:12

kpsherva force-pushed the multiple-doc-types branch from ad92c3a to afa86dc Compare March 20, 2026 16:07

kpsherva commented Mar 20, 2026


kpsherva marked this pull request as ready for review March 23, 2026 12:35

kpsherva added 12 commits April 27, 2026 16:34

change(harvester): improve writer interface

deead9b

* change (harvester): skip update if metadata is equal # Conflicts: # site/cds_rdm/inspire_harvester/writer.py

chore(resolver): return always the same type

6e32b8d

refactor: exception handling # Conflicts: # site/cds_rdm/legacy/redirector.py # Conflicts: # site/cds_rdm/legacy/redirector.py

fix(schemes): pattern for inis id

b45c63f

feat(harvester): introduce notion of versions in the writer

095afe2

* refactor(harvester): add specialised classes to handle drafts, files, matching records

feat(harvester): add specialized mappers per res type

bdbe233

tests: add multiple resource type case

21af5e1

chore(harvester): fix tests

fccc044

wip: adjust file field update

b8129f1

change(harvester): when to update files policy

21fd7d4

* change(harvester): usage of resource types

change(harvester): splitting metadata assignment by resource type

ba1bf6a

* change(harvester): add license mapping

chore(harvester): formatting

239abee

feat(harvester): add ROR assignment in mapping

85fa1ee

kpsherva force-pushed the multiple-doc-types branch from 3062e73 to 85fa1ee Compare April 27, 2026 14:48

fix(tests): resource type assert

6e1a803

TahaKhan998 reviewed May 13, 2026

kpsherva wants to merge 13 commits into CERNDocumentServer:master from kpsherva:multiple-doc-types


2 participants
