Skip to content

Multiple doc types#732

Open
kpsherva wants to merge 13 commits into
CERNDocumentServer:masterfrom
kpsherva:multiple-doc-types
Open

Multiple doc types#732
kpsherva wants to merge 13 commits into
CERNDocumentServer:masterfrom
kpsherva:multiple-doc-types

Conversation

@kpsherva
Copy link
Copy Markdown
Contributor

@kpsherva kpsherva commented Mar 11, 2026

del new_version_entry["pids"]

# Preserve existing programmes in new versions
existing_programmes = (
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if the custom fields are preserved between versions

@kpsherva kpsherva marked this pull request as ready for review March 23, 2026 12:35
kpsherva added 12 commits April 27, 2026 16:34
* change (harvester): skip update if metadata is equal

# Conflicts:
#	site/cds_rdm/inspire_harvester/writer.py
refactor: exception handling

# Conflicts:
#	site/cds_rdm/legacy/redirector.py

# Conflicts:
#	site/cds_rdm/legacy/redirector.py
* refactor(harvester): add specialised classes to handle
                       drafts, files, matching records
* change(harvester): usage of resource types
@kpsherva kpsherva force-pushed the multiple-doc-types branch from 3062e73 to 85fa1ee Compare April 27, 2026 14:48
# error_message = f"Unexpected error while processing entry: {str(e)}."
import traceback
traceback.print_exc()
except Exception as e:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the generic except Exception case is a bit risky here. For WriterError and ValidationError we actually build an error_message and add it to stream_entry.errors, so the failed entry is tracked properly. But here we only print the traceback and then continue. So if _route() fails because of some unexpected bug, nothing gets added to stream_entry.errors, op_type can stay unset, and the function still returns the entry. That feels like it could leave it in a half-failed state without the failure being tracked clearly. Should we either add an explicit error there too?

Comment on lines +71 to +87
for doc_type in doc_types:
self.logger.debug(f"Mapping {doc_type} to version.")
resource_type = INSPIRE_DOCUMENT_TYPE_MAPPING[doc_type]
self.logger.info(f"Mapped {doc_type} to {resource_type}.")
if resource_type is not self.main_res_type:
version_ctx = MetadataSerializationContext(
resource_type=resource_type,
inspire_id=self.inspire_id,
cds_rdm_id=self.cds_id,
)
mappers = self.policy.build_for(resource_type)
assert_unique_ids(mappers)
patches = [
m.apply(self.inspire_record, version_ctx, self.logger)
for m in mappers
]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be misunderstanding this, but is there a risk of generating duplicate versions for the same target resource type here? split() loops over every raw INSPIRE document_type, and if two different doc types map to the same CDS resource_type, it looks like we would append two versions with the same target type. Then later the writer would process that same resource type more than once. I’m also wondering if the unused _PREPRINT_DOC_TYPES constant was meant to group some of these doc types together instead.

source = abstract.get("source", "").lower()
if source and source not in ["arxiv", "cds"]:
return abstract["value"]
return abstracts[0]["value"] No newline at end of file
Copy link
Copy Markdown

@TahaKhan998 TahaKhan998 May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this may return too early. Because the fallback return abstracts[0]["value"] is inside the loop, it looks like we only really inspect the first abstract. So if a preferred non-arxiv/non-cds abstract appears later in the list, we would never reach it. Was the fallback meant to be outside the loop?

source = abstract.get("source", "").lower()
if source and source in ["arxiv", "cds"]:
return abstract["value"]
return abstracts[0]["value"] No newline at end of file
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may have the same early return issue as article.py.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Inspire Harvester: Explore and implement a solution to map multiple document types as resource type

2 participants