Multiple doc types #732
```python
del new_version_entry["pids"]

# Preserve existing programmes in new versions
existing_programmes = (
```
Check whether the custom fields are preserved between versions.
* change(harvester): skip update if metadata is equal
* refactor: exception handling
* refactor(harvester): add specialised classes to handle drafts, files, matching records
* change(harvester): usage of resource types
* change(harvester): add license mapping
```python
except Exception as e:
    # error_message = f"Unexpected error while processing entry: {str(e)}."
    import traceback
    traceback.print_exc()
```
I think the generic `except Exception` case is a bit risky here. For `WriterError` and `ValidationError` we actually build an `error_message` and add it to `stream_entry.errors`, so the failed entry is tracked properly. Here, though, we only print the traceback and continue: if `_route()` fails because of some unexpected bug, nothing gets added to `stream_entry.errors`, `op_type` can stay unset, and the function still returns the entry. That could leave it in a half-failed state without the failure being tracked clearly. Should we add an explicit error entry there too?
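A minimal sketch of the suggested fix, with `StreamEntry` and `_route` as stand-ins for the harvester's actual classes: record the failure on the entry instead of only printing the traceback, so downstream code can see that processing failed.

```python
import traceback
from dataclasses import dataclass, field


@dataclass
class StreamEntry:
    """Stand-in for the harvester's stream entry object."""
    entry: dict
    errors: list = field(default_factory=list)


def _route(stream_entry):
    # Simulate an unexpected bug inside routing.
    raise RuntimeError("unexpected bug")


def write(stream_entry):
    try:
        _route(stream_entry)
    except Exception as e:
        # Track the failure explicitly instead of only printing it,
        # so the entry is not silently returned in a half-failed state.
        error_message = f"Unexpected error while processing entry: {e}."
        stream_entry.errors.append(error_message)
        traceback.print_exc()
    return stream_entry


result = write(StreamEntry(entry={"id": 1}))
print(result.errors)  # the failure is now visible on the entry
```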
```python
for doc_type in doc_types:
    self.logger.debug(f"Mapping {doc_type} to version.")
    resource_type = INSPIRE_DOCUMENT_TYPE_MAPPING[doc_type]
    self.logger.info(f"Mapped {doc_type} to {resource_type}.")
    if resource_type is not self.main_res_type:
        version_ctx = MetadataSerializationContext(
            resource_type=resource_type,
            inspire_id=self.inspire_id,
            cds_rdm_id=self.cds_id,
        )
        mappers = self.policy.build_for(resource_type)
        assert_unique_ids(mappers)
        patches = [
            m.apply(self.inspire_record, version_ctx, self.logger)
            for m in mappers
        ]
```
I might be misunderstanding this, but is there a risk of generating duplicate versions for the same target resource type here? `split()` loops over every raw INSPIRE `document_type`, and if two different doc types map to the same CDS `resource_type`, it looks like we would append two versions with the same target type. The writer would then process that same resource type more than once. I also wonder whether the unused `_PREPRINT_DOC_TYPES` constant was meant to group some of these doc types together instead.
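One way to avoid the duplicates would be to deduplicate target resource types before creating versions. This is a hypothetical sketch (the mapping below is an illustrative subset, not the real `INSPIRE_DOCUMENT_TYPE_MAPPING`): two doc types that resolve to the same `resource_type` yield a single planned version, and the main type is skipped.

```python
# Illustrative subset of the mapping, for demonstration only.
INSPIRE_DOCUMENT_TYPE_MAPPING = {
    "article": "publication-article",
    "conference paper": "publication-conferencepaper",
    "note": "publication-preprint",
    "report": "publication-preprint",
}


def target_resource_types(doc_types, main_res_type):
    """Return unique non-main target resource types, preserving order."""
    seen = set()
    ordered = []
    for doc_type in doc_types:
        resource_type = INSPIRE_DOCUMENT_TYPE_MAPPING[doc_type]
        if resource_type == main_res_type or resource_type in seen:
            # Skip the main record's type and already-planned versions.
            continue
        seen.add(resource_type)
        ordered.append(resource_type)
    return ordered


print(target_resource_types(["note", "report", "article"], "publication-article"))
# → ['publication-preprint']
```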
```python
source = abstract.get("source", "").lower()
if source and source not in ["arxiv", "cds"]:
    return abstract["value"]
return abstracts[0]["value"]
```
I think this may return too early. Because the fallback `return abstracts[0]["value"]` is inside the loop, we only really inspect the first abstract: if a preferred non-arXiv/non-CDS abstract appears later in the list, we never reach it. Was the fallback meant to be outside the loop?
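A sketch of the suggested fix (`pick_abstract` is a hypothetical name for illustration): moving the fallback outside the loop means every abstract is inspected before defaulting to the first one.

```python
def pick_abstract(abstracts):
    """Prefer a non-arXiv/non-CDS abstract; otherwise fall back to the first."""
    for abstract in abstracts:
        source = abstract.get("source", "").lower()
        if source and source not in ["arxiv", "cds"]:
            return abstract["value"]
    # Fallback only after no preferred abstract was found anywhere in the list.
    return abstracts[0]["value"]


abstracts = [
    {"source": "arXiv", "value": "arxiv abstract"},
    {"source": "Elsevier", "value": "publisher abstract"},
]
print(pick_abstract(abstracts))  # → "publisher abstract"
```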
```python
source = abstract.get("source", "").lower()
if source and source in ["arxiv", "cds"]:
    return abstract["value"]
return abstracts[0]["value"]
```
I think this has the same early-return issue as `article.py`.