Feature/276 harvesting metadata from a provided repository url #278
@sferenz
Harvesting metadata from the provided URL (GitHub/GitLab). Command:
sferenz left a comment
Thanks for the nice code! Please have a look at the comments :)
src/hermes/commands/base.py
Outdated
  """Load settings from the configuration file (passed in from command line)."""
- toml_data = toml.load(args.path / args.config)
+ toml_data = toml.load("." / args.config)
Does this still work if a regular path is given to HERMES?
Yes, specifying a directory path containing CFF or CodeMeta files is also acceptable. For example, the following command works:
hermes harvest --path C:\path\to\your\directory
- class HarvestSettings(BaseModel):
+ class _HarvestSettings(BaseModel):
I didn't intend to change the class name. I mistakenly took the original code for base.py from a newer version of the file: the settings classes were recently all made private in the develop branch of HERMES (commit a6c1a5e).
return None

def _download_to_tempfile(url: str, filename: str) -> pathlib.Path:
Do you delete the tempfiles later?
In the current code the temp files for CFF and CodeMeta are stored separately in C:\Temp on the local machine.
Will these files be deleted after the extraction process?
The files are currently not deleted after harvesting. However, I can modify the code to delete the temp files after extraction. Do you agree with this change?
Yes, I think the temp files should be deleted at the end of the process.
I updated the code to remove the temp files after the harvesting process. Could you please have a look at the changes?
@sferenz Thank you for the comments.
Add functionality to remove temp files generated during remote harvesting.
Remove temp files after harvesting CFF metadata
Remove temp files after harvesting CodeMeta metadata
…rovided-repository-URL' to incorporate the recent updates
To support repository URL as a path
@sdruskat This pull request is ready to merge. Can you please assign us a reviewer?
Thanks for your work!
I had a first look and I would like to suggest a slightly different approach. I think it would be beneficial to have the --url argument that you had (as indicated by your PR description). This would allow us to do the following:
- Create a temporary directory
- Download the remote repository given by --url to this directory
- Overwrite args.path with the temporary directory path
- Run the normal harvesting step
- Delete the temporary directory
In this case there is no need to change anything in any of the plugins (I think). Only the base harvest command needs to worry about downloading and then deleting the files.
What do you think?
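The steps listed in this comment can be sketched roughly as follows. This is only an illustration under assumptions: run_harvest, download and harvest are hypothetical names rather than HERMES functions, and the download step is passed in as a callable, so the sketch makes no claim about how the repository is actually fetched.

```python
import argparse
import pathlib
import tempfile
from typing import Callable, Optional


def run_harvest(
    args: argparse.Namespace,
    download: Callable[[str, pathlib.Path], None],  # hypothetical: fetches the repo files
    harvest: Callable[[argparse.Namespace], dict],  # hypothetical: the normal harvesting step
) -> dict:
    """Harvest from args.url via a temporary directory, else from args.path."""
    url: Optional[str] = getattr(args, "url", None)
    if url is None:
        # No URL given: behave exactly like a local harvest.
        return harvest(args)

    original_path = args.path
    # The directory and everything in it is removed when the with block exits,
    # even if downloading or harvesting raises.
    with tempfile.TemporaryDirectory() as tmp:
        args.path = pathlib.Path(tmp)
        try:
            download(url, args.path)
            return harvest(args)
        finally:
            # Do not leave args.path pointing at a deleted directory.
            args.path = original_path
```

With this shape, the plugins only ever see a local path, and downloading and cleanup stay in one place.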
return None

def remove_temp_file(file_path: pathlib.Path, temp_dir: pathlib.Path = pathlib.Path("C:/Temp")):
C:/Temp is Windows-specific. You could use tempfile.TemporaryDirectory and place the files in there. Then, instead of deleting the files one by one, you can call .cleanup() on the TemporaryDirectory object.
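A minimal sketch of this suggestion, assuming a hypothetical helper stage_files that writes already-downloaded content into a tempfile.TemporaryDirectory. The helper name, the "hermes-" prefix, and the file content are placeholders and not part of the PR.

```python
import pathlib
import tempfile
from typing import Dict, List, Tuple


def stage_files(
    contents: Dict[str, str],
) -> Tuple[tempfile.TemporaryDirectory, List[pathlib.Path]]:
    """Write each name/text pair into a fresh, platform-independent temp directory."""
    tmp_dir = tempfile.TemporaryDirectory(prefix="hermes-")  # prefix is an assumption
    paths = []
    for name, text in contents.items():
        target = pathlib.Path(tmp_dir.name) / name
        target.write_text(text, encoding="utf-8")
        paths.append(target)
    return tmp_dir, paths


tmp_dir, paths = stage_files({"CITATION.cff": "cff-version: 1.2.0\n"})
try:
    for path in paths:
        print(path.name, path.exists())
finally:
    # One call removes the directory and every file inside it.
    tmp_dir.cleanup()
```

The caller has to keep the TemporaryDirectory object until harvesting is done, because the directory may be removed once the object is garbage-collected.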
return corrected_url.replace("https:/", "https://")

def fetch_metadata_from_repo(repo_url: str, filename: str) -> t.Optional[pathlib.Path]:
This method makes multiple HTTP requests. I think it would be nice to use the hermes user agent, just to let the services know who we are. You can do something like:
from hermes.utils import hermes_user_agent
session = requests.Session()
session.headers.update({"User-Agent": hermes_user_agent})
Then use the session to make the requests:
session.get(api_url)
Thanks! I’ve implemented this approach and am testing it with a few different repositories.
Added the --url argument to the hermes harvest command. Now, hermes harvest harvests the metadata from the local repository, and hermes harvest --url <URL> harvests metadata from the provided URL, with support for GitHub and GitLab repositories (e.g., hermes harvest --url https://github.com/NFDI4Energy/SMECS).
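For illustration, an optional --url argument next to --path could be declared with argparse along these lines. This is a sketch under assumptions, not the actual HERMES command-line code: build_parser, the defaults, and the help texts are hypothetical.

```python
import argparse
import pathlib


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser with a 'harvest' subcommand taking --path and --url."""
    parser = argparse.ArgumentParser(prog="hermes")
    subparsers = parser.add_subparsers(dest="command", required=True)

    harvest = subparsers.add_parser("harvest", help="Harvest metadata")
    harvest.add_argument(
        "--path",
        type=pathlib.Path,
        default=pathlib.Path("."),
        help="Local directory to harvest from (default: current directory)",
    )
    harvest.add_argument(
        "--url",
        default=None,
        help="GitHub or GitLab repository URL to harvest from instead of --path",
    )
    return parser


args = build_parser().parse_args(
    ["harvest", "--url", "https://github.com/NFDI4Energy/SMECS"]
)
print(args.command, args.path, args.url)
```

Leaving --url unset keeps the local behaviour unchanged, so the remote case is an opt-in.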