Merged
40 commits
4c1631d
Merge branch 'refactor/data-model' into refactor/384-test-ld_dict
sdruskat Sep 9, 2025
1feddda
Add basic implementation of API class
sdruskat Sep 9, 2025
74ba45d
Test initialization of API class
sdruskat Sep 9, 2025
79575b8
Test API object initialization with and without data
sdruskat Sep 17, 2025
69f6a24
Test API object initialization with nested object
sdruskat Sep 17, 2025
8e1a38b
Test appending objects to model via API
sdruskat Sep 17, 2025
b65989e
Test model building via API object
sdruskat Sep 17, 2025
59180c7
added an add method to SoftwareMetadata and improved __init__ of it a…
Sep 25, 2025
daed5d3
Change existing test to assume returned lists
sdruskat Sep 26, 2025
4583915
Add test for harvesting case
sdruskat Sep 26, 2025
6808272
Add more comprehensive usage test
sdruskat Sep 26, 2025
2f7eadf
Add new license annotation for Python files
sdruskat Sep 26, 2025
0f32494
changed conversions of types to output ld_lists for every item in a dict
Sep 26, 2025
8298e49
added some tests for the conversions and formatted to satisfy flake8
Sep 26, 2025
3a8bfbe
added three more conversions for container to expanded json
Sep 26, 2025
2ef89d3
always return a list when getting an item from ld_dict
Sep 26, 2025
2db93cf
added tests and fixed issues
Sep 26, 2025
1721325
clean up
Oct 2, 2025
1fb7574
Comment out local extension that breaks build
sdruskat Oct 2, 2025
8d147a8
Document data model API
sdruskat Oct 2, 2025
7e1ac64
Update dependency lock
sdruskat Oct 2, 2025
9be8041
removed tests of unclear matters (@type and @context fields) and adde…
Oct 6, 2025
dd854c7
Track data in model in simplified form
sdruskat Oct 17, 2025
f4c1e7d
Link to dummy section
sdruskat Oct 17, 2025
97ebad4
Make tone less intimidating, more neutral/positive
sdruskat Oct 17, 2025
7adb02f
Merge branch 'refactor/data-model' into refactor/423-implement-public…
SKernchen Dec 19, 2025
6f039e8
slightly adjusted tests and fixed miniature bugs in ld_container and …
Dec 19, 2025
c2b9c4f
cleaned up __init__.py
Jan 5, 2026
bd1a19f
ran 'poetry lock'
Jan 5, 2026
9527e26
updated type hints to be supported by python 3.10
Jan 5, 2026
97d9d95
update type hints and began commenting ld_dict
Jan 5, 2026
cd6e3d5
added and updated comments
Jan 9, 2026
de457e3
Apply Style Changes (Author names instead of foo etc.)
SKernchen Jan 13, 2026
e9d010f
Correct lower letters for emails
SKernchen Jan 13, 2026
23ec20c
Merge branch 'refactor/423-implement-public-api' into refactor/423-im…
SKernchen Jan 13, 2026
8ace58c
Merge pull request #438 from softwarepub/refactor/423-implement-publi…
SKernchen Jan 13, 2026
605201d
fixed small bug in set_item of ld_dict
Jan 16, 2026
eb6f587
Fix compact_iri for schema elements with containers
SKernchen Jan 16, 2026
0284b01
Correct Docs for newer functionality
SKernchen Jan 20, 2026
d46394e
Correct type of value
SKernchen Jan 23, 2026
6 changes: 6 additions & 0 deletions REUSE.toml
@@ -17,3 +17,9 @@ path = ["REUSE.toml"]
precedence = "aggregate"
SPDX-FileCopyrightText = "German Aerospace Center (DLR), Helmholtz-Zentrum Dresden-Rossendorf, Forschungszentrum Jülich"
SPDX-License-Identifier = "CC0-1.0"

[[annotations]]
path = ["src/**/*.py", "test/**/*.py"]
precedence = "aggregate"
SPDX-FileCopyrightText = "German Aerospace Center (DLR), Helmholtz-Zentrum Dresden-Rossendorf, Forschungszentrum Jülich"
SPDX-License-Identifier = "Apache-2.0"
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -102,7 +102,7 @@ def read_version_from_pyproject():
'sphinx_togglebutton',
'sphinxcontrib.datatemplates',
# Custom extensions, see `_ext` directory.
'plugin_markup',
# 'plugin_markup',
]

language = 'en'
285 changes: 272 additions & 13 deletions docs/source/dev/data_model.md
@@ -1,27 +1,286 @@
<!--
SPDX-FileCopyrightText: 2022 German Aerospace Center (DLR)
SPDX-FileCopyrightText: 2025 German Aerospace Center (DLR)

SPDX-License-Identifier: CC-BY-SA-4.0
-->

<!--
SPDX-FileContributor: Michael Meinel
SPDX-FileContributor: Stephan Druskat <stephan.druskat@dlr.de>
-->

# HERMES Data Model
# Data model

*hermes* uses an internal data model to store the output of the different stages.
All the data is collected in a directory called `.hermes` located in the root of the project directory.
`hermes`' internal data model acts like a contract between `hermes` and plugins.
It is based on [**JSON-LD (JSON Linked Data)**](https://json-ld.org/), and
the public API simplifies interaction with the data model through Python code.

You should not need to interact with this data directly.
Instead, use {class}`hermes.model.context.HermesContext` and respective subclasses to access the data in a consistent way.
Output of the different `hermes` commands consequently is valid JSON-LD, serialized as JSON, that is cached in
subdirectories of the `.hermes/` directory that is created in the root of the project directory.

The cache is purely for internal purposes; do not interact with its data directly.

## Harvest Data
Depending on whether you develop a plugin for `hermes`, or you develop `hermes` itself, you need to know either [_some_](#json-ld-for-plugin-developers),
or _quite a few_ things about JSON-LD.

The data of the harvesters is cached in the sub-directory `.hermes/harvest`.
Each harvester has a separate cache file to allow parallel harvesting.
The cache file is encoded in JSON and stored in `.hermes/harvest/HARVESTER_NAME.json`
where `HARVESTER_NAME` corresponds to the entry point name.
The following sections provide documentation of the data model.
They aim to help you get started with `hermes` plugin and core development,
even if you have no previous experience with JSON-LD.

{class}`hermes.model.context.HermesHarvestContext` encapsulates these harvester caches.
## The data model for plugin developers

If you develop a plugin for `hermes`, you will only need to work with a single Python class and the public API
it provides: {class}`hermes.model.SoftwareMetadata`.

To work with this class, it is necessary that you know _some_ things about JSON-LD.

### JSON-LD for plugin developers

```{attention}
Work in progress.
```


### Working with the `hermes` data model in plugins

> **Goal**
> Understand how plugins access the `hermes` data model and interact with it.

`hermes` aims to hide as much of the data model as possible behind a public API,
so that plugin developers do not have to deal with some of the more complex features of JSON-LD.

#### Model instances in different types of plugin

You can extend `hermes` with plugins for three different commands: `harvest`, `curate`, and `deposit`.

The commands differ in how they work with instances of the data model.

- `harvest` plugins _create_ a single new model instance and return it.
- `curate` plugins are passed a single existing model instance (the output of `process`),
and return a single model instance.
- `deposit` plugins are passed a single existing model instance (the output of `curate`),
and return a single model instance.

#### How plugins work with the API

```{important}
Plugins access the data model _exclusively_ through the API class {class}`hermes.model.SoftwareMetadata`.
```

The following sections show how this class works.

##### Creating a data model instance

Model instances are primarily created in `harvest` plugins, but may also be created in other plugins
as a target to map existing data into.

To create a new model instance, initialize {class}`hermes.model.SoftwareMetadata`:

```{code-block} python
:caption: Initializing a default data model instance
from hermes.model import SoftwareMetadata

data = SoftwareMetadata()
```

`SoftwareMetadata` objects initialized without arguments provide the default _context_
(see [_JSON-LD for plugin developers_](#json-ld-for-plugin-developers)).
This means that now, you can use terms from the schemas included in the default context to describe software metadata.

Terms from [_CodeMeta_](https://codemeta.github.io/terms/) can be used without a prefix:

```{code-block} python
:caption: Using terms from the default schema
data["readme"] = ...
```

Terms from [_Schema.org_](https://schema.org/) can be used with the prefix `schema`:

```{code-block} python
:caption: Using terms from a non-default schema
data["schema:copyrightNotice"] = ...
```
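
How a prefixed term relates to a full identifier can be sketched in plain Python.
The snippet below is only an illustration and does not use `hermes`:
the `PREFIXES` table, the `DEFAULT_PREFIX` fallback and the `expand_term` helper are hypothetical,
and the IRIs are assumptions.
Real JSON-LD contexts map each term individually, so an unprefixed term does not necessarily
resolve to the default vocabulary.

```python
# Hypothetical prefix table; names and IRIs are assumptions for illustration only.
PREFIXES = {
    "schema": "http://schema.org/",
    "codemeta": "https://codemeta.github.io/terms/",
}
DEFAULT_PREFIX = "codemeta"


def expand_term(term: str) -> str:
    """Expand a (possibly prefixed) term to a full IRI using the toy prefix table."""
    prefix, separator, local_name = term.partition(":")
    if separator and prefix in PREFIXES:
        # Known prefix: replace it with the vocabulary base IRI.
        return PREFIXES[prefix] + local_name
    # No known prefix: fall back to the default vocabulary (a simplification).
    return PREFIXES[DEFAULT_PREFIX] + term


expanded_prefixed = expand_term("schema:copyrightNotice")
expanded_default = expand_term("readme")
```

In this sketch, the prefix only selects the vocabulary in which the term is looked up.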

You can also use other linked data vocabularies. To do this, you need to identify them with a prefix and register them
with the data model by passing it `extra_vocabs` as a `dict` mapping prefixes to URLs where the vocabularies are
provided as JSON-LD:

```{code-block} python
:caption: Injecting additional schemas
from hermes.model import SoftwareMetadata

# Contents served at https://bar.net/schema.jsonld:
# {
#     "@context":
#     {
#         "name": "https://schema.org/name"
#     }
# }

data = SoftwareMetadata(extra_vocabs={"foo": "https://bar.net/schema.jsonld"})

data["foo:name"] = ...
```

##### Adding data

Once you have an instance of {class}`hermes.model.SoftwareMetadata`, you can add data to it,
i.e., metadata that describes software:

```{code-block} python
:caption: Setting data values
data["name"] = "My Research Software" # A simple "Text"-type value
# → Simplified model representation : { "name": [ "My Research Software" ] }
# Cf. "Accessing data" below
data["author"] = {"name": "Shakespeare"} # An object value that uses terms available in the defined context
# → Simplified model representation : { "name": [ "My Research Software" ], "author": [ { "name": "Shakespeare" } ] }
# Cf. "Accessing data" below
```

##### Accessing data

You need to be able to access data in the data model instance to add, edit or remove data.
Data can be accessed by using term strings, similar to how values in Python `dict`s are accessed by keys.

```{important}
When you access data from a data model instance,
it will always be returned in a **list**-like object!
```

The reason for providing data in list-like objects is that JSON-LD treats all property values as arrays.
Even if you add "single value" data to a `hermes` data model instance via the API, the underlying JSON-LD model
will treat it as an array, i.e., a list-like object:

```{code-block} python
:caption: Internal data values are arrays
data["name"] = "My Research Software" # → [ "My Research Software" ]
data["author"] = {"name": "Shakespeare"} # → [ { "name": [ "Shakespeare" ] } ]
```

Therefore, you access data in the same way you would access data from a Python `list`:

1. You access single values using indices, e.g., `data["name"][0]`.
2. You can use a list-like API to interact with data objects, e.g.,
`data["name"].append("Hamilton")`, `data["name"].extend(["Hamilton", "Knuth"])`, `for name in data["name"]: ...`, etc.
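
The list-wrapping behavior described above can be imitated with a small stand-in written in plain Python.
This is a minimal sketch, not the `hermes` implementation: the `ListWrappingDict` class is hypothetical,
and it only mimics how single values end up in list-like containers.

```python
class ListWrappingDict(dict):
    """Toy stand-in (not part of hermes): stores every assigned value as a list."""

    def __setitem__(self, key, value):
        # Wrap single values in a list; copy lists so later edits do not affect the caller's list.
        wrapped = list(value) if isinstance(value, list) else [value]
        super().__setitem__(key, wrapped)


data = ListWrappingDict()
data["name"] = "My Research Software"

# Single values are read back by index ...
first_name = data["name"][0]

# ... and the stored value supports the usual list operations.
data["name"].append("Hamilton")
data["name"].extend(["Knuth"])
all_names = [name for name in data["name"]]
```

The stand-in shows why `data["name"][0]` is needed to read back a value that was assigned as a plain string.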

##### Interacting with data

The following longer example shows different ways that you can interact with `SoftwareMetadata` objects and the data API.

```{code-block} python
:caption: Building the data model
from hermes.model import SoftwareMetadata

# Create the model object with the default context
data = SoftwareMetadata()

# Let's create author metadata for our software!
# Below each line of code, the value of `data["author"]` is given.

data["author"] = {"name": "Shakespeare"}
# → [{'name': ['Shakespeare']}]

data["author"].append({"name": "Hamilton"})
# → [{'name': ['Shakespeare']}, {'name': ['Hamilton']}]

data["author"][0]["email"] = "shakespeare@baz.net"
# → [{'name': ['Shakespeare'], 'email': ['shakespeare@baz.net']}, {'name': ['Hamilton']}]

data["author"][1]["email"].append("hamilton@baz.net")
# → [{'name': ['Shakespeare'], 'email': ['shakespeare@baz.net']}, {'name': ['Hamilton'], 'email': ['hamilton@baz.net']}]

data["author"][1]["email"].extend(["hamilton@spam.org", "hamilton@eggs.com"])
# [
# {'name': ['Shakespeare'], 'email': ['shakespeare@baz.net']},
# {'name': ['Hamilton'], 'email': ['hamilton@baz.net', 'hamilton@spam.org', 'hamilton@eggs.com']}
# ]
```

The example continues to show how to iterate through data.

```{code-block} python
:caption: for-loop, containment check
for i, author in enumerate(data["author"], start=1):
    if author["name"][0] in ["Shakespeare", "Hamilton"]:
        print(f"Author {i} has expected name.")
    else:
        raise ValueError("Unexpected author name found!", author["name"][0])

# Mock output:
# $> Author 1 has expected name.
# $> Author 2 has expected name.
```

```{code-block} python
:caption: Value check
for email in data["author"][0]["email"]:
    if email.endswith(".edu"):
        print("Shakespeare has an email address at an educational institution.")
    else:
        print("Cannot confirm affiliation with educational institution for Shakespeare.")

# Mock output
# $> Cannot confirm affiliation with educational institution for Shakespeare.
```

```{code-block} python
:caption: Value check and list comprehension
if all(["hamilton" in email for email in data["author"][1]["email"]]):
    print("Author has only emails with their name in it.")

# Mock output
# $> Author has only emails with their name in it.
```

The example continues to show how to assert data values.

As mentioned in the [introduction to the data model](#data-model),
`hermes` uses a JSON-LD-like internal data model.
The API class {class}`hermes.model.SoftwareMetadata` hides many
of the more complex aspects of JSON-LD and makes it easy to work
with the data model.

Because the API class hides the internal model objects,
the values it returns behave as you would expect from plain
Python data:

```{code-block} python
:caption: Containment assertion against plain Python data
:emphasize-lines: 5,13
try:
    assert (
        {'name': ['Shakespeare'], 'email': ['shakespeare@baz.net']}
        in
        data["author"]
    )
    print("The author was found!")
except AssertionError:
    print("The author could not be found.")
    raise

# Mock output
# $> The author was found!
#
#
# Internal Model from data["author"]:
# {'@list': [
#     {
#         'http://schema.org/name': [{'@value': 'Shakespeare'}],
#         'http://schema.org/email': [{'@value': 'shakespeare@baz.net'}]
#     },
#     {
#         'http://schema.org/name': [{'@value': 'Hamilton'}],
#         'http://schema.org/email': [
#             {'@list': [
#                 {'@value': 'hamilton@baz.net'}, {'@value': 'hamilton@spam.org'}, {'@value': 'hamilton@eggs.com'}
#             ]}
#         ]
#     }]
# }
```
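
The relationship between the internal model and the plain view shown above can be sketched in plain Python.
The `simplify` helper below is hypothetical and is not the conversion code used by `hermes`.
It assumes that short term names can be recovered by taking the last path segment of each IRI,
which is a simplification of how JSON-LD compaction works.

```python
def simplify(node):
    """Reduce an expanded JSON-LD-style value to plain Python data (illustrative only)."""
    if isinstance(node, list):
        result = []
        for item in node:
            simple = simplify(item)
            if isinstance(item, dict) and "@list" in item:
                # A nested '@list' object becomes a flat run of values.
                result.extend(simple)
            else:
                result.append(simple)
        return result
    if isinstance(node, dict):
        if "@value" in node:
            return node["@value"]
        if "@list" in node:
            return simplify(node["@list"])
        # Assumption: the short term is the last path segment of the IRI.
        return {key.rsplit("/", 1)[-1]: simplify(value) for key, value in node.items()}
    return node


internal = {"@list": [
    {
        "http://schema.org/name": [{"@value": "Shakespeare"}],
        "http://schema.org/email": [{"@value": "shakespeare@baz.net"}],
    },
    {
        "http://schema.org/name": [{"@value": "Hamilton"}],
        "http://schema.org/email": [
            {"@list": [{"@value": "hamilton@baz.net"}, {"@value": "hamilton@spam.org"}]}
        ],
    },
]}

authors = simplify(internal)
```

With the simplified view, a containment check against a plain `dict` works on ordinary Python equality.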

---

## See Also

- API reference: {class}`hermes.model.SoftwareMetadata`