diff --git a/docs/datasets/ingestion-guide/index.md b/docs/datasets/ingestion-guide/index.md new file mode 100644 index 0000000..85b3fae --- /dev/null +++ b/docs/datasets/ingestion-guide/index.md @@ -0,0 +1,203 @@ +# Dataset Ingestion Guide + +Once your database and the SciCat API server are up and running, there are several ways to ingest a dataset into SciCat, ranging from quick, one-time ingestion via `curl` (not recommended) to a fully automated ingestion setup using Python software for SciCat API access based on [`pydantic`](https://pydantic.dev/docs/validation/latest/concepts/models/) (a Python library for validating data formats), for example [`pyscicat`](https://www.scicatproject.org/pyscicat/) or SciCat's `python-sdk`. + +Another example, a Jupyter notebook in SciCatLive, can be found [here](https://github.com/SciCatProject/scicatlive/blob/main/services/jupyter/config/notebooks/pyscicat.ipynb); it shows how to authenticate, create a dataset, add datablocks and upload an attachment. + +## The `curl` command +The most reliable way to make a successful request to one of SciCat's endpoints is to learn from Swagger. Browse it to see the request syntax, obtain a valid token via the auth login or from the user settings in the frontend, and provide the correct fields, making sure you exclude forbidden ones. The following sections give skeleton examples. + +### Simple GET request +with a known `pid` as authenticated user (replace the placeholders; `URL` is the base URL including the scheme, e.g. `http://localhost:3000`): + +```bash +curl -X "GET" \ + "${URL}/api/v4/datasets/${pid}" \ + -H "accept: application/json" \ + -H "Authorization: Bearer ${TOKEN}" +``` +If you are only interested in public records, you do not need to authenticate: repeat the command without the last line, `-H "Authorization: Bearer ${TOKEN}"`, and drop the trailing backslash from the line before it.
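The note on public records above can be folded into a small shell helper that sends the `Authorization` header only when a token is supplied. This is a minimal sketch, not part of SciCat: the function name `scicat_get_dataset` and its argument order are hypothetical, and the base URL is assumed to include the scheme.

```bash
# Hypothetical helper (not a SciCat tool): GET one dataset by pid.
#   $1  base URL including the scheme, e.g. http://localhost:3000 (assumption)
#   $2  pid of the dataset
#   $3  optional token; leave it out to request a public record anonymously
scicat_get_dataset() {
  if [ -n "${3:-}" ]; then
    # Authenticated request: send the bearer token.
    curl -sS -X GET "${1}/api/v4/datasets/${2}" \
      -H "Accept: application/json" \
      -H "Authorization: Bearer ${3}"
  else
    # Anonymous request: same call without the Authorization header.
    curl -sS -X GET "${1}/api/v4/datasets/${2}" \
      -H "Accept: application/json"
  fi
}
```

For example, `scicat_get_dataset "http://localhost:3000" "${pid}"` would request a public record, and passing `"${TOKEN}"` as a third argument would send the same request as an authenticated user.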
+ +### GET identifiers of `origdatablocks` (OIDs) of a dataset + +The dataset is identified by its `pid`. We use the v4 API and first define a filter: + +```bash +export URL="valid_url" # base URL including the scheme, e.g. http://localhost:3000 +export TOKEN="valid_token" +export pid="example_pid" + +# -- GET the oids ------------------ +FILTER_JSON='{ + "where": { "pid": "'"${pid}"'" }, + "include": [ + { + "relation": "origdatablocks", + "scope": { + "fields": ["_id"], + "limit": 20, + "order": "filename ASC" + } + } + ], + "fields": ["datasetName"], + "limit": 10 +}' + +curl -G "${URL}/api/v4/datasets" \ + --data-urlencode "filter=${FILTER_JSON}" \ + -H "Accept: application/json" \ + -H "Authorization: Bearer ${TOKEN}" + +``` + +### Create a dataset +Make sure the JSON is well-formed. Some fields are mandatory; check Swagger to see which ones, and see also the [Swagger documentation](../../swagger/index.md#tips-and-tricks). Creating a dataset requires authentication, so add the `-H "Authorization: Bearer ${TOKEN}"` header used in the GET examples. +```bash +curl -X 'POST' \ + "${URL}/api/v4/datasets" \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "ownerGroup": "string", + "accessGroups": [ + "string" + ], + "instrumentGroup": "string", + "owner": "string", + "ownerEmail": "user@example.com", + "orcidOfOwner": "string", + "contactEmail": "string", + "sourceFolder": "string", + "sourceFolderHost": "string", + "size": 0, + "packedSize": 0, + "numberOfFiles": 0, + "numberOfFilesArchived": 0, + "creationTime": "2026-04-22T10:54:53.642Z", + "validationStatus": "string", + "keywords": [ + "string" + ], + "description": "string", + "datasetName": "string", + "classification": "string", + "license": "string", + "isPublished": false, + "techniques": [], + "sharedWith": [], + "relationships": [], + "datasetlifecycle": {}, + "scientificMetadata": {}, + "scientificMetadataSchema": "string", + "scientificMetadataValid": true, + "comment": "string", + "dataQualityMetrics": 0, + "principalInvestigators": [ + "string" + ], + "startTime": "2026-04-22T10:54:53.642Z", + "endTime": 
"2026-04-22T10:54:53.642Z", + "creationLocation": "string", + "dataFormat": "string", + "proposalIds": [ + "string" + ], + "sampleIds": [ + "string" + ], + "instrumentIds": [ + "string" + ], + "inputDatasets": [ + "string" + ], + "usedSoftware": [ + "string" + ], + "jobParameters": {}, + "jobLogData": "string", + "runNumber": "string", + "pid": "string", + "type": "string" +}' + +``` +Note that datafiles and attachments related to a SciCat dataset need to be POSTed separately, as they are a priori independent entities. + +### Adding datafiles +Associated datafiles are organised in datablocks. There are original datablocks (`origdatablocks`) and `datablocks`. The latter are obsolete and their functionality has fully moved to `origdatablocks`. To attach the metadata of these associated datafiles to the dataset, use e.g. `/api/v4/origdatablocks`. Note that the dataset to which the blocks belong is indicated by `datasetId`, which corresponds to the `pid` field of the dataset itself. This can be achieved with the following command (with placeholders replaced): + +```bash +curl -X 'POST' \ + 'http://localhost:3000/api/v4/origdatablocks' \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "ownerGroup": "string", + "accessGroups": [ + "string" + ], + "instrumentGroup": "string", + "size": 0, + "chkAlg": "string", + "dataFileList": [ + { + "path": "string", + "size": 0, + "time": "2026-04-22T11:28:56.870Z", + "chk": "string", + "uid": "string", + "gid": "string", + "perm": "string", + "type": "string" + } + ], + "isPublished": true, + "datasetId": "string" +}' +``` +Also note that the `ownerGroup` must match the same field in the dataset. The same holds for attachments. + +### Adding attachments + +Here too a POST request will ingest attachments to the dataset, e.g. 
like this (with placeholders replaced): + +```bash +curl -X 'POST' \ + 'http://localhost:3000/api/v4/attachments' \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "ownerGroup": "string", + "accessGroups": [ + "string" + ], + "instrumentGroup": "string", + "thumbnail": "string", + "caption": "string", + "relationships": [ + { + "targetId": "string", + "targetType": "dataset", + "relationType": "is attached to" + } + ], + "isPublished": true, + "aid": "string" +}' +``` + + +## Pythonic way: Python SDK +The Python software development kit (SDK) is generated entirely from the backend, based on its OpenAPI (Swagger) definitions. Find more info [here](https://www.piwheels.org/project/scicat-sdk-py/). + +## Pythonic way: pyscicat +`pyscicat` has the same functionality as the Python SDK but is meant to be more user-friendly; it is maintained by [dmcreyno](https://pypi.org/user/dmcreyno/). Some intuitive examples and documentation on how to ingest can be found [here](https://www.scicatproject.org/pyscicat/howto/ingest.html). + +## Pythonic way: scitacean +Scitacean is a high-level Python package for downloading and uploading datasets from and to SciCat. + +See the [documentation](https://www.scicatproject.org/scitacean/) for installation and usage instructions. + +If you need help, have a look at our [GitHub discussions](https://github.com/SciCatProject/scitacean). For questions, please start a Q&A discussion if you can't find an answer. diff --git a/docs/operator-guide/index.md b/docs/operator-guide/index.md index 3a72c07..cb1999a 100644 --- a/docs/operator-guide/index.md +++ b/docs/operator-guide/index.md @@ -14,13 +14,10 @@ SciCat covers these core aspects in a flexible way: 1. Searchable metadata fields, most common and highly specific ones. SciCat was developed by the PaNoSc community and has been successfully used more widely. This is because SciCat is highly configurable. 2. 
Provision of unique persistent identifiers not only for the internal catalogue, but also connecting to the global DOI system through e.g. ready pathway to publication via [DataCite](https://datacite.org/). -SciCat is an open source project can can be developed in accordance with our [license](https://github.com/SciCatProject/scicat-backend-next?tab=BSD-3-Clause-1-ov-file#readme). +SciCat is an open source project and can be developed in accordance with our [license](https://github.com/SciCatProject/backend?tab=BSD-3-Clause-1-ov-file#readme). ## Dataset ingestion -You find here a pythonic way of metadata ingestion using SciCats API based on the PySciCat client: -See this [how-to-ingest doc](https://www.scicatproject.org/pyscicat/howto/ingest.html) to get started. - -Another example that uses Jupyter Notebook in SciCatLive (see below) can be found [here]([https://github.com/SciCatProject/scicatlive/blob/main/services/jupyter/config/notebooks/pyscicat.ipynb) which includes how to authenticate, create a dataset, add datablocks and upload an attachement. +There are several ways of ingesting a dataset into SciCat; see the [Dataset Ingestion Guide](../datasets/ingestion-guide/index.md) for details. ## Up-to-date operator's information Generally, the [**scicatlive**](https://www.scicatproject.org/scicatlive/latest/) documentation contains an up-to-date information how to set up and run the system ```SciCat``` interfacing it with various external, site-specific services. For troublshooting issues, please refer [the User's Guide](../troubleshoot/index.md). 
diff --git a/docs/sites/img/SciCatATPSI.png b/docs/sites/img/SciCatATPSI.png new file mode 100644 index 0000000..6dfd874 Binary files /dev/null and b/docs/sites/img/SciCatATPSI.png differ diff --git a/docs/swagger/img/swagger_required_fields.png b/docs/swagger/img/swagger_required_fields.png new file mode 100644 index 0000000..82bfb7e Binary files /dev/null and b/docs/swagger/img/swagger_required_fields.png differ diff --git a/docs/swagger/index.md b/docs/swagger/index.md index 6ae34fd..ab83fd6 100644 --- a/docs/swagger/index.md +++ b/docs/swagger/index.md @@ -13,6 +13,13 @@ You need to authenticate twice: 1. Get the **SciCat token** from the user setting when logged into SciCat via the main GUI. Copy paste it into the field "Authorize" in the explorer on the top right. ![swagger login](img/swagger_getToken.png)

-2. Login on the explorer page again with the same credentials. ![swagger login](img/swaggerLogin.png) +2. Log in on the explorer page again with the same credentials, using the token. ![Swagger login](img/swaggerLogin.png) + +## Tips and Tricks + +To see which fields are required, check the "Schema" link next to the _Example Value_ in the **Response body** section. Required fields are marked with a red asterisk. ![required fields](img/swagger_required_fields.png) + + +