Provides a secure, lineage-aware, metadata-rich interface to heterogeneous datasets (PostgreSQL, object storage, filesystem). Exposes a DCAT-AP 3.0 compatible catalogue, a governed SQL query interface, and OpenLineage-integrated provenance, designed to support Digital Twins, analytical applications, and DSSC-aligned dataspace participants.
Public catalogue endpoint returning application/ld+json responses conforming to DCAT-AP 3.0.
- GET /catalogue: full catalogue as a dcat:Catalog node with embedded dcat:Dataset and dcat:Distribution nodes
- GET /catalogue/{id}: single dataset by ID
- POST /catalogue/search: filtered search by q, access_level, keywords
Each dataset includes dct:publisher, dcat:theme, dct:language, dct:spatial, dct:accrualPeriodicity, and odrl:hasPolicy on every distribution. Publisher URI is derived from governance.yaml; the fallback is settings.catalog_uri.
The downloadURL is only present on distributions with access_level: open. All other distributions require negotiating access through a dataspace connector.
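As a client-side illustration of the rule above, the sketch below collects the download URLs present in a catalogue response. It assumes a compacted JSON-LD document that uses the prefixed keys dcat:dataset, dcat:distribution and dcat:downloadURL; the actual key layout depends on the JSON-LD context the API serves, and the sample document and URNs are hypothetical.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """JSON-LD allows a single node or a list of nodes; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def open_download_urls(catalogue: dict[str, Any]) -> dict[str, list[str]]:
    """Map each dataset @id to the download URLs found on its distributions.

    Distributions without dcat:downloadURL (anything that is not open)
    contribute nothing, so their dataset maps to an empty list.
    """
    urls: dict[str, list[str]] = {}
    for dataset in _as_list(catalogue.get("dcat:dataset")):
        found: list[str] = []
        for dist in _as_list(dataset.get("dcat:distribution")):
            url = dist.get("dcat:downloadURL")
            if url is not None:
                # The URL may be a plain string or an {"@id": ...} node.
                found.append(url["@id"] if isinstance(url, dict) else url)
        urls[dataset.get("@id", "")] = found
    return urls


# Hypothetical catalogue fragment, for illustration only.
sample = {
    "@type": "dcat:Catalog",
    "dcat:dataset": [
        {
            "@id": "urn:example:meters",
            "dcat:distribution": [
                {"dcat:downloadURL": {"@id": "https://example.org/meters.csv"}},
                {"dct:title": "negotiated access only"},
            ],
        },
        {
            "@id": "urn:example:private",
            "dcat:distribution": {"dct:title": "no direct download"},
        },
    ],
}
```

A dataset that maps to an empty list has no open distribution, so access has to be negotiated through a dataspace connector.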
SQL SELECT queries over exposed datasets with strict validation, server-side pagination, hard row caps, and row-level filters.
- POST /query: accepts {"sql": "SELECT ...", "limit": 50, "offset": 0, "skip_count": false}
- Validates SQL (SELECT-only, table allowlist, function allowlist)
- Supports spatial PostGIS functions: ST_Intersects, ST_Within, ST_Contains, ST_Transform, ST_Distance, ST_SetSRID, ST_GeomFromGeoJSON, ST_Point, ST_XMin/YMin/XMax/YMax, ST_Extent
- Supports IN clauses with tuples, string/numeric/date functions, aggregates
- Enforces LIMIT/OFFSET server-side
- Configurable query timeout via QUERY_STATEMENT_TIMEOUT_MS (default 5000 ms)
- skip_count: true skips the COUNT(*) query to avoid full table scans
- Applies row-level filter plans from governance handlers (direct_user_match, rec_registry, http_in_list, table_pointer)
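A minimal client-side sketch of the request body described above, using only the standard library. The base URL, the table name and the use of a Bearer Authorization header are assumptions for illustration; the response format is not described here, so the sketch only builds requests and does not send them.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # hypothetical deployment URL


def build_query_request(
    sql: str,
    *,
    limit: int = 50,
    offset: int = 0,
    skip_count: bool = False,
    token: Optional[str] = None,
) -> Request:
    """Build (but do not send) a POST /query request."""
    body = {"sql": sql, "limit": limit, "offset": offset, "skip_count": skip_count}
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the JWT is passed as a standard Bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return Request(
        f"{BASE_URL}/query",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def page_requests(sql: str, page_size: int, pages: int) -> list[Request]:
    """Request the row count on the first page only, then skip it."""
    return [
        build_query_request(
            sql,
            limit=page_size,
            offset=page * page_size,
            skip_count=page > 0,
        )
        for page in range(pages)
    ]


# "public.meters" is a hypothetical table name.
requests_to_send = page_requests("SELECT id FROM public.meters", page_size=50, pages=3)
```

Setting skip_count to true after the first page avoids repeating the COUNT(*) scan on every page. Each request can be sent with urllib.request.urlopen or translated to an httpx call.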
When queries arrive through the EDC data plane (via an Endpoint Data Reference), the API detects the EDR context via the Edc-Contract-Agreement-Id header and switches to a dataspace-specific enforcement path.
Enabled by:
EDR_ENABLED=true
CONNECTOR_INTERNAL_URL=http://ds-connector:30001
EDR query flow:
- Detects Edc-Contract-Agreement-Id and Edc-Bpn headers
- Calls ds-connector GET /internal/agreements/{id}/status to check that the agreement is active
- If the dataset has a user_filter_column, calls ds-connector GET /internal/consent/check to retrieve the list of subject IDs the consumer has consent for
- Injects an SQL IN (subject_ids) predicate or a deny plan into the row filter pipeline
- Skips the Keycloak/OPA path entirely, because the EDC data plane already validated the EDR JWT
This path requires no JWT re-validation by dataset-api since the EDC data plane validates the bearer token before proxying.
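The EDR decision flow can be sketched as a pure function. This is an illustration under stated assumptions, not the dataset-api implementation: RowFilterPlan and edr_row_filter are hypothetical names, the two connector calls are injected as callables because their request and response shapes are not documented here, and treating an inactive agreement or an empty consent list as a deny plan is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence


@dataclass(frozen=True)
class RowFilterPlan:
    """Hypothetical plan: deny everything, or restrict a column to subject IDs."""
    deny: bool = False
    column: Optional[str] = None
    subject_ids: tuple[str, ...] = ()


def edr_row_filter(
    headers: Mapping[str, str],
    user_filter_column: Optional[str],
    agreement_is_active: Callable[[str], bool],
    consented_subjects: Callable[[str, Optional[str]], Sequence[str]],
) -> Optional[RowFilterPlan]:
    """Return None outside the EDR path, otherwise the plan to enforce."""
    # Real HTTP header lookups are case-insensitive; a plain mapping is not.
    agreement_id = headers.get("Edc-Contract-Agreement-Id")
    if agreement_id is None:
        return None  # not an EDR request: the regular auth path applies
    if not agreement_is_active(agreement_id):
        return RowFilterPlan(deny=True)  # assumption: inactive means deny
    if user_filter_column is None:
        return RowFilterPlan()  # dataset has no subject column to filter on
    subjects = consented_subjects(agreement_id, headers.get("Edc-Bpn"))
    if not subjects:
        return RowFilterPlan(deny=True)  # assumption: no consent means deny
    # Rendered by the caller as: <column> IN (<subject_ids>)
    return RowFilterPlan(column=user_filter_column, subject_ids=tuple(subjects))
```

Injecting the connector calls keeps the decision logic testable without a running ds-connector; in a real service they would wrap HTTP calls to CONNECTOR_INTERNAL_URL.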
Access levels:
- open: no authentication required; downloadURL exposed in DCAT
- internal: JWT required; ds:accessScope eq "dataspaces.query" constraint in ODRL
- restricted: JWT + contract required; ds:contractRequired eq "true" in ODRL
- secret: not exposed in catalogue or EDC
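The access levels can be summarised in a small lookup. The sketch below restates only what the list says; the names AccessLevel, odrl_constraint, in_catalogue and exposes_download_url are hypothetical, and the constraint is kept as a plain (left operand, operator, right operand) tuple instead of guessing the JSON-LD shape of the ODRL policy.

```python
from enum import Enum
from typing import Optional


class AccessLevel(str, Enum):
    OPEN = "open"
    INTERNAL = "internal"
    RESTRICTED = "restricted"
    SECRET = "secret"


# (left operand, operator, right operand), as listed per level.
Constraint = tuple[str, str, str]

_CONSTRAINTS: dict[AccessLevel, Optional[Constraint]] = {
    AccessLevel.OPEN: None,
    AccessLevel.INTERNAL: ("ds:accessScope", "eq", "dataspaces.query"),
    AccessLevel.RESTRICTED: ("ds:contractRequired", "eq", "true"),
}


def in_catalogue(level: AccessLevel) -> bool:
    """Secret datasets are not exposed in the catalogue or EDC."""
    return level is not AccessLevel.SECRET


def exposes_download_url(level: AccessLevel) -> bool:
    """Only open distributions carry a downloadURL."""
    return level is AccessLevel.OPEN


def odrl_constraint(level: AccessLevel) -> Optional[Constraint]:
    """Return the ODRL constraint documented for a catalogued access level."""
    if not in_catalogue(level):
        raise ValueError("secret datasets have no published policy")
    return _CONSTRAINTS[level]
```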
Row-level filtering via the pluggable governance handler registry. Four built-in handlers are supported:
- direct_user_match: filter by user column
- rec_registry: lookup via REC registry
- http_in_list: HTTP-based allow list
- table_pointer: table-based lookup
Users in the admins group bypass row filters entirely. Service accounts bypass the rec_registry filter.
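The bypass rules can be expressed as a filter over the handlers configured for a dataset. This is a sketch only: effective_handlers is a hypothetical helper, and how the real registry identifies groups and service accounts is not described here.

```python
BUILT_IN_HANDLERS = {
    "direct_user_match",
    "rec_registry",
    "http_in_list",
    "table_pointer",
}


def effective_handlers(
    configured: list[str],
    groups: set[str],
    is_service_account: bool,
) -> list[str]:
    """Return the handlers that still apply after the bypass rules."""
    unknown = [name for name in configured if name not in BUILT_IN_HANDLERS]
    if unknown:
        raise ValueError(f"unknown row filter handlers: {unknown}")
    if "admins" in groups:
        return []  # admins bypass row filters entirely
    if is_service_account:
        # Service accounts skip only the rec_registry filter.
        return [name for name in configured if name != "rec_registry"]
    return list(configured)
```

For example, a service account querying a dataset configured with rec_registry and direct_user_match would still be subject to direct_user_match.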
Governance overrides are supported via governance.<app_name>.yaml files merged with the base governance.yaml.
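The exact merge semantics are not specified here. One plausible reading is a recursive merge in which values from the per-application file win, sketched below; the recursive behaviour, the choice to replace lists instead of concatenating them, and the sample keys and values are all assumptions.

```python
from copy import deepcopy
from typing import Any


def merge_governance(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge override into a copy of base; override wins.

    Assumption: nested mappings are merged key by key, while scalars and
    lists from the override replace the base value wholesale.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_governance(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents of governance.yaml and governance.<app_name>.yaml,
# already parsed into dictionaries.
base = {
    "dcat": {"publisher_uri": "https://example.org/org", "themes": ["theme-a"]},
    "dataspace": {"contract_required": False},
}
app_override = {"dataspace": {"contract_required": True, "purpose": ["research"]}}

effective = merge_governance(base, app_override)
```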
- OpenLineage ingestion via Marquez
- Namespace-based dataset grouping
- Governance facets embedded in lineage events (userFilterColumn, medallion, classification)
- Provenance surfaced in catalogue metadata
- JSON Schema (2020-12) generated from physical tables
- Column-level metadata for UI and clients
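To illustrate what a schema generated from a physical table might look like, the sketch below maps a few PostgreSQL column types to JSON Schema 2020-12 keywords. The type mapping, the handling of nullable columns and the table_schema helper are assumptions; the generator used by the Dataset API may differ.

```python
from typing import Any

# Assumed mapping from PostgreSQL type names to JSON Schema fragments.
PG_TO_JSON_SCHEMA: dict[str, dict[str, Any]] = {
    "integer": {"type": "integer"},
    "bigint": {"type": "integer"},
    "numeric": {"type": "number"},
    "double precision": {"type": "number"},
    "boolean": {"type": "boolean"},
    "text": {"type": "string"},
    "character varying": {"type": "string"},
    "date": {"type": "string", "format": "date"},
    "timestamp with time zone": {"type": "string", "format": "date-time"},
    "uuid": {"type": "string", "format": "uuid"},
}


def table_schema(table: str, columns: list[tuple[str, str, bool]]) -> dict[str, Any]:
    """Build a JSON Schema from (column name, PostgreSQL type, nullable) tuples."""
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, pg_type, nullable in columns:
        # Unmapped types (for example PostGIS geometry) stay unconstrained.
        prop = dict(PG_TO_JSON_SCHEMA.get(pg_type, {}))
        if nullable:
            if "type" in prop:
                prop["type"] = [prop["type"], "null"]
        else:
            required.append(name)
        properties[name] = prop
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "title": table,
        "type": "object",
        "properties": properties,
        "required": required,
    }
```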
- GET /catalogue: DCAT-AP catalogue (application/ld+json)
- GET /catalogue/{id}: single dataset
- POST /catalogue/search: filtered search
- POST /query: governed SQL query; EDR-gated when EDR_ENABLED=true
- POST /admin/catalogue: catalogue import
- GET /health
The CLI is the primary control plane for the Dataset API:
dataset-cli --help
Main commands:
- export openlineage: extract lineage from Marquez
- export governance: export governance rules to dataset entries
- export postgres: generate catalogue YAML from PostgreSQL schema introspection
- import catalogue: validate and import dataset catalogue
- row-filter add|remove|list: manage row filters in exported YAML files
The export governance command reads governance.yaml files and propagates dcat: and dataspace: blocks to DatasetEntry records. The expose: true field on a source entry controls whether the dataset is visible in the catalogue and registered in EDC.
Dataset-api reads governance rules resolved by celine-utils GovernanceResolver. The following extended blocks are supported:
dcat: block — DCAT-AP metadata:
- publisher_uri: overrides the settings-level fallback
- themes: dcat:theme URIs (EU Publications Office vocabulary)
- language_uris: dct:language URIs
- spatial_uris: dct:spatial URIs
- accrual_periodicity: dct:accrualPeriodicity URI
- conforms_to: dct:conformsTo URI
- temporal: dct:temporal with start and end dates
dataspace: block — access control and ODRL hints:
- contract_required: adds a ds:contractRequired constraint to ODRL
- consent_required: adds a ds:consentStatus eq active constraint
- odrl_action: default action for the ODRL offer
- purpose: purpose values for ODRL purpose constraints
- medallion: data quality level
expose: true on the source entry (top-level, not under dataspace:) makes the dataset visible in the catalogue.
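The way the dcat: block and the expose flag feed the catalogue can be sketched as follows. The field names come from the lists above; the dictionary layout of a source entry, the default of expose to false when absent, and the dcat_properties helper are assumptions. The sample values use EU Publications Office authority URIs.

```python
from typing import Any, Optional

# Governance field name -> DCAT-AP property, as listed for the dcat: block.
DCAT_FIELD_MAP = {
    "themes": "dcat:theme",
    "language_uris": "dct:language",
    "spatial_uris": "dct:spatial",
    "accrual_periodicity": "dct:accrualPeriodicity",
    "conforms_to": "dct:conformsTo",
}


def dcat_properties(source_entry: dict[str, Any], catalog_uri: str) -> Optional[dict[str, Any]]:
    """Return DCAT properties for an exposed source entry, else None."""
    # expose is read from the top level of the entry, not from dataspace:.
    if not source_entry.get("expose", False):
        return None
    dcat = source_entry.get("dcat", {})
    # publisher_uri overrides the settings-level fallback (catalog_uri here).
    props: dict[str, Any] = {"dct:publisher": dcat.get("publisher_uri") or catalog_uri}
    for field, prop in DCAT_FIELD_MAP.items():
        if field in dcat:
            props[prop] = dcat[field]
    temporal = dcat.get("temporal")
    if temporal:
        props["dct:temporal"] = {
            "dcat:startDate": temporal.get("start"),
            "dcat:endDate": temporal.get("end"),
        }
    return props


# Hypothetical source entry, as it might look after parsing governance.yaml.
entry = {
    "expose": True,
    "dcat": {
        "themes": ["http://publications.europa.eu/resource/authority/data-theme/ENER"],
        "language_uris": ["http://publications.europa.eu/resource/authority/language/ENG"],
        "accrual_periodicity": "http://publications.europa.eu/resource/authority/frequency/DAILY",
        "temporal": {"start": "2024-01-01", "end": "2024-12-31"},
    },
    "dataspace": {"contract_required": True},
}

properties = dcat_properties(entry, catalog_uri="https://example.org/catalogue")
```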
- Python >= 3.12
- Async SQLAlchemy
- Pydantic v2
- FastAPI + httpx
- sqlglot-based SQL validation
Before opening a PR:
- validate all YAML definitions
- add tests for new API behaviour
- include migrations for schema changes
- keep docs in sync with API behaviour
Copyright © 2025 Spindox Labs
Licensed under the Apache License, Version 2.0.