diff --git a/docs/.nav.yml b/docs/.nav.yml new file mode 100644 index 0000000..4570a41 --- /dev/null +++ b/docs/.nav.yml @@ -0,0 +1,4 @@ +nav: + - Home: index.md + - Calypr: calypr/ + - Tools: tools/ diff --git a/docs/assets/banner.png b/docs/assets/banner.png new file mode 100644 index 0000000..2d2525d Binary files /dev/null and b/docs/assets/banner.png differ diff --git a/docs/assets/banner_fade.png b/docs/assets/banner_fade.png new file mode 100644 index 0000000..b66a8cf Binary files /dev/null and b/docs/assets/banner_fade.png differ diff --git a/docs/assets/calypr_family.png b/docs/assets/calypr_family.png new file mode 100644 index 0000000..c33677c Binary files /dev/null and b/docs/assets/calypr_family.png differ diff --git a/docs/assets/funnel.png b/docs/assets/funnel.png new file mode 100644 index 0000000..6830652 Binary files /dev/null and b/docs/assets/funnel.png differ diff --git a/docs/assets/git-drs.png b/docs/assets/git-drs.png new file mode 100644 index 0000000..3ed8d9d Binary files /dev/null and b/docs/assets/git-drs.png differ diff --git a/docs/assets/grip.png b/docs/assets/grip.png new file mode 100644 index 0000000..aeb6f9b Binary files /dev/null and b/docs/assets/grip.png differ diff --git a/docs/assets/logo.png b/docs/assets/logo.png new file mode 100644 index 0000000..61ac554 Binary files /dev/null and b/docs/assets/logo.png differ diff --git a/docs/calypr/.nav.yml b/docs/calypr/.nav.yml new file mode 100644 index 0000000..53c3696 --- /dev/null +++ b/docs/calypr/.nav.yml @@ -0,0 +1,7 @@ +title: Calypr +nav: + - Quick Start Guide: quick-start.md + - Data: data/ + - Project Management: project-management/ + - Analysis: analysis/ + - Website: website/ diff --git a/docs/calypr/analysis/.nav.yml b/docs/calypr/analysis/.nav.yml new file mode 100644 index 0000000..270dace --- /dev/null +++ b/docs/calypr/analysis/.nav.yml @@ -0,0 +1,2 @@ +nav: + - Data Querying + Gen3 SDK: query.md diff --git a/docs/workflows/query.md b/docs/calypr/analysis/query.md similarity index 94% rename from docs/workflows/query.md rename to docs/calypr/analysis/query.md index 62df636..ae38ffc 100644 --- a/docs/workflows/query.md +++ b/docs/calypr/analysis/query.md @@ -12,20 +12,20 @@ Gen3 supports API access to Files and Metadata, allowing users to download and q ## 1. Dependency and Credentials -Prior to installing, check a profile credentials. +Prior to querying, ensure your DRS remotes are configured. Test: ```bash -g3t ping +git drs remote list ``` -- will return a list of projects that a profile has access to. +- will return a list of configured DRS remotes and projects you have access to. - For new setup or renew of gen3 credentials - Follow steps to configure/re-configure a profile with credentials: - - Download an API Key from the [Profile page](https://calypr.ohsu.edu.org/identity) and save it to `~/.gen3/credentials.json` + - Download an API Key from the [Profile page](https://calypr-public.ohsu.edu/Profile) and save it to `~/.gen3/credentials.json` - ![Gen3 Profile page](../images/api-key.png) + ![Gen3 Profile page](../../images/api-key.png) - ![Gen3 Credentials](../images/credentials.png) + ![Gen3 Credentials](../../images/credentials.png) ## 2. 
Install diff --git a/docs/calypr/data/.nav.yml b/docs/calypr/data/.nav.yml new file mode 100644 index 0000000..0263e70 --- /dev/null +++ b/docs/calypr/data/.nav.yml @@ -0,0 +1,5 @@ +nav: + - FHIR for Researchers: introduction.md + - Adding FHIR Metadata: adding-metadata.md + - Managing Metadata: managing-metadata.md + - Integrating Your Data: integration.md diff --git a/docs/workflows/metadata.md b/docs/calypr/data/adding-metadata.md similarity index 73% rename from docs/workflows/metadata.md rename to docs/calypr/data/adding-metadata.md index f44204a..5d24cd6 100644 --- a/docs/workflows/metadata.md +++ b/docs/calypr/data/adding-metadata.md @@ -1,22 +1,21 @@ # Adding FHIR metadata - ## Background Adding files to a project is a two-step process: -1. Adding file metadata entries to the manifest (see [adding files](add-files.md)) +1. Adding file metadata entries to the manifest 2. Creating FHIR-compliant metadata using the manifest -This page will guide you through the second step of generating FHIR metadata in your g3t project. To understand the FHIR data model, see [FHIR for Researchers](../data-model/introduction.md) +This page will guide you through the second step of generating FHIR metadata in your `git-drs` project. To understand the FHIR data model, see [FHIR for Researchers](introduction.md) -## Generating FHIR Data using g3t +## Generating FHIR Data using git-drs -To submit metadata from the manifest to the platform, that metadata needs to be converted into FHIR standard. We will use the file metadata entries we had created during the `g3t add` on our data files. +To submit metadata from the manifest to the platform, that metadata needs to be converted into FHIR standard. We will use the file metadata entries we had created during the `git drs add` on our data files. ### Creating metadata files using the manifest -Using the file metadata entries created by the `g3t add` command, `g3t meta init` creates FHIR-compliant metadata files in the `META/` directory, where each file corresponds to a [FHIR resource](https://build.fhir.org/resourcelist.html). At a minimum, this directory will create: +Using the file metadata entries created by the `git drs add` command, `forge meta init` creates FHIR-compliant metadata files in the `META/` directory, where each file corresponds to a [FHIR resource](https://build.fhir.org/resourcelist.html). At a minimum, this directory will create: | File | Contents | |--------------------------|----------------------------| @@ -32,19 +31,18 @@ Depending on if a `patient` or `specimen` flag was specified, other resources ca * measurements (Observation) -* This command will create a skeleton metadata file for each file added to the project using the `patient`, `specimen`, `task`, and/or `observation` flags specified by the `g3t add` command. +* This command will create a skeleton metadata file for each file added to the project using the `patient`, `specimen`, `task`, and/or `observation` flags specified by the `git drs add` command. * You can edit the metadata to map additional fields. * The metadata files can be created at any time, but the system will validate them before the changes are committed. * **Note:** If an existing file is modified, it won't get automatically staged - For instance, if `DocumentReference.json` is already created and it has to be updated to reflect an additional file, this change is not automatically staged. 
- - Make sure to either `git add META/` or use the `-a` flag in `g3t commit` to ensure that your FHIR metadata changes are staged. + - Make sure to either `git add META/` to ensure that your FHIR metadata changes are staged. ### Example To add a cram file that's associated with a subject, sample, and particular task ```sh -g3t add myfile.cram --patient P0 --specimen P0-BoneMarrow --task_id P0-Sequencing -g3t meta init +git add myfile.cram --patient P0 --specimen P0-BoneMarrow --task_id P0-Sequencing ``` This will produce metadata with the following relationships: @@ -54,8 +52,8 @@ This will produce metadata with the following relationships: When the project is committed, the system will validate new or changed records. You may validate the metadata on demand by: ```sh -$ g3t meta validate --help -Usage: g3t meta validate [OPTIONS] DIRECTORY +$ forge meta validate --help +Usage: forge meta validate [OPTIONS] DIRECTORY Validate FHIR data in DIRECTORY. @@ -73,16 +71,15 @@ All FHIR metadata is housed in the `META/` directory. The convention of using a ## Supplying your own FHIR metadata -In some cases, it might be useful to supply your own FHIR metadata without using `g3t add` to create any file metadata. In that case, adding metadata would take on the following flow: +In some cases, it might be useful to supply your own FHIR metadata without using `git drs add` to create any file metadata. In that case, adding metadata would take on the following flow: 1. Initialize your project 2. Copy external FHIR data as `.ndjson` files to your `META/` directory 3. `git add META/` -4. `g3t commit -m "supplying FHIR metadata"` +4. `git commit -m "supplying FHIR metadata"` This process would be useful for individuals who want to use the system to track relations between metadata but might not necessarily want to connect their actual data files to the system. ## Next Steps -* See the tabular metadata section for more information on working with metadata. -* See the commit and push section for more information on publishing. \ No newline at end of file +See the [data management section](managing-metadata.md) for more information on working with metadata and publishing. \ No newline at end of file diff --git a/docs/calypr/data/git-drs.md b/docs/calypr/data/git-drs.md new file mode 100644 index 0000000..3a3c724 --- /dev/null +++ b/docs/calypr/data/git-drs.md @@ -0,0 +1,34 @@ +# Git-DRS + +!!! note + The tools listed here are under development and may be subject to change. + +## Overview + +Use case: As an analyst, in order to share data with collaborators, I need a way to create a project, upload files and associate those files with metadata. The system should be capable of adding files in an incremental manner. + +The following guide details the steps a data contributor must take to submit a project to the CALYPR data commons. + +### Core Concepts + +> In a Gen3 data commons, a semantic distinction is made between two types of data: "data files" and "metadata". [more](https://gen3.org/resources/user/dictionary/#understanding-data-representation-in-gen3) + +* **Data File**: Information like tabulated data values in a spreadsheet or a fastq/bam file containing DNA sequences. The contents are not exposed to the API as queryable properties. +* **Metadata**: Variables that help to organize or convey additional information about corresponding data files so they can be queried. + +## 1. Setup + +CALYPR project management is handled using standard Git workflows. 
you will need the **Large File Storage (LFS)** plugin to track genomic data files and the **Git-DRS** plugin to interface with CALYPR's storage and indexing systems. + +Visit the [Quick Start Guide](../quick-start.md) for detailed, OS-specific installation instructions for these tools. + +| Tool | Purpose | +| :--- | :--- | +| **git-drs** | Manages large file tracking, storage, and DRS indexing. | +| **forge** | Handles metadata validation, transformation (ETL), and publishing. | +| **data-client** | Administrative tool for managing [collaborators and access requests](../../tools/data-client/access_requests.md). | +{: .caption } + +## Git DRS Workflows + +For complete Git DRS documentation including project initialization, file management, and upload workflows, see the [Git DRS Quick Start](../../tools/git-drs/quickstart.md). \ No newline at end of file diff --git a/docs/data-model/integration.md b/docs/calypr/data/integration.md similarity index 87% rename from docs/data-model/integration.md rename to docs/calypr/data/integration.md index b3f9608..575b8a8 100644 --- a/docs/data-model/integration.md +++ b/docs/calypr/data/integration.md @@ -5,9 +5,9 @@ Converting tabular data (CSV, TSV, spreadsheet, database table) into FHIR (Fast As you create a upload files, you can tag them with identifiers which by default will create minimal, skeleton graph. -You can retrieve that data using the g3t command line tool, and update the metadata to create a more complete graph representing your study. +You can retrieve that data using the [git-drs](../../tools/git-drs/index.md) command line tool, and update the metadata using [forge](../../tools/forge/index.md) to create a more complete graph representing your study. -You may choose to work with the data in it's "native" json format, or convert it to a tabular format for integration. The system will re-convert tabular data back to json for submittal. +You may choose to work with the data in its "native" JSON format, or convert it to a tabular format for integration. The system will re-convert tabular data back to JSON for submittal. The process of integrating your data into the graph involves several steps: @@ -24,12 +24,12 @@ The process of integrating your data into the graph involves several steps: * Normalize Data: Split the spreadsheet data into FHIR-compliant resources. * Step 4: Utilize provided FHIR Tooling or Libraries - * FHIR Tooling: Use `g3t meta dataframe ` and associated libraries to support data conversion and validation. - * Validation: Use `g3t meta validate` to validate the transformed data against FHIR specifications to ensure compliance and accuracy. + * FHIR Tooling: Use `forge meta` and associated libraries to support data conversion and validation. + * Validation: Use `forge validate` to validate the transformed data against FHIR specifications to ensure compliance and accuracy. * Step 5: Import into FHIR-Compatible System - * Load Data: Use `g3t commit` to load the transformed data into the calypr system. - * Testing and Verification: Use `g3t push` to ensure your data appears correctly in the portal and analysis tools. + * Load Data: Use `git commit` and `git push` to manage your local data state. + * Testing and Verification: Ensure your data appears correctly in the portal and analysis tools after a successful push. * Step 6: Iterate and Refine * Review and Refine: Check for any discrepancies or issues during the import process. Refine the conversion process as needed. 
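Taken together, the steps above map onto a small, repeatable command loop. The sketch below is illustrative only: it assumes the `forge meta init` and `forge validate` subcommands described elsewhere in this documentation, and the paths and commit message are placeholders.

```bash
# Illustrative integration loop (directory names and commit message are placeholders)
forge meta init        # regenerate skeleton FHIR resources in META/ from your files
forge validate META    # Step 4: check the converted resources against the FHIR schemas
git add META/          # stage the updated metadata
git commit -m "Integrate tabular study data as FHIR"
git push               # Step 5: record the new state, then verify in the portal
```

Publishing the metadata to the CALYPR portal itself is handled separately with `forge publish`; see the publishing section for details.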
@@ -76,7 +76,7 @@ Identifiers in FHIR references typically include the following components: [see > A string, typically numeric or alphanumeric, that is associated with a single object or entity within a given system. Typically, identifiers are used to connect content in resources to external content available in other frameworks or protocols. -System: Indicates the system or namespace to which the identifier belongs. By default the namespace is `http://calypr.ohsu.edu.org/`. +System: Indicates the system or namespace to which the identifier belongs. By default the namespace is `http://calypr-public.ohsu.edu/`. Value: The actual value of the identifier within the specified system. For instance, a lab controlled subject identifier or a specimen identifier. @@ -109,4 +109,4 @@ By using identifiers in references, FHIR ensures that data can be accurately lin > A reference to a document of any kind for any purpose. [see more](https://hl7.org/fhir/documentreference.html) -See the metadata workflow section for more information on how to create and upload metadata. +See the [data management section](managing-metadata.md) for more information on how to create and upload metadata. diff --git a/docs/data-model/introduction.md b/docs/calypr/data/introduction.md similarity index 74% rename from docs/data-model/introduction.md rename to docs/calypr/data/introduction.md index 96599e2..06f5d72 100644 --- a/docs/data-model/introduction.md +++ b/docs/calypr/data/introduction.md @@ -5,6 +5,12 @@ Given all of the intricacies healthcare and experimental data, we use Fast Healt ## What is FHIR? +> In a Gen3 data commons, a semantic distinction is made between two types of data: "data files" and "metadata". [more](https://gen3.org/resources/user/dictionary/#understanding-data-representation-in-gen3) + +A "data file" could be information like tabulated data values in a spreadsheet or a fastq/bam file containing DNA sequences. The contents of the file are not exposed to the API as queryable properties, so the file must be downloaded to view its content. + +"Metadata" are variables that help to organize or convey additional information about corresponding data files so that they can be queried via the Gen3 data commons’ API or viewed in the Gen3 data commons’ data exploration tool. In a Gen3 data dictionary, variable names are termed "properties", and data contributors provide the values for these pre-defined properties in their data submissions. + In an era where healthcare information is abundant yet diverse and often siloed, FHIR emerges as a standard, empowering research analysts to navigate, aggregate, and interpret health data seamlessly. This guide aims to unravel the intricacies of FHIR, equipping research analysts with the knowledge and tools needed to harness the potential of interoperable healthcare data for insightful analysis and impactful research outcomes in the context of CALYPR collaborations. 
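To make the "metadata" side of this distinction concrete, each FHIR resource is stored as one JSON object per line in a `META/*.ndjson` file. The record below is purely hypothetical (the file name, identifier system, and values are assumptions), but it shows the shape of what becomes queryable.

```bash
# Peek at a single FHIR record (hypothetical file and values, for illustration only)
head -n 1 META/Patient.ndjson
# {"resourceType": "Patient", "id": "P0", "identifier": [{"use": "official", "system": "http://calypr-public.ohsu.edu/myproject", "value": "P0"}]}
```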
## Graph Model @@ -21,7 +27,7 @@ The following "file focused" example illustrates how CALYPR uses FHIR resources Examine [resource](https://www.hl7.org/fhir/resource.html) definitions [here](http://www.hl7.org/fhir/resource.html): -* Details on [uploaded files](https://calypr.github.io/workflows/upload/) are captured as [DocumentReference](http://www.hl7.org/fhir/documentreference.html) +* Details on uploaded files are captured as [DocumentReference](http://www.hl7.org/fhir/documentreference.html) * DocumentReference.[subject](https://www.hl7.org/fhir/documentreference-definitions.html#DocumentReference.subject) indicates who or what the document is about: * Can simply point to the [ResearchStudy](https://hl7.org/fhir/researchstudy.html), to indicate the file is part of the study @@ -31,6 +37,4 @@ Examine [resource](https://www.hl7.org/fhir/resource.html) definitions [here](ht Each resource has at least one study controlled [official](https://hl7.org/fhir/codesystem-identifier-use.html#identifier-use-official) [Identifier](https://hl7.org/fhir/datatypes.html#Identifier). Child resources have [Reference](http://www.hl7.org/fhir/references.html) fields to point to their parent. - - diff --git a/docs/calypr/data/managing-metadata.md b/docs/calypr/data/managing-metadata.md new file mode 100644 index 0000000..e6b85b6 --- /dev/null +++ b/docs/calypr/data/managing-metadata.md @@ -0,0 +1,182 @@ +# Managing Metadata + +Metadata in Calypr is formatted using the Fast Healthcare Interoperability Resources (FHIR) schema. If you choose to bring your own FHIR newline delimited json data, you will need to create a directory called “META” in your git-drs repository in the same directory that you initialized your git-drs repository, and place your metadata files in that directory. +The META/ folder contains newline-delimited JSON (.ndjson) files representing FHIR resources describing the project, its data, and related entities. Large files are tracked using Git LFS, with a required correlation between each data file and a DocumentReference resource. This project follows a standardized structure to manage large research data files and associated FHIR metadata in a version-controlled, DRS and FHIR compatible format. +Each file must contain only one type of FHIR resource type, for example META/ResearchStudy.ndjson only contains research study resource typed FHIR objects. The name of the file doesn’t have to match the resource type name, unless you bring your own document references, then you must use DocumentReference.ndjson. For all other FHIR file types this is simply a good organizational practice for organizing your FHIR metadata. + +## META/ResearchStudy.ndjson + +* The File directory structure root research study is based on the 1st Research Study in the document. This research study is the research study that the autogenerated document references are connected to. Any additional research studies that are provided will be ignored when populating the miller table file tree. +* Contains at least one FHIR ResearchStudy resource describing the project. +* Defines project identifiers, title, description, and key attributes. + +## META/DocumentReference.ndjson + +* Contains one FHIR DocumentReference resource per Git LFS-managed file. +* Each `DocumentReference.content.attachment.url` field: + * Must exactly match the relative path of the corresponding file in the repository. 
+ * Example: + +```json +{ + "resourceType": "DocumentReference", + "id": "docref-file1", + "status": "current", + "content": [ + { + "attachment": { + "url": "data/file1.bam", + "title": "BAM file for Sample X" + } + } + ] +} +``` + +Place your custom FHIR `.ndjson` files in the `META/` directory: + +```bash +# Copy your prepared FHIR metadata +cp ~/my-data/patients.ndjson META/ +cp ~/my-data/observations.ndjson META/ +cp ~/my-data/specimens.ndjson META/ +cp ~/my-data/document-references.ndjson META/ +``` + +## Other FHIR data + +\[TODO More intro text here\] + +* Patient.ndjson: Participant records. +* Specimen.ndjson: Biological specimens. +* ServiceRequest.ndjson: Requested procedures. +* Observation.ndjson: Measurements or results. +* Other valid FHIR resource types as required. + +Ensure your FHIR `DocumentReference` resources reference the DRS URIs: + +Example `DocumentReference` linking to S3 file: + +```json +{ + "resourceType": "DocumentReference", + "id": "doc-001", + "status": "current", + "content": [{ + "attachment": { + "url": "drs://calypr-public.ohsu.edu/your-drs-id", + "title": "sample1.bam", + "contentType": "application/octet-stream" + } + }], + "subject": { + "reference": "Patient/patient-001" + } +} +``` + + +--- + +## Validating Metadata + +To ensure that the FHIR files you have added to the project are correct and pass schema checking, you can use the [Forge tool](../../tools/forge/index.md). + +```bash +forge validate +``` + +Successful output: + +✓ Validating META/patients.ndjson... OK +✓ Validating META/observations.ndjson... OK +✓ Validating META/specimens.ndjson... OK +✓ Validating META/document-references.ndjson... OK +All metadata files are valid. + +Fix any validation errors and re-run until all files pass. + + +### Forge Data Quality Assurance Command Line Commands + +If you have provided your own FHIR resources there are two commands that might be useful to you for ensuring that your FHIR metadata will appear on the CALYPR data platform as expected. These commands are validate and check-edge + +**Validate:** +```bash +forge validate META +# or +forge validate META/DocumentReference.ndjson +``` +Validation checks if the provided directory or file will be accepted by the CALYPR data platform. It catches improper JSON formatting and FHIR schema errors. + +**Check-edge:** +```bash +forge check-edge META +# or +forge validate META/DocumentReference.ndjson +``` +Check-edge ensures that references within your files (e.g., a Patient ID in an Observation) connect to known vertices and aren't "orphaned". + +### Validation Process + +#### 1\. Schema Validation + +* Each .ndjson file in META/ (like ResearchStudy.ndjson, DocumentReference.ndjson, etc.) is read line by line. +* Every line is parsed as JSON and checked against the corresponding FHIR schema for that resourceType. +* Syntax errors, missing required fields, or invalid FHIR values trigger clear error messages with line numbers. + +#### 2\. Mandatory Files Presence + +* Confirms that: + * ResearchStudy.ndjson exists and has at least one valid record. + * DocumentReference.ndjson exists and contains at least one record. +* If either is missing or empty, validation fails. + +#### 3\. One-to-One Mapping of Files to DocumentReference + +* Scans the working directory for Git LFS-managed files in expected locations (e.g., data/). +* For each file, locates a corresponding DocumentReference resource whose content.attachment.url matches the file’s relative path. 
+* Validates: + * All LFS files have a matching DocumentReference. + * All DocumentReferences point to existing files. + +#### 4\. Project-level Referential Checks + +* Validates that DocumentReference resources reference the same ResearchStudy via relatesTo or other linking mechanisms. +* If FHIR resources like Patient, Specimen, ServiceRequest, Observation are present, ensures: + * Their id fields are unique. + * DocumentReference correctly refers to those resources (e.g., via subject or related fields). + +#### 5\. Cross-Entity Consistency + +* If multiple optional FHIR .ndjson files exist: + * Confirms IDs referenced in one file exist in others. + * Detects dangling references (e.g., a DocumentReference.patient ID that's not in Patient.ndjson). + +--- + +#### ✅ Example Error Output + +ERROR META/DocumentReference.ndjson line 4: url "data/some\_missing.bam" does not resolve to an existing file +ERROR META/Specimen.ndjson line 2: id "specimen-123" referenced in Observation.ndjson but not defined + +--- + +#### 🎯 Purpose & Benefits + +* Ensures all files and metadata are in sync before submission. +* Prevents submission failures due to missing pointers or invalid FHIR payloads. +* Enables CI integration, catching issues early in the development workflow. + +--- + +#### Validation Requirements + +Automated tools or CI processes must: + +* Verify presence of META/ResearchStudy.ndjson with at least one record. +* Verify presence of META/DocumentReference.ndjson with one record per LFS-managed file. +* Confirm every DocumentReference.url matches an existing file path. +* Check proper .ndjson formatting. + +--- \ No newline at end of file diff --git a/docs/calypr/index.md b/docs/calypr/index.md new file mode 100644 index 0000000..d0ee9f2 --- /dev/null +++ b/docs/calypr/index.md @@ -0,0 +1,43 @@ +# CALYPR Platform + +Welcome to the **CALYPR Platform**. CALYPR is a next-generation genomic data science ecosystem designed to bridge the gap between massive, centralized data commons and the agile, distributed workflows of modern researchers. + + +## The CALYPR Philosophy + +Traditional data repositories often create data silos where information is easy to store but difficult to move, version, or integrate with external tools. CALYPR breaks these silos by embracing **Interoperability**, **Reproducibility**, and **Scalability**. + +### 1. Interoperability (GA4GH Standards) +CALYPR is built from the ground up on [GA4GH](https://www.ga4gh.org/) standards. By using the **Data Repository Service (DRS)** and **Task Execution Service (TES)**, CALYPR ensures that your data and workflows can move seamlessly between different cloud providers and on-premises high-performance computing (HPC) clusters. +The data system is based on the Fast Healthcare Interoperability Resources (FHIR) standard. + +### 2. Reproducibility (Git-like Data Management) +The core of the CALYPR experience is **Git-DRS**. We believe that data management should feel as natural as code management. Git-DRS allows you to track, version, and share massive genomic datasets using the familiar `git` commands, ensuring that every analysis is backed by a specific, immutable version of the data. + +### 3. Scalability (Hybrid Cloud Infrastructure) +Whether you are working with a few genomes or petabyte-scale cohorts, CALYPR's architecture—powered by **Gen3**—scales to meet your needs. Our hybrid cloud approach allows for secure data storage in AWS while leveraging your local compute resources when necessary. 
+ +--- + +## How it Works: The Connected Commons + +CALYPR acts as the "connective tissue" between your research environment and the cloud: + +* **Data Commons ([Gen3](https://gen3.org)):** Provides the robust backend for metadata management, indexing, and authentication. +* **Version Control ([Git-DRS](../tools/git-drs/index.md)):** Manages the check-in and check-out operations for large files, allowing you to treat remote DRS objects as local files. +* **Metadata Orchestration ([Forge](../tools/forge/index.md)):** Streamlines the validation, publishing, and harmonizing of genomic metadata. +* **Compute ([Funnel](../tools/funnel/index.md)):** Executes complex pipelines across distributed environments using standardized task definitions. +* **Graph Insights ([GRIP](../tools/grip/index.md)):** Enables high-performance queries across heterogeneous datasets once integrated. + +--- + +!!! info "Private Beta" + CALYPR platform is currently in a private beta phase. We are actively working with a select group of research partners to refine the platform. If you encounter any issues or have feature requests, please reach out to the team. The individual [tools](../tools/index.md) are available for public use. + +--- + +## Next Steps +To get started: + +1. **[Quick Start Guide](quick-start.md):** The fastest way to install tools and start tracking data. +2. **[Data & Metadata](data/managing-metadata.md):** Learn how to associate your biological metadata with the files you've uploaded. \ No newline at end of file diff --git a/docs/calypr/project-management/.nav.yml b/docs/calypr/project-management/.nav.yml new file mode 100644 index 0000000..ce24a4e --- /dev/null +++ b/docs/calypr/project-management/.nav.yml @@ -0,0 +1,4 @@ +nav: + - Project Customization: custom-views.md + - Publishing project: publishing-project.md + - Calypr Admin: calypr-admin/ diff --git a/docs/calypr/project-management/calypr-admin/.nav.yml b/docs/calypr/project-management/calypr-admin/.nav.yml new file mode 100644 index 0000000..bfa1166 --- /dev/null +++ b/docs/calypr/project-management/calypr-admin/.nav.yml @@ -0,0 +1,3 @@ +nav: + - Add users: add-users.md + - Role Based Access Control: approve-requests.md diff --git a/docs/workflows/add-users.md b/docs/calypr/project-management/calypr-admin/add-users.md similarity index 72% rename from docs/workflows/add-users.md rename to docs/calypr/project-management/calypr-admin/add-users.md index 86cedac..8589726 100644 --- a/docs/workflows/add-users.md +++ b/docs/calypr/project-management/calypr-admin/add-users.md @@ -3,7 +3,7 @@ ## Granting user access to a project Once a project has been created you will have full access to it. -The project owner can add additional users to the project using the `g3t collaborator` commands. +The project owner can add additional users to the project using the `data-client collaborator` commands. There are two ways to request the addition additional users to the project: @@ -12,15 +12,14 @@ There are two ways to request the addition additional users to the project: To give another user full access to the project, run the following: ```sh -g3t collaborator add --write user-can-write@example.com +data-client collaborator add [project_id] user-can-write@example.com --write ``` Alternatively, to give another user read access only (without the ability to upload to the project), run the following: ```sh -g3t collaborator add user-read-only@example.com +data-client collaborator add [project_id] user-read-only@example.com ``` ## 2. 
Approvals -In order to implement these requests, **an authorized user will need to sign** the request before the user can use the remote repository. See `g3t collaborator approve --help -` +In order to implement these requests, **an authorized user will need to sign** the request before the user can use the remote repository. See `data-client collaborator approve --help` diff --git a/docs/workflows/approve-requests.md b/docs/calypr/project-management/calypr-admin/approve-requests.md similarity index 90% rename from docs/workflows/approve-requests.md rename to docs/calypr/project-management/calypr-admin/approve-requests.md index 0c836d0..2577139 100644 --- a/docs/workflows/approve-requests.md +++ b/docs/calypr/project-management/calypr-admin/approve-requests.md @@ -16,8 +16,8 @@ * Ony users with the steward role can approve and sign a request ```text -g3t collaborator approve --help -Usage: g3t collaborator approve [OPTIONS] +./data-client collaborator approve --help +Usage: ./data-client collaborator approve [OPTIONS] Sign an existing request (privileged). @@ -40,9 +40,9 @@ Note: This example uses the ohsu program, but the same process applies to all pr ```text ## As an admin, I need to grant data steward privileges add the requester reader and updater role on a program to an un-privileged user -g3t collaborator add add data_steward_example@.edu --resource_path /programs//projects --steward +./data-client collaborator add add data_steward_example@.edu --resource_path /programs//projects --steward # As an admin, approve that request -g3t collaborator approve +./data-client collaborator approve diff --git a/docs/calypr/project-management/create-project.md b/docs/calypr/project-management/create-project.md new file mode 100644 index 0000000..474f144 --- /dev/null +++ b/docs/calypr/project-management/create-project.md @@ -0,0 +1,17 @@ + + +# Create a Project (gen3 \+ GitHub) + +Status: *Manual and DevOps‑only at the moment* + +The standard way to start a new Calypr project is to create a Git repository that will hold your FHIR NDJSON files and a set of Git‑LFS tracked files. + +For now you will need to ask a Calypr management team to create the project and provide you with the following: + +* GitHub repository URL +* Calypr project ID +* Initial git config settings (branch, remotes, etc.) + +Future Work: Automate this step with a CLI wizard. + +TODO – Write the DevOps‑only project creation guide. diff --git a/docs/calypr/project-management/custom-views.md b/docs/calypr/project-management/custom-views.md new file mode 100644 index 0000000..afebfdf --- /dev/null +++ b/docs/calypr/project-management/custom-views.md @@ -0,0 +1,186 @@ + +# Project Customization + +## Dataframer Configuration + +The dataframer is used to render the FHIR `.ndjson` files into the tabular space that is used in the explorer page table. If you want to customize your project’s explorer page you will need to specify database field names that are defined in the dataframer, thus you will need to run the dataframer on your data ahead of time in order to know these field names. + +See below steps for setting up `git-drs` and running dataframer commands: + +```bash +python -m venv venv +source venv/bin/activate +pip install gen3-tracker==0.0.7rc27 +git-drs meta dataframe DocumentReference +``` + +The explorer config is a large JSON document. One of the keys of note is `guppyConfig`, which is used to specify what index is to be used for the explorer page tab that you have defined. 
Notice that when you run `git-drs meta dataframe` it outputs: + +```text +Usage: git-drs meta dataframe [OPTIONS] {Specimen|DocumentReference|ResearchSubject|MedicationAdministration|GroupMember} [DIRECTORY_PATH] [OUTPUT_PATH] + +Try 'git-drs meta dataframe --help' for help. +``` + +Where `Specimen`, `DocumentReference`, etc. are the supported indices that can be run in the dataframe and defined in the `explorerConfig` under the `guppyConfig` key. + +Note that the `guppyConfig` index names use `snake_case` formatting whereas the dataframer uses uppercase for each word. + +## 5.2 Explorer Page Configuration + +Forge currently supports customization of explorer pages by routing to: `https://commons-url/Explorer/[program]-[project]` + +Explorer Configs can be customized by running `forge config init` and then filling out the template configuration. + +The explorer config is a JSON document with a top-level key called `explorerConfig` which can host a list of "tab" configs. The tabs (e.g., "Patient", "Specimen", and "File") each denote an element in this config. + +In this example, the `guppyConfig.dataType` is set to `document_reference`. We ran the `DocumentReference` dataframer command earlier to select database field names from the generated output. + +```json +{ + "explorerConfig": [ + { + "tabTitle": "TEST", + "guppyConfig": { + "dataType": "document_reference", + "nodeCountTitle": "file Count", + "fieldMapping": [] + }, + "filters": { + "tabs": [ + { + "title": "Filters", + "fields": [ + "document_reference_assay", + "document_reference_creation", + "project_id" + ], + "fieldsConfig": { + "project_id": { + "field": "project_id", + "label": "Project Id", + "type": "enum" + }, + "assay": { + "field": "document_reference_assay", + "label": "Assay", + "type": "enum" + }, + "creation": { + "field": "document_reference_creation", + "label": "Creation", + "type": "enum" + } + } + } + ] + }, + "table": { + "enabled": true, + "fields": [ + "project_id", + "document_reference_assay", + "document_reference_creation" + ], + "columns": { + "project_id": { + "field": "project_id", + "title": "Project ID" + }, + "assay": { + "field": "document_reference_assay", + "title": "Assay" + }, + "creation": { + "field": "document_reference_creation", + "title": "Creation" + } + } + }, + "dropdowns": {}, + "buttons": [], + "loginForDownload": false + } + ] +} +``` + +And here is what this config looks like in the frontend: + +Note that since there is only one element in the `explorerConfig` there is only one tab called “TEST” in the explorer page which is housed as `tabTitle` in the config. + +#### Filters + +The next important section is the `filters` key. This defines the filters column on the left-hand side of the page. Within that block there is the `fields` key and the `fieldsConfig` key. The `fields` key is used to specify the names of the fields that you want to filter on. In order to get the names of the fields you will need to install `git-drs` via PyPI and run a dataframer command which essentially creates this explorer table dataframe, so that you can configure in the frontend what parts of this dataframe you want to be shown. + +Now, going back to the configuration, these fields that were specified come directly from the column names at the top of the excel spreadsheet that are generated from running the dataframer command. 
You can choose any number / combination of these column names, but note that in any list that is specified in this config, the elements in the list are rendered in the frontend in that exact order that is specified. + +The `fieldsConfig` key is a decorator dict that is optional but can be applied to every filter that is specified. Notice that the `label` key is used to denote the preferred display name that is to be used for the database key name that was taken from the dataframer excel spreadsheet. + +#### Table + +The last import section is the `table` key. Like with the filters structure, `fields` is used to denote all of the database column names that should be displayed in the explorer table. Also similar to the filters structure, `columns` is where you specify the label that you want displayed for the database field. In this case it is `field` is the db name and `title` is the label display name. + +The rest of the config is templating that is needed for the explorer page to load, but not anything that is directly useful. + +#### Shared Filters + +Imagine you want to filter on multiple index facets, similar to a RESTFUL join operation. Like for example give me all of the PATIENTS who belong on this `project_id` that also have a specimen that matches this `project_id`. + +This is known as “shared filtering” because you are making the assumption that you want to carry your filters over to the new node when you click a new tab. This only works if there exists an equivalent field on the other index/tab, so it must be configurable and is not applicable for all normal filterable fields. + +It sounds complex but setting it up isn't that complex at all. Simply specify a filter that you want to do shared filtering on, ie: `project_id`, then specify the indices and the field names for each index that the field is shared on. For our purposes `project_id` is known as `project_id` on all indices but this may not always be the case, and proper inspection or knowledge of the dataset may be required to determine this. + +Then you simply specify each “shared filter” as a JSON dictionary list element under the field that you have specified and you have successfully setup shared filtering on that field. In order to define additional shared filters, it is as simple as adding another key under the `defined` dictionary key and specifying a list of indices and fields that the shared filter can be joined on. See the example below for details. + +```json +"sharedFilters": { + "defined": { + "project_id": [ + { "index": "research_subject", "field": "project_id" }, + { "index": "specimen", "field": "project_id" }, + { "index": "document_reference", "field": "project_id" } + ] + } +} +``` + +## 5.3 Configurator + +Now that you have the basics down this frontend GUI might start to make some sense. Notice this is the exact same config that was shown earlier, except it is customizable via the GUI so that you don’t need to wrestle with the JSON to get a working, correctly formatted config. Notice also that there is a 3rd column here: Charts. Charts are defined very simply: + +```json +"charts": { + "specimen_collection": { + "chartType": "fullPie", + "title": "Metastasis Site" + } +} +``` + +Just provide the DB column name as the parent key, and then the chart type and the label title of the chart. The chart will generate a binned histogram counts style chart. Currently only `fullPie`, `bar` or `donut` type charts are supported but in the future other chart types might be added. 
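As with filters and tables, the parent keys used in `charts` must be real column names from the dataframer output. If you are unsure which names are available, one approach (sketched below) is to re-run the dataframer for the index you are configuring and read the column headers from the spreadsheet it produces.

```bash
# Regenerate the explorer dataframe for the index you are configuring
git-drs meta dataframe DocumentReference
# The column headers in the generated spreadsheet are the field names usable in
# filters, tables, and charts (e.g. document_reference_assay, document_reference_creation).
```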
+ +As stated earlier, configs have a very specific naming convention: `[program]-[project].json` and will be rejected if you do not have write permissions on the program, project configuration that is specified or if the name of the configuration is not of that form. You can also load any configs that you have access to too, an edit them and then repost them. + +All customizable explorer pages are viewable when routing to `/Explorer/[program]-[project]` assuming that all database fields that are specified exist in the db. + +# **Advanced Docs** + +--- + +# **🧬 Managing Identifiers with CALYPR Meta** + +This guide explains how to manage dataset identifiers, both manually and through the command line, and how those identifiers integrate with Git-LFS and git-drs for reproducible, FAIR-compliant data management. + +### 🧭 Introduction: Where This Fits in Your Research Data Lifecycle + +This document applies once you’ve begun organizing data files for a research study and are ready to make their metadata machine-readable and FAIR-compliant. Researchers typically progress through several stages: + +1. **Files only**: you start with a set of raw or processed data files associated with a research study. +2. **Files with identifiers**: each file is linked to key entities such as Patients, Specimens, or Assays using `META/identifiers.tsv`. +3. **Files with identifiers + attributes**: you begin adding structured tabular metadata (e.g., `Patient.tsv`, `Specimen.tsv`, `Observation.tsv`) describing those entities. +4. **Files with complete FHIR metadata**: you can now transform these TSVs into fully-formed FHIR resources (`Patient.ndjson`, `Specimen.ndjson`, etc.) suitable for sharing, indexing, and integration with clinical or genomic data platforms. + +This guide focuses on stage 2, 3 — converting well-structured TSV metadata files into standard FHIR resources, while validating that every entity’s identifier corresponds to the entries defined in `META/identifiers.tsv`. + +--- \ No newline at end of file diff --git a/docs/calypr/project-management/publishing-project.md b/docs/calypr/project-management/publishing-project.md new file mode 100644 index 0000000..84e1cec --- /dev/null +++ b/docs/calypr/project-management/publishing-project.md @@ -0,0 +1,53 @@ +## 4.6: Publishing changes to Gen3 + +In order to publish metadata to CALYPR, regardless of whether you have provided your own metadata or you are simply uploading files to the system, you will need to publish your data. Publishing data is done with the **Forge** command line utility. + +Since Forge relies on your GitHub repository to know which files should have metadata records on the CALYPR platform, a GitHub Personal Access Token (PAT) is needed. To create your own PAT, login to [https://source.ohsu.edu](https://source.ohsu.edu), go to Settings > Tokens, and click "Generate new token". Make sure the token has `clone` permissions at the minimum. + +To publish, run: +```bash +forge publish [your_PAT] +``` + +### Publishing Process + +To publish your metadata, run the following command: + +```bash +forge publish +``` + +What happens: + +1. Forge validates your GitHub Personal Access Token +2. Packages repository information +3. Submits a Sower job to Gen3 +4. Gen3 ingests FHIR metadata from META/ +5. 
Metadata becomes searchable in CALYPR + +Successful output: + +✓ Personal Access Token validated +✓ Repository information packaged +✓ Sower job submitted: job-id-12345 +✓ Metadata ingestion started + +Check job status: forge status \ +Get all job ids: forge list + +📖 More details: [Forge](../../tools/forge/index.md) + +--- + +### Verification Checklist + +After completing the workflow: + +* LFS pointer files in Git repository +* DRS records created +* DRS URIs point to S3 locations +* Metadata files validated successfully +* Sower job completed without errors +* Data searchable in CALYPR web interface +* Can query patients/observations in Gen3 +* Files accessible via S3 (no duplicate storage) \ No newline at end of file diff --git a/docs/calypr/quick-start.md b/docs/calypr/quick-start.md new file mode 100644 index 0000000..ae936ba --- /dev/null +++ b/docs/calypr/quick-start.md @@ -0,0 +1,152 @@ +--- +title: Quick Start Guide +--- + +# CALYPR Quick Start Guide + +Welcome to CALYPR! This guide will walk you through the essential workflow for managing and analyzing genomic data on the CALYPR platform. + +## What is CALYPR? + +CALYPR is a genomic data science platform that combines the best of cloud-based data commons with familiar version control tools. Think of it as "Git for genomic data" — you can version, track, and collaborate on massive datasets while maintaining full reproducibility. + +**Key Benefits:** + +- **Version Control**: Track genomic data files like you track code +- **Interoperability**: Built on GA4GH standards (DRS, TES) for seamless data sharing +- **Scalability**: From a few samples to petabyte-scale cohorts +- **Reproducibility**: Every analysis tied to specific versions of data and metadata + +## What You'll Learn + +This guide covers the essential CALYPR workflow: + +1. **Getting access** to the CALYPR platform +2. **Uploading data files** with Git-DRS +3. **Adding metadata** with Forge +4. **Running analyses** with Funnel (optional) +5. **Querying data** with GRIP (optional) + +## Prerequisites + +Before you begin, make sure you have: + +- **Git** installed on your system ([download](https://git-scm.com)) +- **Access to CALYPR** - contact your project administrator for an account +- **Basic command-line experience** - familiarity with terminal/shell commands + +--- + +## The CALYPR Workflow + +### Step 1: Get Your API Credentials + +To interact with CALYPR, you need API credentials from the Gen3 data commons. You'll download these from your profile page on the CALYPR portal as a JSON file. + +API credentials expire after 30 days, so you'll need to download fresh credentials regularly. + +**Learn More:** [Download Gen3 API Credentials](../tools/git-drs/quickstart.md#download-gen3-api-credentials) — Step-by-step instructions with screenshots + +--- + +### Step 2: Upload Your Data Files (Git-DRS) + +**Git-DRS** is CALYPR's data file management tool. It extends Git LFS to version and track large genomic files while automatically registering them with the DRS (Data Repository Service). + +Git-DRS lets you: +- Version large data files (BAM, FASTQ, VCF, etc.) like you version code +- Track file lineage and share data with collaborators +- Automatically register files with DRS for global discovery + +When you push files, Git-DRS uploads them to S3, registers DRS records in Gen3, and stores only lightweight pointer files in your Git repository. 
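For orientation, a typical first upload might look like the sketch below. It borrows the remote-configuration flags shown in the download and troubleshooting pages; the project ID, bucket, and file names are placeholders, so treat the linked Git-DRS documentation as the authoritative workflow.

```bash
# Illustrative first upload (project, bucket, and file names are placeholders)
git drs init
git drs remote add gen3 calypr --project <project_id> --bucket <bucket> --cred ~/.gen3/credentials.json
git add data/sample1.bam              # large files become lightweight LFS pointer files in Git
git commit -m "Add sample1 alignment"
git push                              # uploads the object to S3 and registers its DRS record
```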
+ +**Learn More:** [Git-DRS Complete Documentation](../tools/git-drs/quickstart.md) — Installation, setup, and detailed workflows + +--- + +### Step 3: Add Metadata (Forge) + +**Forge** is CALYPR's metadata management tool. It validates and publishes structured metadata about your samples, making your data discoverable and queryable. + +Forge helps you: +- Validate metadata against Gen3 data models +- Publish metadata to make your data searchable +- Manage relationships between samples, subjects, and files + +While you can upload files before metadata, adding metadata early maximizes the value of your data by making it discoverable and queryable. + +**Learn More:** [Forge Documentation](../tools/forge/index.md) — Installation, validation, and publishing workflows + +--- + +### Step 4: Run Analysis Workflows (Funnel) — Optional + +**Funnel** is CALYPR's task execution service. It runs computational workflows across cloud and HPC environments using the GA4GH Task Execution Service (TES) standard. + +Funnel enables you to: +- Execute containerized workflows (Docker/Singularity) +- Manage resources across AWS, GCP, and HPC clusters +- Track task status and integrate with workflow engines (Nextflow, WDL) + +Funnel is typically used for production pipelines and large-scale analysis. For exploratory work, you might run analyses locally first. + +**Learn More:** [Funnel Documentation](../tools/funnel/index.md) — Task definitions, execution, and cluster integration + +--- + +### Step 5: Query Your Data (GRIP) — Optional + +**GRIP** (Graph Resource Integration Platform) enables powerful graph-based queries across integrated datasets. + +GRIP allows you to: +- Query relationships between samples, subjects, and files +- Perform complex graph traversals and aggregations +- Run federated queries across data commons + +GRIP is most useful after you've integrated metadata and established relationships between entities. + +**Learn More:** [GRIP Documentation](../tools/grip/index.md) — Query syntax, graph traversals, and examples + +--- + +## Next Steps + +Now that you understand the basic CALYPR workflow, here are some recommended next steps: + +### 📚 Dive Deeper + +- **[Data Management](data/git-drs.md)** - Advanced Git-DRS workflows +- **[Metadata Guide](data/adding-metadata.md)** - Data modeling and metadata best practices +- **[Project Management](project-management/create-project.md)** - Creating and managing CALYPR projects + +### 🔧 Tool Documentation + +- **[Git-DRS Complete Guide](../tools/git-drs/quickstart.md)** - Comprehensive Git-DRS documentation +- **[Forge Reference](../tools/forge/index.md)** - Metadata validation and publishing +- **[Funnel Workflows](../tools/funnel/index.md)** - Task execution and pipeline management +- **[GRIP Queries](../tools/grip/index.md)** - Graph-based data queries + +### 🆘 Get Help + +- **[Troubleshooting](troubleshooting.md)** - Common issues and solutions +- **[Platform Overview](index.md)** - Learn more about CALYPR architecture + +### 💬 Community + +CALYPR is in active development. Have questions or feedback? Reach out to the CALYPR team or your project administrator. 
+ +--- + +## Summary + +You've learned the essential CALYPR workflow: + +✅ **Access** - Get Gen3 API credentials +✅ **Upload** - Use Git-DRS to version and track data files +✅ **Annotate** - Use Forge to add and validate metadata +✅ **Analyze** - Use Funnel to run computational workflows (optional) +✅ **Query** - Use GRIP to explore data relationships (optional) + +Each tool builds on the previous step, creating a complete data lifecycle from upload to analysis. Start with the basics (access, upload, annotate) and add advanced features as your needs grow. + +Happy analyzing! 🧬 diff --git a/docs/calypr/troubleshooting.md b/docs/calypr/troubleshooting.md new file mode 100644 index 0000000..9b69555 --- /dev/null +++ b/docs/calypr/troubleshooting.md @@ -0,0 +1,61 @@ +# Troubleshooting & FAQ + +Common issues encountered when working with the CALYPR platform and its tools. + +--- + +## Metadata is "out of date" + +**Issue:** When attempting to push or validate, you receive a warning that `DocumentReference.ndjson` or other metadata files are out of date. + +**Resolution:** This typically happens when you have added new data files using `git add` or `git-drs` but haven't updated the corresponding FHIR metadata to reflect these changes. + +1. **Regenerate Metadata:** Use Forge to synchronize your metadata with the current repository state: + ```bash + forge meta init + ``` +2. **Stage Changes:** Ensure the updated metadata files in the `META/` directory are staged: + ```bash + git add META/ + ``` +3. **Commit:** + ```bash + git commit -m "Update metadata for new files" + ``` + +--- + +## No new files to index + +**Issue:** Running `git push` or a registration command returns "No new files to index." + +**Resolution:** This indicates that the current state of your files is already synchronized with the remote server. If you need to force an update to the metadata or re-register existing files, use the specific tool's overwrite flag (e.g., `git drs push --overwrite`). + +--- + +## Uncommitted changes preventing push + +**Issue:** You receive an error about "Uncommitted changes found" when trying to push data. + +**Resolution:** Standard Git rules apply. If you've run commands that modify the `META/` directory, you must commit those changes before pushing. +```bash +git add META/ +git commit -m "Refining metadata" +git push +``` + +--- + +## Authentication Errors + +**Issue:** Commands fail with "Unauthorized" or "401" errors. + +**Resolution:** +1. **Check Credentials:** Ensure your `credentials.json` is valid and hasn't expired. You can download a fresh key from the [CALYPR Profile Page](https://calypr-public.ohsu.edu/Profile). +2. **Verify Configuration:** Run `git drs remote list` to ensure the correct endpoint and project ID are configured for your current profile. +3. **Token Refresh:** If using temporary tokens, ensure they are still active. + +--- + +!!! tip "Getting Help" + If your issue isn't listed here, please reach out to our team at [support@calypr.org](mailto:support@calypr.org) or search the individual tool documentation in the [Tools Section](../tools/index.md). 
diff --git a/docs/calypr/website/.nav.yml b/docs/calypr/website/.nav.yml new file mode 100644 index 0000000..f2af265 --- /dev/null +++ b/docs/calypr/website/.nav.yml @@ -0,0 +1,2 @@ +nav: + - Explore: portal-explore.md diff --git a/docs/calypr/website/download-single-file.png b/docs/calypr/website/download-single-file.png new file mode 100644 index 0000000..46f3125 Binary files /dev/null and b/docs/calypr/website/download-single-file.png differ diff --git a/docs/calypr/website/explorer.png b/docs/calypr/website/explorer.png new file mode 100644 index 0000000..d8e89e5 Binary files /dev/null and b/docs/calypr/website/explorer.png differ diff --git a/docs/calypr/website/file-manifest.png b/docs/calypr/website/file-manifest.png new file mode 100644 index 0000000..0e0cb31 Binary files /dev/null and b/docs/calypr/website/file-manifest.png differ diff --git a/docs/calypr/website/portal-download.md b/docs/calypr/website/portal-download.md new file mode 100644 index 0000000..68117ff --- /dev/null +++ b/docs/calypr/website/portal-download.md @@ -0,0 +1,31 @@ +--- +title: Download +--- + +There are two main ways to download files: + +1. Individually through the browser or through the command line with the `gen3-client` +2. Batch downloads through the command line with `git-drs` and `git-lfs` + +This guide will walk you through both methods below. + +--- + +### Batch Download with Git-DRS + +To retrieve the actual data files described by a repository, you must clone the repository and use `git lfs pull`. + +```bash +# 1. Clone the repository +git clone +cd + +# 2. Initialize Git-DRS +git drs init + +# 3. Add the DRS remote (see Quick Start for details) +git drs remote add gen3 calypr --project --bucket --cred ~/.gen3/credentials.json + +# 4. Pull the files +git lfs pull +``` \ No newline at end of file diff --git a/docs/calypr/website/portal-explore.md b/docs/calypr/website/portal-explore.md new file mode 100644 index 0000000..231c253 --- /dev/null +++ b/docs/calypr/website/portal-explore.md @@ -0,0 +1,9 @@ + +# Explore + +The `push` command uploads the metadata associated with the project and makes the files visible on the [Explorer page](https://calypr-public.ohsu.edu/Explorer). + +![Gen3 File Explorer](./explorer.png) + + +See the [portal download page](portal-download.md) for more information on downloading files from the portal. diff --git a/docs/data-model/metadata.md b/docs/data-model/metadata.md deleted file mode 100644 index c12298a..0000000 --- a/docs/data-model/metadata.md +++ /dev/null @@ -1,67 +0,0 @@ -# Creating and Uploading Metadata - -### Create Metadata - -Create basic, minimal metadata for the project: - -```sh -gen3_util meta create /tmp/$PROJECT_ID - -ls -1 /tmp/$PROJECT_ID -DocumentReference.ndjson -Observation.ndjson -Patient.ndjson -ResearchStudy.ndjson -ResearchSubject.ndjson -Specimen.ndjson -Task.ndjson -``` - -### Retrieve existing metadata -Retrieve the existing metadata from the portal. - -```sh - -gen3_util meta cp - -TODO -``` - -### Integrate your data - -Convert the FHIR data to tabular form. - -```sh -TODO -``` - -Convert the tabular data to FHIR. - -```sh -TODO -``` - -Validate the data - -```sh -$ gen3_util meta validate --help -Usage: gen3_util meta validate [OPTIONS] DIRECTORY - - Validate FHIR data in DIRECTORY. 
- -``` - - - -### Publish the Metadata - -```text -# copy the metadata to the bucket and publish the metadata to the portal -gen3_util meta publish /tmp/$PROJECT_ID -``` - -## View the Files - -This final step uploads the metadata associated with the project and makes the files visible on the [Explorer page](https://calypr.ohsu.edu.org/explorer). - -![Gen3 File Explorer](./explorer.png) diff --git a/docs/getting-started.md b/docs/getting-started.md deleted file mode 100644 index 6aa94c3..0000000 --- a/docs/getting-started.md +++ /dev/null @@ -1,24 +0,0 @@ ---- -title: Getting Started ---- - -{% include '/note.md' %} - -Use case: As an analyst, in order to share data with collaborators, I need a way to create a project, upload files and associate those files with metadata. The system should be capable of adding files in an incremental manner. - -The following guide details the steps a data contributor must take to submit a project to the CALYPR data commons. - -> In a Gen3 data commons, a semantic distinction is made between two types of data: "data files" and "metadata". [more](https://gen3.org/resources/user/dictionary/#understanding-data-representation-in-gen3) - -A "data file" could be information like tabulated data values in a spreadsheet or a fastq/bam file containing DNA sequences. The contents of the file are not exposed to the API as queryable properties, so the file must be downloaded to view its content. - -"Metadata" are variables that help to organize or convey additional information about corresponding data files so that they can be queried via the Gen3 data commons’ API or viewed in the Gen3 data commons’ data exploration tool. In a Gen3 data dictionary, variable names are termed "properties", and data contributors provide the values for these pre-defined properties in their data submissions. - -For the CALYPR data commons, we have created a data dictionary based on the FHIR data standard. The data dictionary is available [here](https://github.com/bmeg/iceberg-schema-tools) - -## Examples - -> In a Gen3 Data Commons, programs and projects are two administrative nodes in the graph database that serve as the most upstream nodes. A program must be created first, followed by a project. Any subsequent data submission and data access, along with control of access to data, is done through the project scope. -> [more](https://gen3.org/resources/operator/#6-programs-and-projects) - -For the following examples, we will use the `calypr` program with a project called `myproject`, please use the `g3t projects ls` command to verify what programs you have access to. 
diff --git a/docs/images/api-key.png b/docs/images/api-key.png index bf27e88..6adaf3e 100644 Binary files a/docs/images/api-key.png and b/docs/images/api-key.png differ diff --git a/docs/images/credentials-json.png b/docs/images/credentials-json.png deleted file mode 100644 index c35016c..0000000 Binary files a/docs/images/credentials-json.png and /dev/null differ diff --git a/docs/images/credentials.png b/docs/images/credentials.png index 89f68c7..d1b783e 100644 Binary files a/docs/images/credentials.png and b/docs/images/credentials.png differ diff --git a/docs/images/file-manifest-download copy.png b/docs/images/file-manifest-download copy.png new file mode 100644 index 0000000..d8e89e5 Binary files /dev/null and b/docs/images/file-manifest-download copy.png differ diff --git a/docs/images/file-manifest-download.png b/docs/images/file-manifest-download.png deleted file mode 100644 index 9cc4717..0000000 Binary files a/docs/images/file-manifest-download.png and /dev/null differ diff --git a/docs/images/gripper_architecture.png b/docs/images/gripper_architecture.png new file mode 100644 index 0000000..b35e3db Binary files /dev/null and b/docs/images/gripper_architecture.png differ diff --git a/docs/images/login.png b/docs/images/login.png deleted file mode 100644 index 37af6c9..0000000 Binary files a/docs/images/login.png and /dev/null differ diff --git a/docs/images/profile.png b/docs/images/profile.png index c69fb96..411f06a 100644 Binary files a/docs/images/profile.png and b/docs/images/profile.png differ diff --git a/docs/index.md b/docs/index.md index 7ad8184..e375741 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,16 +1,116 @@ -# Welcome to the CALYPR Documentation +--- +template: home.html +hide: + - navigation + - toc + - header +--- -![CALYPR site](./images/website_header.png) + + -This documentation will walk you through the steps for submitting data to the [CALYPR Data Commons](https://calypr.ohsu.edu.org). +
+

CALYPR Platform

+

+ A scalable, hybrid cloud infrastructure designed for the demands of modern genomics research. + Built on open-source standards, CALYPR provides GA4GH-compliant tools for seamless data integration, analysis, and biological insights. Based on the Gen3 Data Commons architecture, CALYPR empowers analysts to manage large-scale genomic datasets and integrate data to build new predictive models. +

+ +
-## About -The [gen3-tracker](https://github.com/CALYPR/gen3_util/) (g3t) command line utility is a combination of tools that facilitate data sharing on the CALYPR platform. It allows you to create a unified data project, upload files, and associate those files with metadata in an incremental manner. Submitted data with g3t gives you all the benefits the data platform offers: data indexing, data exploration, consolidated access, and more! +
-The following guide details the steps a data contributor must take to submit a project to the CALYPR data commons. -## Getting Started +
+

Built on Open Standards

+
+
+
TES
+
Task Execution Service. A GA4GH standard API for running batch tasks across distributed and federated compute environments.
+
+
+
DRS
+
Data Repository Service. A GA4GH standard for data discovery and access.
+
+
+
FHIR
+
Fast Healthcare Interoperability Resources. An HL7 standard for exchanging patient health information.
+
+
+
JSON Hyper-Schema
+
JSON Schema extended with hyperlinks, used to represent complex, high-quality graph data.
+
+
+ +
-To navigate through each page, use pages list in the top left or using the navigation arrow on the bottom left and right! Otherwise, check out our [requirements](requirements.md) page to get started. +
-![Main landing page for CALYPR IDP](./images/main-page.png) +
+ + + + +
+
+ GRIP +
+
+

GRIP

+

Graph-based data integration for complex research datasets.

+

High-performance graph query engine that provides a unified interface across MongoDB, SQL, and key-value stores. Ideal for complex relational discovery in genomics.

+ Learn more +
+
+ + +
+
+ Funnel +
+
+

Funnel

+

Distributed task execution for petabyte-scale pipelines.

+

Standardized batch computing using the GA4GH TES API. Run Docker-based tasks seamlessly across AWS, Google Cloud, and Kubernetes at any scale.

+ Learn more +
+
+ + +
+
+ Git-DRS +
+
+

Git-DRS

+

Secure data repository system with version control.

+

Manage large-scale genomic data with integrated versioning and metadata management, ensuring reproducibility and data integrity throughout research cycles.

+ Learn more +
+
+
+ +
+ +
+

Join the Beta

+

+ CALYPR is currently in private beta. If you are interested in early access or a demonstration of the platform, please reach out to us at + support@calypr.org. In the meantime, you can explore our GitHub repository and get access to all of our open source tools. +

+
diff --git a/docs/note.md b/docs/note.md deleted file mode 100644 index 1b65636..0000000 --- a/docs/note.md +++ /dev/null @@ -1,2 +0,0 @@ -!!! note - The tools listed here are under development and may be subject to change. diff --git a/docs/requirements.md b/docs/requirements.md deleted file mode 100644 index 62eb900..0000000 --- a/docs/requirements.md +++ /dev/null @@ -1,154 +0,0 @@ ---- -title: Requirements ---- - -# Requirements - -## 1. Download gen3-client - -gen3-client to upload and download files to the [gen3 platform](https://gen3.org/). Since the CALYPR is built on gen3, gen3-client is used in gen3-tracker (g3t) for the same purpose. See the instructions below for how to download gen3-client for your operating system. - -### Installation Instructions - - -=== "macOS" - 1. Download the [macOS version](https://github.com/CALYPR/cdis-data-client/releases/latest/download/gen3-client-macos.pkg) of the gen3-client. - 2. Run the gen3-client pkg, following the instructions in the installer. - 3. Open a terminal window. - 4. Create a new gen3 directory: `mkdir ~/.gen3` - 5. Move the executable to the gen3 directory: `mv /Applications/gen3-client ~/.gen3/gen3-client` - 6. Change file permissions: `chown $USER ~/.bash_profile` - 7. Add the gen3 directory to your PATH environment variable: `echo 'export PATH=$PATH:~/.gen3' >> ~/.bash_profile` - 8. Refresh your PATH: `source ~/.bash_profile` - 9. Check that the program is downloaded: run `gen3-client` - - -=== "Linux" - 1. Download the [Linux version](https://github.com/CALYPR/cdis-data-client/releases/latest/download/gen3-client-linux-amd64.zip) of the gen3-client. - 2. Unzip the archive. - 3. Open a terminal window. - 4. Create a new gen3 directory: `mkdir ~/.gen3` - 5. Move the unzipped executable to the gen3 directory: `~/.gen3/gen3-client` - 6. Change file permissions: `chown $USER ~/.bash_profile` - 7. Add the gen3 directory to your PATH environment variable: `echo 'export PATH=$PATH:~/.gen3' >> ~/.bash_profile` - 8. Refresh your PATH: `source ~/.bash_profile` - 9. Check that the program is downloaded: run `gen3-client` - -=== "Windows" - 1. Download the [Windows version](https://github.com/CALYPR/cdis-data-client/releases/latest/download/gen3-client-windows-amd64.zip) of the gen3-client. - 2. Unzip the archive. - 3. Add the unzipped executable to a directory, for example: `C:\Program Files\gen3-client\gen3-client.exe` - 4. Open the Start Menu and type "edit environment variables". - 5. Open the option "Edit the system environment variables". - 6. In the "System Properties" window that opens up, on the "Advanced" tab, click on the "Environment Variables" button. - 7. In the box labeled "System Variables", find the "Path" variable and click "Edit". - 8. In the window that pops up, click "New". - 9. Type in the full directory path of the executable file, for example: `C:\Program Files\gen3-client` - 10. Click "Ok" on all the open windows and restart the command prompt if it is already open by entering cmd into the start menu and hitting enter. - -## 2. Configure a gen3-client Profile with Credentials - -To use the gen3-client, you need to configure `gen3-client` with API credentials downloaded from the [Profile page](https://calypr.ohsu.edu.org/Profile). - -![Gen3 Profile page](images/profile.png) - -Log into the website. 
Then, download the access key from the portal and save it in the standard location `~/.gen3/credentials.json` - -![Gen3 Credentials](images/credentials.png) - -From the command line, run the gen3-client configure command: - -=== "Example Command" - ```sh - gen3-client configure \ - --profile= \ - --cred= \ - --apiendpoint=https://calypr.ohsu.edu.org - ``` - -=== "Mac/Linux" - ```sh - gen3-client configure \ - --profile=calypr \ - --cred=~/Downloads/credentials.json \ - --apiendpoint=https://calypr.ohsu.edu.org - ``` -=== "Windows" - ```sh - gen3-client configure \ - --profile=calypr \ - --cred=C:\Users\demo\Downloads\credentials.json \ - --apiendpoint=https://calypr.ohsu.edu.org - ``` - -Run the `gen3-client auth` command to confirm you configured a profile with the correct authorization privileges. Then, to list your access privileges for each project in the commons you have access to: - -```sh -gen3-client auth --profile=calypr - -# 2023/12/05 15:07:12 -# You have access to the following resource(s) at https://calypr.ohsu.edu.org: -# 2023/12/05 15:07:12 /programs/calypr/projects/myproject... -``` - -## 3. Install gen3-tracker (g3t) - -The `gen3-tracker (g3t)` tool requires a working Python 3 installation no older than [Python 3.12](https://www.python.org/downloads/release/python-3120/). Check your version with `python3 --version`. If needed, download a compatible version of [Python 3](https://www.python.org/downloads/). - -Optionally, create a virtual environment using venv or conda for g3t. We will use [venv](https://docs.python.org/3/library/venv.html) in the instructions. - -``` -python3 -m venv venv; source venv/bin/activate -``` - -Run the following in your working directory to install the latest version of g3t from the Python Package Index: - -```sh -pip install gen3-tracker -``` - -You can verify the installation was successful by then running the `g3t` command with the expected output being the [latest version](https://pypi.org/project/gen3-tracker/#history): - -```sh -g3t --version -``` - -### Upgrading g3t - -This version should match the latest version on the [PyPi page](https://pypi.org/project/gen3-tracker/). If it is out of date, run the following to upgrade your local version: - -```sh -pip install -U gen3-tracker -``` - -### Configuration - -g3t uses the [gen3-client](https://gen3.org/resources/user/gen3-client/#2-configure-a-profile-with-credentials) configuration flow. - -After configuration, you can either specify the `--profile` or set the `G3T_PROFILE=profile-name` environmental variable. - -### Testing the configuration - -The command `g3t ping` will confirm that the access key and gen3-client have been configured correctly - -```sh -g3t --profile calypr ping -``` - -A successful ping will output something like: - -> msg: 'Configuration OK: Connected using profile:calypr' -> -> endpoint: https://calypr.ohsu.edu.org -> -> username: someone@example.com -> -> bucket_programs: -> -> ... -> -> your_access: -> -> ... - -With g3t completely set up, see the [Quickstart Guide](/workflows/quick-start-guide) for how to upload and download data to a project. 
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css index 638e7e8..5c9df1b 100644 --- a/docs/stylesheets/extra.css +++ b/docs/stylesheets/extra.css @@ -1,10 +1,29 @@ +@import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap'); + /* Prevent the '$' character in shell blocks from being copied */ .gp { user-select: none; } -h1, h2, h3 { - font-weight: bold !important; +:root { + --md-primary-fg-color: #0057B7; + --md-primary-fg-color--light: #4698CA; + --card-background: #ffffff; + --card-shadow: 0 4px 20px rgba(0, 0, 0, 0.08); + --card-shadow-hover: 0 12px 30px rgba(0, 0, 0, 0.12); + --text-muted: #64748b; + --transition: all 0.3s cubic-bezier(0.4, 0, 0.2, 1); +} + +body { + font-family: 'Inter', -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif; +} + +h1, +h2, +h3 { + font-weight: 700 !important; + letter-spacing: -0.02em; } /* horizontal dividers */ @@ -13,13 +32,273 @@ h1, h2, h3 { display: block; width: 100%; height: 1px; - background-color: lightgrey; + background-color: #e2e8f0; margin-top: 0.5em; margin-bottom: 1.5em; } -/* colors */ -:root > * { - --md-primary-fg-color: #0057B7; - --md-primary-fg-color--light: #4698CA; +/* Hero section container */ +.md-hero { + background-image: linear-gradient(135deg, var(--md-primary-fg-color), #1e40af); + background: + linear-gradient(rgba(0, 48, 102, 0.4), rgba(0, 48, 102, 0.4)), + url("../assets/banner_fade.png"); + background-size: cover; + background-position: center; + color: white; + padding: 6rem 0; + clip-path: ellipse(150% 100% at 50% 0%); +} + +.md-hero__inner { + display: flex; + flex-direction: column; + align-items: center; + text-align: center; +} + +.md-hero__content h1 { + font-size: 3rem; + font-weight: 800; + margin-bottom: 1rem; + text-shadow: 0 2px 10px rgba(0, 0, 0, 0.1); +} + +.md-hero__content div { + font-size: 1.25rem; + max-width: 40rem; + margin-bottom: 2rem; + opacity: 0.95; + font-weight: 400; +} + +/* Product Grid */ +.product-grid { + display: grid; + grid-template-columns: repeat(auto-fit, minmax(320px, 1fr)); + gap: 2rem; + max-width: 1100px; + margin: 0rem auto 4rem; + padding: 0 1rem; + position: relative; + z-index: 10; +} + +/* Professional Product Card */ +.product-card { + background: var(--card-background); + border-radius: 12px; + box-shadow: var(--card-shadow); + overflow: hidden; + transition: var(--transition); + border: 1px solid rgba(226, 232, 240, 0.8); + display: flex; + flex-direction: column; +} + +.product-card:hover { + transform: translateY(-8px); + box-shadow: var(--card-shadow-hover); + border-color: var(--md-primary-fg-color--light); +} + +.product-card--featured:hover { + transform: none; +} + +/* Featured / Umbrella Card */ +.product-card--featured { + grid-column: 1 / -1; + flex-direction: column; + min-height: 500px; + background: white; +} + +.product-card--featured .product-card__image-wrap { + width: 100%; + height: 650px; + border-bottom: none; + border-right: none; + padding: 0; + background: #f8fafc; +} + +/* Gradient transition from image to text */ +.product-card--featured .product-card__image-wrap::after { + content: ""; + position: absolute; + bottom: 0; + left: 0; + width: 100%; + height: 80px; + background: linear-gradient(to bottom, transparent, white); + z-index: 2; +} + +.product-card--featured .product-card__image { + width: 100%; + height: 100%; + max-width: none; + max-height: none; + object-fit: cover; + object-position: center 20%; + z-index: 1; +} + +.product-card--featured 
.product-card__content { + padding: 2rem 4rem 4rem; + text-align: center; + align-items: center; +} + +.product-card--featured .product-card__title { + font-size: 2.8rem; + margin-bottom: 1rem; +} + +.product-card--featured .product-card__summary { + font-size: 1.4rem; + margin-bottom: 1.5rem; + max-width: 800px; +} + +.product-card--featured .product-card__description { + font-size: 1.1rem; + max-width: 750px; + margin-bottom: 2rem; +} + +.product-card__image-wrap { + position: relative; + width: 100%; + height: 180px; + background: #f8fafc; + border-bottom: 1px solid #f1f5f9; + display: flex; + align-items: center; + justify-content: center; + overflow: hidden; +} + +.product-card__image { + max-width: 80%; + max-height: 80%; + object-fit: contain; + transition: var(--transition); +} + +.product-card:hover .product-card__image { + transform: scale(1.05); +} + +.product-card--featured:hover .product-card__image { + transform: scale(1); +} + +.product-card__content { + padding: 1.5rem; + flex-grow: 1; + display: flex; + flex-direction: column; +} + +.product-card__title { + color: #0f172a; + font-size: 1.5rem; + font-weight: 700; + margin-bottom: 0.75rem; +} + +.product-card__summary { + color: #334155; + font-size: 0.95rem; + font-weight: 500; + margin-bottom: 0.75rem; + line-height: 1.4; +} + +.product-card__description { + color: var(--text-muted); + font-size: 0.875rem; + line-height: 1.6; + margin-bottom: 1.5rem; + flex-grow: 1; } + +.product-card__link { + display: inline-flex; + align-items: center; + color: var(--md-primary-fg-color); + font-weight: 600; + font-size: 0.95rem; + text-decoration: none; + transition: var(--transition); +} + +.product-card__link i { + margin-left: 0.25rem; + transition: var(--transition); +} + +.product-card__link:hover { + color: #1a73e8; +} + +.product-card__link:hover i { + transform: translateX(4px); +} + +.product-card--featured .product-card__link:hover i { + transform: none; +} + +/* Responsive */ +@media screen and (max-width: 768px) { + .product-grid { + grid-template-columns: 1fr; + margin-top: 2rem; + } + + .md-hero__content h1 { + font-size: 2.25rem; + } + + .product-card--featured { + flex-direction: column; + } + + .product-card--featured .product-card__image-wrap { + width: 100%; + height: 200px; + border-right: none; + border-bottom: none; + padding: 2rem 2rem 0; + } + + .product-card--featured .product-card__content { + padding: 1.5rem; + } + + .product-card--featured .product-card__title { + font-size: 1.75rem; + } +} + +/* Sidebar Navigation Styles */ +/* Target top-level section headers, nested collapsible headers, and index links */ +.md-nav__item--section>.md-nav__link, +.md-nav__item--section>label.md-nav__link, +.md-nav__item--section>.md-nav__link--index, +.md-nav__link[for], +.md-nav__item--nested>.md-nav__link, +.md-nav__link--index { + font-weight: 700 !important; + color: #000000 !important; + opacity: 1 !important; +} + +/* Adjust top-level sidebar items that are links to follow a similar weight */ +.md-nav--primary>.md-nav__list>.md-nav__item>.md-nav__link { + font-weight: 700; + color: #000000; +} \ No newline at end of file diff --git a/docs/tools/.nav.yml b/docs/tools/.nav.yml new file mode 100644 index 0000000..ddc8347 --- /dev/null +++ b/docs/tools/.nav.yml @@ -0,0 +1 @@ +title: Tools diff --git a/docs/tools/data-client/.nav.yml b/docs/tools/data-client/.nav.yml new file mode 100644 index 0000000..6c70a75 --- /dev/null +++ b/docs/tools/data-client/.nav.yml @@ -0,0 +1,6 @@ +title: Data Client +nav: + - Welcome: 
index.md + - Authentication: authentication.md + - Data Management: data_management.md + - Access Requests: access_requests.md diff --git a/docs/tools/data-client/access_requests.md b/docs/tools/data-client/access_requests.md new file mode 100644 index 0000000..e5750bd --- /dev/null +++ b/docs/tools/data-client/access_requests.md @@ -0,0 +1,68 @@ +--- +title: Access & Collaboration +--- + +# Access & Collaboration + +The `data-client` includes tools to manage user access and collaboration through the **Requestor** service. This allows project administrators to invite users (collaborators) to projects and manage access requests. + +## Managing Collaborators + +The `collaborator` command suite is used to add or remove users from projects. + +### Add a User + +To give a user access to a project: + +```bash +./data-client collaborator add [project_id] [username] --profile= +``` + +- **project_id**: Format `program-project` (e.g., `SEQ-Res`). +- **username**: The user's email address. + +**Options:** +- `--write` (`-w`): Grant write access. +- `--approve` (`-a`): Automatically approve the request (if you have admin permissions). + +### Remove a User + +To revoke access: + +```bash +./data-client collaborator rm [project_id] [username] --profile= +``` + +**Options:** +- `--approve` (`-a`): Automatically approve the revocation. + +## Managing Requests + +### List Requests + +List access requests associated with you or a user. + +```bash +./data-client collaborator ls --profile= +``` + +**Options:** +- `--mine`: List your requests. +- `--active`: List only active requests. +- `--username`: List requests for a specific user (admin only). + +### List Pending Requests + +See requests waiting for approval. + +```bash +./data-client collaborator pending --profile= +``` + +### Approve a Request + +If you are a project administrator, you can approve pending requests. + +```bash +./data-client collaborator approve [request_id] --profile= +``` diff --git a/docs/tools/data-client/authentication.md b/docs/tools/data-client/authentication.md new file mode 100644 index 0000000..fa37a1e --- /dev/null +++ b/docs/tools/data-client/authentication.md @@ -0,0 +1,45 @@ +--- +title: Authentication & Access +--- + +# Authentication & Access (Fence) + +The `data-client` uses the **Fence** service to manage authentication and user access privileges. + +## Authentication Setup + +Authentication is handled via the `configure` command using an API Key credential file. See [Configuration](index.md#configuration) for details. + +When you run a command, the `data-client`: +1. Validates your API Key. +2. Requests a temporary Access Token from Fence. +3. Uses this Access Token for subsequent API calls. + +If your Access Token has expired, the client automatically refreshes it using your API Key. + +## Checking Privileges + +You can verify your current access privileges and see which projects/resources you have access to using the `auth` command. + +### Command + +```bash +./data-client auth --profile= +``` + +### Example Usage + +```bash +./data-client auth --profile=mycommons +``` + +### Output + +The command lists the resources (projects) you can access and the specific permissions you have for each (e.g., read, write, delete). 
+ +```text +You have access to the following resource(s) at https://data.mycommons.org: + +/programs/program1/projects/projectA [read, read-storage, write-storage] +/programs/program1/projects/projectB [read] +``` diff --git a/docs/tools/data-client/data_management.md b/docs/tools/data-client/data_management.md new file mode 100644 index 0000000..30dfbb9 --- /dev/null +++ b/docs/tools/data-client/data_management.md @@ -0,0 +1,64 @@ +--- +title: Data Management +--- + +# Data Management + +The `data-client` facilitates secure data transfer between your local environment and the Gen3 Data Commons using the **Indexd** (indexing) and **Fence** (authentication) services. + +## Uploading Data + +You can upload files or directories for registration and storage in the Data Commons. The process handles: +1. Registering the file with `Indexd` (creating a GUID). +2. Obtaining a presigned URL from `Fence`. +3. Uploading the file content to object storage (e.g., S3). + +### Command + +```bash +./data-client upload --profile= --upload-path= +``` + +### Options + +- `--upload-path`: Path to a single file, a folder, or a glob pattern (e.g., `data/*.bam`). +- `--batch`: Enable parallel uploads for better performance. +- `--numparallel`: Number of parallel uploads (default: 3). +- `--bucket`: Target bucket (if not using default). +- `--metadata`: Look for `[filename]_metadata.json` sidecar files to upload metadata alongside the file. + +### Example + +Upload a single file: +```bash +./data-client upload --profile=mycommons --upload-path=data/sample.bam +``` + +Upload a directory with parallel processing: +```bash +./data-client upload --profile=mycommons --upload-path=data/ --batch --numparallel=5 +``` + +## Downloading Data + +You can download data using their GUIDs (Globally Unique Identifiers). + +### Command + +```bash +./data-client download --profile= --guid= +``` + +### Options + +- `--guid`: The GUID of the file to download. +- `--no-prompt`: Skip overwrite confirmation prompts. +- `--dir`: Target directory for download (default: current directory). + +To download multiple files, you can use the `download-multiple` functionality (often via manifest, check `./data-client download --help` for specific usages as they may vary). + +### Example + +```bash +./data-client download --profile=mycommons --guid=dg.1234/5678-abcd +``` diff --git a/docs/tools/data-client/index.md b/docs/tools/data-client/index.md new file mode 100644 index 0000000..2a6000c --- /dev/null +++ b/docs/tools/data-client/index.md @@ -0,0 +1,57 @@ +--- +title: Data Client +--- + +# Data Client + +The `data-client` is the modern CALYPR client library and CLI tool. It serves two primary purposes: +1. **Data Interaction**: A unified interface for uploading, downloading, and managing data in Gen3 Data Commons. +2. **Permissions Management**: It handles user access and project collaboration, replacing older tools like `calypr_admin`. + +## Architecture + +The `data-client` is built upon a modular architecture centered around the `Gen3Interface`. This interface acts as a facade, coordinating interactions with specific Gen3 services. 
+ +```mermaid +graph TD + CLI[Data Client CLI] --> G3I[Gen3Interface] + G3I --> Auth[Fence Client] + G3I --> Idx[Indexd Client] + G3I --> Job[Sower Client] + G3I --> Req[Requestor Client] + + Auth --> |Authentication/Tokens| FenceService((Fence Service)) + Idx --> |File Registration| IndexdService((Indexd Service)) + Job --> |Job Submission| SowerService((Sower Service)) + Req --> |Access Requests| RequestorService((Requestor Service)) +``` + +### Components + +The `data-client` integrates the following Gen3 clients: + +- **Fence Client**: Handles authentication (API keys, Access Tokens) and presigned URL generation for data access. +- **Indexd Client**: Manages file registration (GUIDs), indexing, and file location resolution. +- **Sower Client**: Manages job submissions and monitoring (e.g., for data analysis workflows). +- **Requestor Client**: Handles data access requests and collaboration management. + +## Configuration + +The `data-client` uses a configuration profile system to manage credentials for different Gen3 commons. + +Configuration is stored in `~/.gen3/gen3_client_config.ini`. + +### Setting up a Profile + +To configure a new profile, you need an API Key (Credential file) downloaded from the Gen3 Commons profile page. + +```bash +./data-client configure --profile= --cred= --apiendpoint= +``` + +Example: +```bash +./data-client configure --profile=mycommons --cred=credentials.json --apiendpoint=https://data.mycommons.org +``` + +Once configured, you can use the `--profile` flag in other commands to target this environment. diff --git a/docs/tools/forge/.nav.yml b/docs/tools/forge/.nav.yml new file mode 100644 index 0000000..ab1e833 --- /dev/null +++ b/docs/tools/forge/.nav.yml @@ -0,0 +1,6 @@ +title: Forge +nav: + - Overview: index.md + - Validation: validation.md + - Publishing: publishing.md + - Configuration: configuration.md diff --git a/docs/tools/forge/configuration.md b/docs/tools/forge/configuration.md new file mode 100644 index 0000000..26dcf7f --- /dev/null +++ b/docs/tools/forge/configuration.md @@ -0,0 +1,125 @@ +--- +title: Configuration +--- + +# Forge Configuration + +Forge manages the configuration for the CALYPR Explorer UI. This configuration defines how data is displayed, filtered, and accessed in the web interface. + +## Creating a Configuration + +You can generate a starter configuration template for your project using the `forge config` command. + +```bash +forge config --remote +``` + +This command: +1. Reads the Project ID from your specified remote (or default remote). +2. Creates a `CONFIG` directory if it doesn't exist. +3. Generates a template JSON file named `.json` inside `CONFIG/`. + +**Example:** + +```bash +forge config --remote production +``` + +If your project ID is `my-project`, this creates `CONFIG/my-project.json`. + +## Editing Configuration + +The configuration is a standard JSON file. You can edit it with any text editor. + +### Top-Level Structure + +The configuration is an array of objects, where each object represents a **Tab** in the data explorer (e.g., "Patients", "Samples", "Files"). + +```json +{ + "ExplorerConfig": [ + { + "tabTitle": "Research Subject", + "filters": { ... }, + "table": { ... }, + "guppyConfig": { ... } + } + ] +} +``` + +### Key Components + +#### `tabTitle` +The display name of the tab in the UI. + +#### `guppyConfig` +Defines the connection to the backend index (Guppy). + +- `dataType`: The index type in Guppy (e.g., "patient", "file"). +- `nodeCountTitle`: Label for the count of items (e.g., "Patients"). 
+- `accessibleFieldCheckList`: Fields to check for access control (usually `["project_id"]`). + +#### `table` +Configures the data table displayed in the tab. + +- `enabled`: Set to `true` to show the table. +- `fields`: Array of field names to include in the table data. +- `columns`: Dictionary defining how each field is rendered. + - `title`: Column header text. + - `cellRenderFunction`: Optional custom renderer (e.g., "HumanReadableString" for file sizes). + +#### `filters` +Configures the faceted search filters on the left sidebar. + +- `tabs`: Grouping of filters. + - `fields`: List of fields to show as filters. + - `fieldsConfig`: Custom labels for the filters. + +## Example Configuration + +Here is a simplified example configuration for a "Research Subject" tab: + +```json +{ + "ExplorerConfig": [ + { + "tabTitle": "Research Subject", + "guppyConfig": { + "dataType": "researchsubject", + "nodeCountTitle": "Research Subjects", + "accessibleFieldCheckList": ["project_id"] + }, + "filters": { + "tabs": [ + { + "fields": ["project_id", "gender", "race"], + "fieldsConfig": { + "project_id": { "label": "Project" }, + "gender": { "label": "Gender" } + } + } + ] + }, + "table": { + "enabled": true, + "fields": ["project_id", "submitter_id", "gender", "race"], + "columns": { + "project_id": { "title": "Project" }, + "submitter_id": { "title": "ID" }, + "gender": { "title": "Gender" }, + "race": { "title": "Race" } + } + } + } + ] +} +``` + +## Validation + +After editing your configuration, always validate it to ensure there are no syntax errors or invalid structures. + +```bash +forge validate config --path CONFIG/my-project.json +``` diff --git a/docs/tools/forge/index.md b/docs/tools/forge/index.md new file mode 100644 index 0000000..5d10a27 --- /dev/null +++ b/docs/tools/forge/index.md @@ -0,0 +1,48 @@ +--- +title: Forge +--- + +# Forge + +Forge is the CALYPR metadata management tool. It streamlines the validation, publishing, and management of data dictionaries and metadata schemas for Gen3 Data Commons. + +## Core Features + +- **Validation**: Validate your data and schemas against the Gen3 data model. +- **Publishing**: Publish schemas and metadata to a Gen3 instance. +- **Metadata Management**: Tools to query and manipulate metadata. + +## Commands + +### `validate` + +The `validate` command suite is used to ensure your data and configurations are correct before submission. + +- **`forge validate config `**: Validates a configuration file. +- **`forge validate data `**: Validates data files (e.g., JSON, TSV) against the schema. +- **`forge validate edge `**: Validates relationships (edges) between data nodes. + +### `publish` + +Manage the publishing lifecycle of your data schemas. + +- **`forge publish`**: Publish the current schema/metadata to the configured environment. +- **`forge publish status`**: Check the status of a publishing job. +- **`forge publish list`**: List available publication resources. +- **`forge publish output`**: Retrieve the output of a publication process. + +### `meta` + +Tools for handling metadata directly. + +```bash +forge meta [subcommand] +``` + +### `config` + +Manage Forge configuration settings. 
+ +```bash +forge config +``` diff --git a/docs/tools/forge/publishing.md b/docs/tools/forge/publishing.md new file mode 100644 index 0000000..060848b --- /dev/null +++ b/docs/tools/forge/publishing.md @@ -0,0 +1,68 @@ +--- +title: Publishing +--- + +# Publishing + +The `forge` tool handles the lifecycle of publishing metadata to Gen3 Commons via the **Sower** service (for async job processing). + +## Publishing Metadata + +To start a new metadata publication job: + +```bash +forge publish [flags] +``` + +This command submits a job to the Sower service. + +**Arguments:** +- ``: A GitHub Personal Access Token (PAT) is required by the backend worker to access the repository containing the metadata schema. + +**Flags:** +- `--remote`, `-r`: Target remote DRS server name (default: "default_remote"). + +**Output:** +Returns the Job UID, Name, and initial Status. + +```text +Uid: 12345-abcde Name: metadata-publish Status: PENDING +``` + +## Monitoring Jobs + +### List Jobs + +View all jobs cataloged in Sower. + +```bash +forge publish list [flags] +``` + +**Flags:** +- `--remote`, `-r`: Target remote DRS server. + +### Check Status + +Check the status of a specific job by its UID. + +```bash +forge publish status [flags] +``` + +**Flags:** +- `--remote`, `-r`: Target remote DRS server. + +### View Logs + +Retrieve the output logs of a specific job. + +```bash +forge publish output [flags] +``` + +**Flags:** +- `--remote`, `-r`: Target remote DRS server. + +**Output:** +Displays the raw logs from the backend job execution, which is useful for debugging failures. diff --git a/docs/tools/forge/validation.md b/docs/tools/forge/validation.md new file mode 100644 index 0000000..f6a4df2 --- /dev/null +++ b/docs/tools/forge/validation.md @@ -0,0 +1,81 @@ +--- +title: Validation +--- + +# Validation + +The `forge validate` command suite ensures that your metadata and configuration files adhere to the expected formats and schemas. This is a critical step before publishing data to a Gen3 Commons. + +## Validate Data + +Validates FHIR-based metadata files (NDJSON format) against a JSON schema. + +```bash +forge validate data [flags] +``` + +By default, it looks for files in a `META` directory or can be pointed to a specific file/directory. + +**Flags:** +- `--path`, `-p`: Path to metadata file(s) or directory to validate (default: `META`). + +**Behavior:** +- Checks if files are valid NDJSON. +- Validates each row against the corresponding JSON schema. +- Reports total files, rows, and errors found. + +**Output Example:** +```text +File: META/Patient.ndjson + Rows validated: 15 + Errors found: 0 +--- +Overall Totals + Files validated: 1 + Rows validated: 15 + Errors: 0 +``` + +## Validate Edge + +Checks for integrity issues in the graph data, specifically looking for "orphaned edges"—relationships that point to non-existent vertices. + +```bash +forge validate edge [flags] +``` + +**Flags:** +- `--path`, `-p`: Path to metadata files directory (default: `META`). +- `--out-dir`, `-o`: Directory to save generated vertices and edges files (JSON). + +**Behavior:** +- Generates graph elements (vertices and edges) from the input NDJSON files. +- Verifies that every edge points to a valid destination vertex. +- Can optionally export the vertices and edges to disk. 
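For example, a typical invocation (a sketch built from the flags above; the output directory name is illustrative) checks edge integrity and exports the generated graph elements next to the metadata:

```bash
# Validate relationships in META/ and export vertices/edges to ./graph (example name)
forge validate edge --path META --out-dir graph
```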
+ +**Output Example:** +```text +File: META/Patient.ndjson + Rows processed: 15 + Vertices generated: 15 + Edges generated: 0 +--- +Orphaned Edges: 0 +Overall Totals: + Files processed: 1 + Rows processed: 15 + Vertices generated: 15 + Edges generated: 0 + Orphaned edges: 0 +``` + +## Validate Config + +Validates the explorer configuration file structure. + +```bash +forge validate config [flags] +``` + +**Flags:** +- `--path`, `-p`: Path to config file to validate (default: `CONFIG`). diff --git a/docs/tools/funnel/_releases.md b/docs/tools/funnel/_releases.md new file mode 100644 index 0000000..3c303ff --- /dev/null +++ b/docs/tools/funnel/_releases.md @@ -0,0 +1,7 @@ +| Asset | Download | +| --- | --- | +| funnel-darwin-amd64-v0.11.7.tar.gz | [Download](https://github.com/ohsu-comp-bio/funnel/releases/download/v0.11.7/funnel-darwin-amd64-v0.11.7.tar.gz) | +| funnel-darwin-arm64-v0.11.7.tar.gz | [Download](https://github.com/ohsu-comp-bio/funnel/releases/download/v0.11.7/funnel-darwin-arm64-v0.11.7.tar.gz) | +| funnel-linux-amd64-v0.11.7.tar.gz | [Download](https://github.com/ohsu-comp-bio/funnel/releases/download/v0.11.7/funnel-linux-amd64-v0.11.7.tar.gz) | +| funnel-linux-arm64-v0.11.7.tar.gz | [Download](https://github.com/ohsu-comp-bio/funnel/releases/download/v0.11.7/funnel-linux-arm64-v0.11.7.tar.gz) | +| funnel-v0.11.7-checksums.txt | [Download](https://github.com/ohsu-comp-bio/funnel/releases/download/v0.11.7/funnel-v0.11.7-checksums.txt) | diff --git a/docs/tools/funnel/docs.md b/docs/tools/funnel/docs.md new file mode 100644 index 0000000..51e6e66 --- /dev/null +++ b/docs/tools/funnel/docs.md @@ -0,0 +1,81 @@ +--- +title: Overview +menu: + main: + identifier: docs + weight: -1000 +--- + +# Overview + +Funnel makes distributed, batch processing easier by providing a simple task API and a set of +components which can easily adapted to a vareity of platforms. + +### Task + +A task defines a unit of work: metadata, input files to download, a sequence of Docker containers + commands to run, +output files to upload, state, and logs. The API allows you to create, get, list, and cancel tasks. + +Tasks are accessed via the `funnel task` command. There's an HTTP client in the [client package][clientpkg], +and a set of utilities and a gRPC client in the [proto/tes package][tespkg]. + +There's a lot more you can do with the task API. See the [tasks docs](./docs/tasks.md) for more. + +### Server + +The server serves the task API, web dashboard, and optionally runs a task scheduler. +It serves both HTTP/JSON and gRPC/Protobuf. + +The server is accessible via the `funnel server` command and the [server package][serverpkg]. + +### Storage + +Storage provides access to file systems such as S3, Google Storage, and local filesystems. +Tasks define locations where files should be downloaded from and uploaded to. Workers handle +the downloading/uploading. + +See the [storage docs](./docs/storage.md) for more information on configuring storage backends. +The storage clients are available in the [storage package][storagepkg]. + +### Worker + +A worker is reponsible for executing a task. There is one worker per task. 
A worker: + +- downloads the inputs +- runs the sequence of executors (usually via Docker) +- uploads the outputs + +Along the way, the worker writes logs to event streams and databases: + +- start/end time +- state changes (initializing, running, error, etc) +- executor start/end times +- executor exit codes +- executor stdout/err logs +- a list of output files uploaded, with sizes +- system logs, such as host name, docker command, system error messages, etc. + +The worker is accessible via the `funnel worker` command and the [worker package][workerpkg]. + +### Node Scheduler + +A node is a service that stays online and manages a pool of task workers. A Funnel cluster +runs a node on each VM. Nodes communicate with a Funnel scheduler, which assigns tasks +to nodes based on available resources. Nodes handle starting workers when for each assigned +task. + +Nodes aren't always required. In some cases it often makes sense to rely on an existing, +external system for scheduling tasks and managing cluster resources, such as AWS Batch +or HPC systems like HTCondor, Slurm, Grid Engine, etc. Funnel provides integration with +these services that doesn't include nodes or scheduling by Funnel. + +See [Deploying a cluster](./docs/compute/deployment.md) for more information about running a cluster of nodes. + +The node is accessible via the `funnel node` command and the [scheduler package][schedpkg]. + +[tes]: https://github.com/ga4gh/task-execution-schemas +[serverpkg]: https://github.com/ohsu-comp-bio/funnel/tree/main/server +[workerpkg]: https://github.com/ohsu-comp-bio/funnel/tree/main/worker +[schedpkg]: https://github.com/ohsu-comp-bio/funnel/tree/main/compute/scheduler +[tespkg]: https://github.com/ohsu-comp-bio/funnel/tree/main/tes +[storagepkg]: https://github.com/ohsu-comp-bio/funnel/tree/main/storage diff --git a/docs/tools/funnel/docs/compute/aws-batch.md b/docs/tools/funnel/docs/compute/aws-batch.md new file mode 100644 index 0000000..bebc256 --- /dev/null +++ b/docs/tools/funnel/docs/compute/aws-batch.md @@ -0,0 +1,100 @@ +--- +title: AWS Batch +menu: + main: + parent: Compute + weight: 20 +--- + +# AWS Batch + +This guide covers deploying a Funnel server that leverages [DynamoDB][0] for storage +and [AWS Batch][1] for task execution. + +## Setup + +Get started by creating a compute environment, job queue and job definition using either +the Funnel CLI or the AWS Batch web console. To manage the permissions of instanced +AWS Batch jobs create a new IAM role. For the Funnel configuration outlined +in this document, this role will need to provide read and write access to both S3 and DynamoDB. + +_Note_: We recommend creating the Job Definition with Funnel by running: `funnel aws batch create-job-definition`. +Funnel expects the JobDefinition to start a Funnel worker process with a specific configuration. +Only advanced users should consider making any substantial changes to this Job Definition. + +AWS Batch tasks, by default, launch the ECS Optimized AMI which includes +an 8GB volume for the operating system and a 22GB volume for Docker image and metadata +storage. The default Docker configuration allocates up to 10GB of this storage to +each container instance. [Read more about the default AMI][8]. Due to these limitations, we +recommend [creating a custom AMI][7]. Because AWS Batch has the same requirements for your +AMI as Amazon ECS, use the default Amazon ECS-optimized Amazon Linux AMI as a base and change it +to better suit your tasks. 
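For the IAM role mentioned above, a minimal CLI sketch is shown below; the role and file names are illustrative, the broad managed policies should be narrowed to your own buckets and tables, and the console links in the steps that follow achieve the same result interactively.

```bash
# Trust policy letting ECS tasks assume the role (file name is illustrative)
cat > ecs-tasks-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ecs-tasks.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

# Create the role and grant S3 + DynamoDB read/write (narrow these in production)
aws iam create-role \
  --role-name funnel-ecs-task-role \
  --assume-role-policy-document file://ecs-tasks-trust-policy.json

aws iam attach-role-policy \
  --role-name funnel-ecs-task-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

aws iam attach-role-policy \
  --role-name funnel-ecs-task-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
```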
+ +### Steps +* [Create a Compute Environment][3] +* (_Optional_) [Create a custom AMI][7] +* [Create a Job Queue][4] +* [Create an EC2ContainerTaskRole with policies for managing access to S3 and DynamoDB][5] +* [Create a Job Definition][6] + +For more information check out AWS Batch's [getting started guide][2]. + +### Quickstart + +``` +$ funnel aws batch create-all-resources --region us-west-2 + +``` + +This command will create a compute environment, job queue, IAM role and job definition. + +## Configuring the Funnel Server + +Below is an example configuration. Note that the `Key` +and `Secret` fields are left blank in the configuration of the components. This is because +Funnel will, by default, try to automatically load credentials from the environment. +Alternatively, you may explicitly set the credentials in the config. + +```YAML +Database: "dynamodb" +Compute: "aws-batch" +EventWriters: + - "log" + +Dynamodb: + TableBasename: "funnel" + Region: "us-west-2" + Key: "" + Secret: "" + +Batch: + JobDefinition: "funnel-job-def" + JobQueue: "funnel-job-queue" + Region: "us-west-2" + Key: "" + Secret: "" + +AmazonS3: + Key: "" + Secret: "" +``` + +### Start the server + +```sh +funnel server run --config /path/to/config.yaml +``` + +### Known issues + +The `Task.Resources.DiskGb` field does not have any effect. See [issue 317](https://github.com/ohsu-comp-bio/funnel/issues/317). + +[0]: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html +[1]: http://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html +[2]: http://docs.aws.amazon.com/batch/latest/userguide/Batch_GetStarted.html +[3]: https://us-west-2.console.aws.amazon.com/batch/home?region=us-west-2#/compute-environments/new +[4]: https://us-west-2.console.aws.amazon.com/batch/home?region=us-west-2#/queues/new +[5]: https://console.aws.amazon.com/iam/home?region=us-west-2#/roles$new?step=permissions&selectedService=EC2ContainerService&selectedUseCase=EC2ContainerTaskRole +[6]: https://us-west-2.console.aws.amazon.com/batch/home?region=us-west-2#/job-definitions/new +[7]: http://docs.aws.amazon.com/batch/latest/userguide/create-batch-ami.html +[8]: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html diff --git a/docs/tools/funnel/docs/compute/deployment.md b/docs/tools/funnel/docs/compute/deployment.md new file mode 100644 index 0000000..2ea266f --- /dev/null +++ b/docs/tools/funnel/docs/compute/deployment.md @@ -0,0 +1,79 @@ +--- +title: Deploying a cluster +menu: + main: + parent: Compute + weight: -50 +--- + +# Deploying a cluster + +This guide describes the basics of starting a cluster of Funnel nodes. +This guide is a work in progress. + +A node is a service +which runs on each machine in a cluster. The node connects to the Funnel server and reports +available resources. The Funnel scheduler process assigns tasks to nodes. When a task is +assigned, a node will start a worker process. There is one worker process per task. + +Nodes aren't always required. In some cases it makes sense to rely on an existing, +external system for scheduling tasks and managing cluster resources, such as AWS Batch, +HTCondor, Slurm, Grid Engine, etc. Funnel provides integration with +these services without using nodes or the scheduler. + +### Usage + +Nodes are available via the `funnel node` command. To start a node, run +```sh +funnel node run --config node.config.yml +``` + +To activate the Funnel scheduler, use the `manual` backend in the config. 
+ +The available scheduler and node config: +```yaml +# Activate the Funnel scheduler. +Compute: manual + +Scheduler: + # How often to run a scheduler iteration. + ScheduleRate: 1s + + # How many tasks to schedule in one iteration. + ScheduleChunk: 10 + + # How long to wait between updates before marking a node dead. + NodePingTimeout: 1m + + # How long to wait for a node to start, before marking the node dead. + NodeInitTimeout: 5m + + +Node: + # If empty, a node ID will be automatically generated using the hostname. + ID: "" + + # If the node has been idle for longer than the timeout, it will shut down. + # -1 means there is no timeout. 0 means timeout immediately after the first task. + Timeout: -1s + + # A Node will automatically try to detect what resources are available to it. + # Defining Resources in the Node configuration overrides this behavior. + Resources: + # CPUs available. + # Cpus: 0 + # RAM available, in GB. + # RamGb: 0.0 + # Disk space available, in GB. + # DiskGb: 0.0 + + # For low-level tuning. + # How often to sync with the Funnel server. + UpdateRate: 5s + +Logger: + # Logging levels: debug, info, error + Level: info + # Write logs to this path. If empty, logs are written to stderr. + OutputFile: "" +``` diff --git a/docs/tools/funnel/docs/compute/grid-engine.md b/docs/tools/funnel/docs/compute/grid-engine.md new file mode 100644 index 0000000..d5b5921 --- /dev/null +++ b/docs/tools/funnel/docs/compute/grid-engine.md @@ -0,0 +1,57 @@ +--- +title: Grid Engine +--- +# Grid Engine + +Funnel can be configured to submit workers to [Grid Engine](https://gridscheduler.sourceforge.net/) by making calls +to `qsub`. + +The Funnel server needs to run on a submission node. +Configure Funnel to use Grid Engine by including the following config: + +It is recommended to update the submit file template so that the +`funnel worker run` command takes a config file as an argument: + +``` +{% raw %} +funnel worker run --config /opt/funnel_config.yml --taskID {{.TaskId}} +{% endraw %} +``` + +```YAML +{% raw %} +Compute: gridengine + +GridEngine: + Template: | + #!/bin/bash + #$ -N {{.TaskId}} + #$ -o {{.WorkDir}}/funnel-stdout + #$ -e {{.WorkDir}}/funnel-stderr + {{if ne .Cpus 0 -}} + {{printf "#$ -pe mpi %d" .Cpus}} + {{- end}} + {{if ne .RamGb 0.0 -}} + {{printf "#$ -l h_vmem=%.0fG" .RamGb}} + {{- end}} + {{if ne .DiskGb 0.0 -}} + {{printf "#$ -l h_fsize=%.0fG" .DiskGb}} + {{- end}} + funnel worker run --taskID {{.TaskId}} +{% endraw %} +``` + +The following variables are available for use in the template: + +| Variable | Description | +|:------------|:-------------| +|TaskId | funnel task id | +|WorkDir | funnel working directory | +|Cpus | requested cpu cores | +|RamGb | requested ram | +|DiskGb | requested free disk space | +|Zone | requested zone (could be used for queue name) | + +See https://golang.org/pkg/text/template for information on creating templates. + +[ge]: http://gridscheduler.sourceforge.net/documentation.html diff --git a/docs/tools/funnel/docs/compute/htcondor.md b/docs/tools/funnel/docs/compute/htcondor.md new file mode 100644 index 0000000..a6c8ebd --- /dev/null +++ b/docs/tools/funnel/docs/compute/htcondor.md @@ -0,0 +1,61 @@ +--- +title: HTCondor +menu: + main: + parent: Compute + weight: 20 +--- +# HTCondor + +Funnel can be configured to submit workers to [HTCondor][htcondor] by making +calls to `condor_submit`. + +The Funnel server needs to run on a submission node. 
+Configure Funnel to use HTCondor by including the following config: + +It is recommended to update the submit file template so that the +`funnel worker run` command takes a config file as an argument {% raw %} +(e.g. `funnel worker run --config /opt/funnel_config.yml --taskID {{.TaskId}}`){% endraw %} + +```YAML +{% raw %} +Compute: htcondor + +HTCondor: + Template: | + universe = vanilla + getenv = True + executable = funnel + arguments = worker run --taskID {{.TaskId}} + log = {{.WorkDir}}/condor-event-log + error = {{.WorkDir}}/funnel-stderr + output = {{.WorkDir}}/funnel-stdout + should_transfer_files = YES + when_to_transfer_output = ON_EXIT_OR_EVICT + {{if ne .Cpus 0 -}} + {{printf "request_cpus = %d" .Cpus}} + {{- end}} + {{if ne .RamGb 0.0 -}} + {{printf "request_memory = %.0f GB" .RamGb}} + {{- end}} + {{if ne .DiskGb 0.0 -}} + {{printf "request_disk = %.0f GB" .DiskGb}} + {{- end}} + + queue +{% endraw %} +``` +The following variables are available for use in the template: + +| Variable | Description | +|:------------|:-------------| +|TaskId | funnel task id | +|WorkDir | funnel working directory | +|Cpus | requested cpu cores | +|RamGb | requested ram | +|DiskGb | requested free disk space | +|Zone | requested zone (could be used for queue name) | + +See https://golang.org/pkg/text/template for information on creating templates. + +[htcondor]: https://research.cs.wisc.edu/htcondor/ diff --git a/docs/tools/funnel/docs/compute/kubernetes.md b/docs/tools/funnel/docs/compute/kubernetes.md new file mode 100644 index 0000000..a99ab69 --- /dev/null +++ b/docs/tools/funnel/docs/compute/kubernetes.md @@ -0,0 +1,121 @@ +--- +title: Kubernetes +menu: + main: + parent: Compute + weight: 20 +--- + +> Funnel on Kubernetes is in active development and may involve frequent updates + +# Quick Start + +## 1. Deploying with Helm + +```sh +helm repo add ohsu https://ohsu-comp-bio.github.io/helm-charts +helm repo update +helm upgrade --install ohsu funnel +``` + +## Alternative: Deploying with `kubectl` ⚙️" + +### 1. Create a Service: + +Deploy it: + +```sh +kubectl apply -f funnel-service.yml +``` + +### 2. Create Funnel config files + +> *[funnel-server.yaml](https://github.com/ohsu-comp-bio/funnel/blob/develop/deployments/kubernetes/funnel-server.yaml)* + +> *[funnel-worker.yaml](https://github.com/ohsu-comp-bio/funnel/blob/develop/deployments/kubernetes/funnel-worker.yaml)* + +Get the clusterIP: + +```sh +{% raw %} +export HOSTNAME=$(kubectl get services funnel --output=jsonpath='{.spec.clusterIP}') + +sed -i "s|\${HOSTNAME}|${HOSTNAME}|g" funnel-worker.yaml +{% endraw %} +``` + +### 3. Create a ConfigMap + +```sh +kubectl create configmap funnel-config --from-file=funnel-server.yaml --from-file=funnel-worker.yaml +``` + +### 4. Create a Service Account for Funnel + +Define a Role and RoleBinding: + +> *[role.yml](https://github.com/ohsu-comp-bio/funnel/blob/develop/deployments/kubernetes/role.yml)* + +> *[role_binding.yml](https://github.com/ohsu-comp-bio/funnel/blob/develop/deployments/kubernetes/role_binding.yml)* + +```sh +kubectl create serviceaccount funnel-sa --namespace default +kubectl apply -f role.yml +kubectl apply -f role_binding.yml +``` + +### 5. Create a Persistent Volume Claim + +> *[funnel-storage-pvc.yml](https://github.com/ohsu-comp-bio/funnel/blob/develop/deployments/kubernetes/funnel-storage-pvc.yml)* + +```sh +kubectl apply -f funnel-storage-pvc.yml +``` + +### 6. 
Create a Deployment + +> *[funnel-deployment.yml](https://github.com/ohsu-comp-bio/funnel/blob/develop/deployments/kubernetes/funnel-deployment.yml)* + +```sh +kubectl apply -f funnel-deployment.yml +``` + +{% raw %}{{< /details >}}{% endraw %} + +# 2. Proxy the Service for local testing + +```sh +kubectl port-forward service/funnel 8000:8000 +``` + +Now the funnel server can be accessed as if it were running locally. This can be verified by listing all tasks, which will return an empty JSON list: + +```sh +funnel task list +# {} +``` + +A task can then be submitted following the [standard workflow](../tasks.md): + +```sh +funnel examples hello-world > hello-world.json + +funnel task create hello-world.json +# +``` + +# Storage Architecture + + + + + +# Additional Resources 📚 + +- [Helm Repo](https://ohsu-comp-bio.github.io/helm-charts) + +- [Helm Repo Source](https://github.com/ohsu-comp-bio/helm-charts) + +- [Helm Charts](https://github.com/ohsu-comp-bio/funnel/tree/develop/deployments/kubernetes/helm) + +- [The Chart Best Practices Guide](https://helm.sh/docs/chart_best_practices/) diff --git a/docs/tools/funnel/docs/compute/pbs-torque.md b/docs/tools/funnel/docs/compute/pbs-torque.md new file mode 100644 index 0000000..59b2dfa --- /dev/null +++ b/docs/tools/funnel/docs/compute/pbs-torque.md @@ -0,0 +1,57 @@ +--- +title: PBS/Torque +render_macros: true +menu: + main: + parent: Compute + weight: 20 +--- +# PBS/Torque + +Funnel can be configured to submit workers to [PBS/Torque][pbs] by making calls +to `qsub`. + +The Funnel server needs to run on a submission node. +Configure Funnel to use PBS by including the following config: + +It is recommended to update the submit file template so that the +`funnel worker run` command takes a config file as an argument +(e.g. `funnel worker run --config /opt/funnel_config.yml --taskID {% raw %}{{.TaskId}}{% endraw %}`) + +{% raw %} +```YAML +Compute: pbs + +PBS: + Template: | + #!/bin/bash + #PBS -N {{.TaskId}} + #PBS -o {{.WorkDir}}/funnel-stdout + #PBS -e {{.WorkDir}}/funnel-stderr + {{if ne .Cpus 0 -}} + {{printf "#PBS -l nodes=1:ppn=%d" .Cpus}} + {{- end}} + {{if ne .RamGb 0.0 -}} + {{printf "#PBS -l mem=%.0fgb" .RamGb}} + {{- end}} + {{if ne .DiskGb 0.0 -}} + {{printf "#PBS -l file=%.0fgb" .DiskGb}} + {{- end}} + + funnel worker run --taskID {{.TaskId}} +``` +{% endraw %} +The following variables are available for use in the template: + +| Variable | Description | +|:------------|:-------------| +|TaskId | funnel task id | +|WorkDir | funnel working directory | +|Cpus | requested cpu cores | +|RamGb | requested ram | +|DiskGb | requested free disk space | +|Zone | requested zone (could be used for queue name) | + +See https://golang.org/pkg/text/template for information on creating templates. + +[pbs]: https://hpc-wiki.info/hpc/Torque diff --git a/docs/tools/funnel/docs/compute/slurm.md b/docs/tools/funnel/docs/compute/slurm.md new file mode 100644 index 0000000..98f1697 --- /dev/null +++ b/docs/tools/funnel/docs/compute/slurm.md @@ -0,0 +1,57 @@ +--- +title: Slurm +menu: + main: + parent: Compute + weight: 20 +--- +# Slurm + +Funnel can be configured to submit workers to [Slurm][slurm] by making calls +to `sbatch`. + +The Funnel server needs to run on a submission node. +Configure Funnel to use Slurm by including the following config: + +It is recommended to update the submit file template so that the +`funnel worker run` command takes a config file as an argument +(e.g. 
`funnel worker run --config /opt/funnel_config.yml --taskID {% raw %}{{.TaskId}}{% endraw %}`) + +{% raw %} +```YAML +Compute: slurm + +Slurm: + Template: | + #!/bin/bash + #SBATCH --job-name {{.TaskId}} + #SBATCH --ntasks 1 + #SBATCH --error {{.WorkDir}}/funnel-stderr + #SBATCH --output {{.WorkDir}}/funnel-stdout + {{if ne .Cpus 0 -}} + {{printf "#SBATCH --cpus-per-task %d" .Cpus}} + {{- end}} + {{if ne .RamGb 0.0 -}} + {{printf "#SBATCH --mem %.0fGB" .RamGb}} + {{- end}} + {{if ne .DiskGb 0.0 -}} + {{printf "#SBATCH --tmp %.0fGB" .DiskGb}} + {{- end}} + + funnel worker run --taskID {{.TaskId}} +``` +{% endraw %} +The following variables are available for use in the template: + +| Variable | Description | +|:------------|:-------------| +|TaskId | funnel task id | +|WorkDir | funnel working directory | +|Cpus | requested cpu cores | +|RamGb | requested ram | +|DiskGb | requested free disk space | +|Zone | requested zone (could be used for queue name) | + +See https://golang.org/pkg/text/template for information on creating templates. + +[slurm]: https://slurm.schedmd.com/ diff --git a/docs/tools/funnel/docs/databases.md b/docs/tools/funnel/docs/databases.md new file mode 100644 index 0000000..5eeb638 --- /dev/null +++ b/docs/tools/funnel/docs/databases.md @@ -0,0 +1,8 @@ +--- +title: Databases +menu: + main: + weight: 5 +--- + +# Databases diff --git a/docs/tools/funnel/docs/databases/boltdb.md b/docs/tools/funnel/docs/databases/boltdb.md new file mode 100644 index 0000000..ea5885e --- /dev/null +++ b/docs/tools/funnel/docs/databases/boltdb.md @@ -0,0 +1,24 @@ +--- +title: Embedded +menu: + main: + parent: Databases + weight: -10 +--- + +# Embedded + +By default, Funnel uses an embedded database named [BoltDB][bolt] to store task +and scheduler data. This is great for development and a simple server without +external dependencies, but it doesn't scale well to larger clusters. + +Available config: +```yaml +Database: boltdb + +BoltDB: + # Path to database file + Path: ./funnel-work-dir/funnel.db +``` + +[bolt]: https://github.com/boltdb/bolt diff --git a/docs/tools/funnel/docs/databases/datastore.md b/docs/tools/funnel/docs/databases/datastore.md new file mode 100644 index 0000000..ea31d8c --- /dev/null +++ b/docs/tools/funnel/docs/databases/datastore.md @@ -0,0 +1,94 @@ +--- +title: Datastore +menu: + main: + parent: Databases +--- + +# Google Cloud Datastore + +Funnel supports storing tasks (but not scheduler data) in Google Cloud Datastore. + +This implementation currently doesn't work with Appengine, since Appengine places +special requirements on the context of requests and requires a separate library. + +Two entity types are used, "Task" and "TaskPart" (for larger pieces of task content, +such as stdout/err logs). + +Funnel will, by default, try to automatically load credentials from the +environment. Alternatively, you may explicitly set the credentials in the config. +You can read more about providing the credentials +[here](https://cloud.google.com/docs/authentication/application-default-credentials). + +Config: +```yaml +Database: datastore + +Datastore: + Project: "" + # Path to account credentials file. + # Optional. If possible, credentials will be automatically discovered + # from the environment. + CredentialsFile: "" +``` + +Please also import some [composite +indexes](https://cloud.google.com/datastore/docs/concepts/indexes?hl=en) +to support the task-list queries. 
+This is typically done through command-line by referencing an **index.yaml** +file (do not change the filename) with the following content: + +```shell +gcloud datastore indexes create path/to/index.yaml --database='funnel' +``` + +```yaml +indexes: + +- kind: Task + properties: + - name: Owner + - name: State + - name: TagStrings + - name: CreationTime + direction: desc + +- kind: Task + properties: + - name: Owner + - name: State + - name: CreationTime + direction: desc + +- kind: Task + properties: + - name: Owner + - name: TagStrings + - name: CreationTime + direction: desc + +- kind: Task + properties: + - name: Owner + - name: CreationTime + direction: desc + +- kind: Task + properties: + - name: State + - name: TagStrings + - name: CreationTime + direction: desc + +- kind: Task + properties: + - name: State + - name: CreationTime + direction: desc + +- kind: Task + properties: + - name: TagStrings + - name: CreationTime + direction: desc +``` \ No newline at end of file diff --git a/docs/tools/funnel/docs/databases/dynamodb.md b/docs/tools/funnel/docs/databases/dynamodb.md new file mode 100644 index 0000000..3e536c2 --- /dev/null +++ b/docs/tools/funnel/docs/databases/dynamodb.md @@ -0,0 +1,30 @@ +--- +title: DynamoDB +menu: + main: + parent: Databases +--- + +# DynamoDB + +Funnel supports storing task data in DynamoDB. Storing scheduler data is not supported currently, so using the node scheduler with DynamoDB won't work. Using AWS Batch for compute scheduling may be a better option. +Funnel will, by default, try to automatically load credentials from the environment. Alternatively, you may explicitly set the credentials in the config. + +Available Config: +```yaml +Database: dynamodb + +DynamoDB: + # Basename to use for dynamodb tables + TableBasename: "funnel" + # AWS region + Region: "us-west-2" + # AWS Access key ID + Key: "" + # AWS Secret Access Key + Secret: "" +``` + +### Known issues + +Dynamo does not store scheduler data. See [issue 340](https://github.com/ohsu-comp-bio/funnel/issues/340). diff --git a/docs/tools/funnel/docs/databases/elasticsearch.md b/docs/tools/funnel/docs/databases/elasticsearch.md new file mode 100644 index 0000000..e397348 --- /dev/null +++ b/docs/tools/funnel/docs/databases/elasticsearch.md @@ -0,0 +1,30 @@ +--- +title: Elasticsearch +menu: + main: + parent: Databases +--- + +# Elasticsearch + +Funnel supports storing tasks and scheduler data in Elasticsearch (v8). + +Config: +```yaml +Database: elastic + +Elastic: + # Prefix to use for indexes + IndexPrefix: "funnel" + URL: http://localhost:9200 + # Optional. Username for HTTP Basic Authentication. + Username: + # Optional. Password for HTTP Basic Authentication. + Password: + # Optional. Endpoint for the Elastic Service (https://elastic.co/cloud). + CloudID: + # Optional. Base64-encoded token for authorization; if set, overrides username/password and service token. + APIKey: + # Optional. Service token for authorization; if set, overrides username/password. + ServiceToken: +``` diff --git a/docs/tools/funnel/docs/databases/mongodb.md b/docs/tools/funnel/docs/databases/mongodb.md new file mode 100644 index 0000000..4a6e8ab --- /dev/null +++ b/docs/tools/funnel/docs/databases/mongodb.md @@ -0,0 +1,24 @@ +--- +title: MongoDB +menu: + main: + parent: Databases +--- + +# MongoDB + +Funnel supports storing tasks and scheduler data in MongoDB. + +Config: +```yaml +Database: mongodb + +MongoDB: + # Addresses for the seed servers. 
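+  # e.g. a single "localhost" entry, or one host[:port] entry per replica-set member (values illustrative).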
+ Addrs: + - "localhost" + # Database name used within MongoDB to store funnel data. + Database: "funnel" + Username: "" + Password: "" +``` diff --git a/docs/tools/funnel/docs/development.md b/docs/tools/funnel/docs/development.md new file mode 100644 index 0000000..1412f0d --- /dev/null +++ b/docs/tools/funnel/docs/development.md @@ -0,0 +1,8 @@ +--- +title: Development +menu: + main: + weight: 30 +--- + +# Development diff --git a/docs/tools/funnel/docs/development/developers.md b/docs/tools/funnel/docs/development/developers.md new file mode 100644 index 0000000..2d19f0a --- /dev/null +++ b/docs/tools/funnel/docs/development/developers.md @@ -0,0 +1,97 @@ +--- +title: Funnel Developers + +menu: + main: + parent: Development + weight: 30 +--- + +# Developers + +This page contains a rough collection of notes for people wanting to build Funnel from source and/or edit the code. + +### Building the Funnel source + +1. Install [Go 1.21+][go]. Check the version with `go version`. +2. Ensure GOPATH is set. See [the docs][gopath] for help. Also, you probably want to add `$GOPATH/bin` to your `PATH`. +3. Clone funnel and build + + ```shell + $ git clone https://github.com/ohsu-comp-bio/funnel.git + $ cd funnel + $ make + ``` + +4. Funnel is now downloaded and installed. Try `funnel version`. +5. You can edit the code and run `make install` to recompile. + +### Developer Tools + +A Funnel development environment includes: + +- [Go 1.21+][go] for the majority of the code. +- [Task Execution Schemas][tes] for task APIs. +- [Protobuf][protobuf] + [gRPC][grpc] for RPC communication. +- [gRPC Gateway][gateway] for HTTP communication. +- [Angular][angular] and [SASS][sass] for the web dashboard. +- [GNU Make][make] for development tasks. +- [Docker][docker] for executing task containers (tested with v1.12, v1.13). +- [dep][dep] for Go dependency vendoring. +- [Make][make] for development/build commands. +- [NodeJS][node] and [NPM][npm] for web dashboard development. + +### Makefile + +Most development tasks are run through `make` commands, including build, release, testing, website docs, lint, tidy, webdash dev, and more. See the [Makefile](https://github.com/ohsu-comp-bio/funnel/blob/master/Makefile) for an up-to-date list of commands. + +### Go Tests + +Run all tests: `make test` +Run the worker tests: `go test ./worker/...` +Run the worker tests with "Cancel" in the name: `go test ./worker -run Cancel` + +You get the idea. See the `go test` docs for more. + +### Mocking + +The [testify][testify] and [mockery][mockery] tools are used to generate and use +mock interfaces in test code, for example, to mock the Google Cloud APIs. + +[go]: https://golang.org +[angular]: https://angularjs.org/ +[protobuf]: https://github.com/google/protobuf +[grpc]: http://www.grpc.io/ +[sass]: http://sass-lang.com/ +[make]: https://www.gnu.org/software/make/ +[docker]: https://docker.io +[python]: https://www.python.org/ +[dep]: https://golang.github.io/dep/ +[node]: https://nodejs.org +[npm]: https://www.npmjs.com/ +[gateway]: https://github.com/grpc-ecosystem/grpc-gateway +[tes]: https://github.com/ga4gh/task-execution-schemas +[testify]: https://github.com/stretchr/testify +[mockery]: https://github.com/vektra/mockery +[gopath]: https://golang.org/doc/code.html#GOPATH + +### Making a release + +- Update Makefile, edit `FUNNEL_VERSION` and `LAST_PR_NUMBER` + - `LAST_PR_NUMBER` can be found by looking at the previous release notes + from the previous release. 
+- Run `make website`, which updates the download links and other content. + - Check the website locally by running `make website-dev` +- Commit these changes. + - Because goreleaser requires a clean working tree in git + - This is a special case where it's easiest to commit to master. +- Create a git tag: `git tag X.Y.Z` +- Run `make release` + - This will build cross-platform binaries, build release notes, + and draft an unpublished GitHub release. + - Check the built artifacts by downloading the tarballs from the GitHub draft release + and running `funnel version`. +- `git push origin master` to push your website and release changes. +- A tagged docker image for the release will be built automatically on [dockerhub](https://hub.docker.com/repository/docker/quay.io/ohsu-comp-bio/funnel). +- Publish the draft release on GitHub. +- Copy `build/release/funnel.rb` to the `ohsu-comp-bio/homebrew-formula/Formula/funnel.rb` Homebrew formula repo, and push those changes to master. diff --git a/docs/tools/funnel/docs/events.md b/docs/tools/funnel/docs/events.md new file mode 100644 index 0000000..87941ef --- /dev/null +++ b/docs/tools/funnel/docs/events.md @@ -0,0 +1,7 @@ +--- +title: Events +menu: + main: + weight: 5 +--- +# Events diff --git a/docs/tools/funnel/docs/events/kafka.md b/docs/tools/funnel/docs/events/kafka.md new file mode 100644 index 0000000..242ce93 --- /dev/null +++ b/docs/tools/funnel/docs/events/kafka.md @@ -0,0 +1,22 @@ +--- +title: Kafka +menu: + main: + parent: Events +--- + +# Kafka + +Funnel supports writing task events to a Kafka topic. To use this, add an event +writer to the config: + +```yaml +EventWriters: + - kafka + - log + +Kafka: + Servers: + - localhost:9092 + Topic: funnel-events +``` diff --git a/docs/tools/funnel/docs/integrations/nextflow.md b/docs/tools/funnel/docs/integrations/nextflow.md new file mode 100644 index 0000000..3090a94 --- /dev/null +++ b/docs/tools/funnel/docs/integrations/nextflow.md @@ -0,0 +1,100 @@ +--- +title: Nextflow +menu: + main: + parent: Integrations +--- + +> ⚠️ Nextflow support is currently in development and requires a few additional steps to run which are included below. + +# Nextflow + +[Nextflow](https://nextflow.io/) is a workflow engine with a [rich ecosystem]() of pipelines centered around biological analysis. + +> Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages. + +> Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters. + +Since Nextflow [includes support](https://www.nextflow.io/docs/latest/executor.html#ga4gh-tes) for the TES API, it can be used in conjunction with Funnel to run tasks or to interact with a common TES endpoint. + +## Getting Started + +To set up Nextflow to use Funnel as the TES executor, run the following steps: + +### 1. Install Nextflow + +*Adapted from the [Nextflow Documentation](https://nextflow.io/docs/latest/install.html)* + +#### a. Install Nextflow: + +```sh +curl -s https://get.nextflow.io | bash +``` + +This will create the nextflow executable in the current directory. + +#### b. Make Nextflow executable: + +```sh +chmod +x nextflow +``` + +#### c. Move Nextflow into an executable path: + +```sh +sudo mv nextflow /usr/local/bin +``` + +#### d. Confirm that Nextflow is installed correctly: + +```sh +nextflow info +``` + +### 2. 
Update Nextflow Config + +Add the following to your `nextflow.config` in order to use the GA4GH TES plugin: + +```yaml +cat <> nextflow.config +plugins { + id 'nf-ga4gh' +} + +process.executor = 'tes' +tes.endpoint = 'http://localhost:8000' # <--- Funnel's default address +EOF +``` + +### 3. Start the Funnel Server + +Start the Funnel server: + +```sh +funnel server run +``` + +### 4. Run Nextflow + +In another window, run the workflow: + +```sh +nextflow run main.nf -c nextflow.config +``` + +## Additional Resources + +- [Nextflow Homepage](https://nextflow.io/) + +- [Nextflow Documentation](https://www.nextflow.io/docs) + +- [Nextflow's TES Support](https://www.nextflow.io/docs/latest/executor.html#ga4gh-tes) + +- [nf-core](https://nf-co.re/) + > A community effort to collect a curated set of analysis pipelines built using Nextflow. + +- [nf-canary](https://github.com/seqeralabs/nf-canary) + > A minimal Nextflow workflow for testing infrastructure. + +- [Nextflow Patterns](https://nextflow-io.github.io/patterns/) + > A curated collection of Nextflow implementation patterns diff --git a/docs/tools/funnel/docs/integrations/py-tes.md b/docs/tools/funnel/docs/integrations/py-tes.md new file mode 100644 index 0000000..7b12061 --- /dev/null +++ b/docs/tools/funnel/docs/integrations/py-tes.md @@ -0,0 +1,50 @@ +--- +title: py-tes +menu: + main: + parent: Integrations +--- + +> ⚠️ py-tes support is in active development and may be subject to change. + +# py-tes + +[py-tes](https://github.com/ohsu-comp-bio/py-tes) is a library for interacting with servers implementing the [GA4GH Task Execution Schema](https://github.com/ga4gh/task-execution-schemas). + +## Getting Started + +### Install + +Available on [PyPI](https://pypi.org/project/py-tes/). + +```sh +pip install py-tes +``` + +### Example Python Script + +```py +import tes + +task = tes.Task( + executors=[ + tes.Executor( + image="alpine", + command=["echo", "hello"] + ) + ] +) + +cli = tes.HTTPClient("http://funnel.example.com", timeout=5) +task_id = cli.create_task(task) +res = cli.get_task(task_id) +cli.cancel_task(task_id) +``` + +## Additional Resources + +- [py-tes Homepage](https://github.com/ohsu-comp-bio/py-tes) + +- [py-tes Documentation](https://ohsu-comp-bio.github.io/py-tes/) + +- [py-tes on PyPi](https://pypi.org/project/py-tes/) diff --git a/docs/tools/funnel/docs/metrics.md b/docs/tools/funnel/docs/metrics.md new file mode 100644 index 0000000..1077112 --- /dev/null +++ b/docs/tools/funnel/docs/metrics.md @@ -0,0 +1,8 @@ +--- +title: Metrics +menu: + main: + identifier: Metrics + weight: 6 +--- +# Metrics diff --git a/docs/tools/funnel/docs/metrics/prometheus.md b/docs/tools/funnel/docs/metrics/prometheus.md new file mode 100644 index 0000000..1b3495b --- /dev/null +++ b/docs/tools/funnel/docs/metrics/prometheus.md @@ -0,0 +1,36 @@ +--- +title: Prometheus +menu: + main: + parent: Metrics +--- + +# Prometheus + +[Prometheus][prom] is a monitoring and metrics collection service. It pulls metrics +from various "exporters", collects them in a time-series database, provides +a query langauge for access that data, and integrates closely with tools +such as [Grafana][graf] for visualization and dashboard building. + +Funnel exports these metrics: + +- `funnel_tasks_state_count`: the number of tasks + in each state (queued, running, etc). +- `funnel_nodes_state_count`: the number of nodes + in each state (alive, dead, draining, etc). +- `funnel_nodes_total_cpus`: the total number + of CPUs available by all nodes. 
+- `funnel_nodes_total_ram_bytes`: the total number + of bytes of RAM available by all nodes. +- `funnel_nodes_total_disk_bytes`: the total number + of bytes of disk space available by all nodes. +- `funnel_nodes_available_cpus`: the available number + of CPUs available by all nodes. +- `funnel_nodes_available_ram_bytes`: the available number + of bytes of RAM available by all nodes. +- `funnel_nodes_available_disk_bytes`: the available number + of bytes of disk space available by all nodes. + +[prom]: https://prometheus.io/ +[gauge]: https://prometheus.io/docs/concepts/metric_types/#gauge +[graf]: https://grafana.com/ diff --git a/docs/tools/funnel/docs/security.md b/docs/tools/funnel/docs/security.md new file mode 100644 index 0000000..c3dba45 --- /dev/null +++ b/docs/tools/funnel/docs/security.md @@ -0,0 +1,8 @@ +--- +title: Security +menu: + main: + weight: 10 +--- + +# Security diff --git a/docs/tools/funnel/docs/security/advanced.md b/docs/tools/funnel/docs/security/advanced.md new file mode 100644 index 0000000..3864e34 --- /dev/null +++ b/docs/tools/funnel/docs/security/advanced.md @@ -0,0 +1,29 @@ +--- +title: Advanced Auth +menu: + main: + parent: Security + weight: 10 +--- + +# Overview 🔐 + +Thanks to our collaborators at CTDS — Funnel is currently adding support for "Per-User/Per-Bucket" credentials to allow Users to access S3 Buckets without having to store their credentials in the Funnel Server. + +The high level overview of this feature will be such Funnel will be able to speak with a custom credential "Wrapper Script" that will: + +- Take the User Credentials +- Create an S3 Bucket +- Generate a Key (optionally for use in Nextflow Config) +- Send the Key to Funnel + +In this way this Wrapper can manage the bucket and the keys (the Wrapper would be the middleware between the User and Funnel). + +Stay tuned for this feature's development! This feature is being tracked with the following: + +- GitHub Branch: https://github.com/ohsu-comp-bio/funnel/tree/feature/credentials +- Pull Request: https://github.com/ohsu-comp-bio/funnel/pull/1098 + +# Credits 🙌 + +This feature and its development would not be possible without our continuing collaboration with [Pauline Ribeyre](https://github.com/paulineribeyre), [Jawad Qureshi](https://github.com/jawadqur), [Michael Fitzsimons](https://www.linkedin.com/in/michael-fitzsimons-ab8a6111), and the entire [CTDS](https://ctds.uchicago.edu) team at the [University of Chicago](https://www.uchicago.edu/)! diff --git a/docs/tools/funnel/docs/security/basic.md b/docs/tools/funnel/docs/security/basic.md new file mode 100644 index 0000000..0b19e07 --- /dev/null +++ b/docs/tools/funnel/docs/security/basic.md @@ -0,0 +1,59 @@ +--- +title: Basic Auth +menu: + main: + parent: Security + weight: 10 +--- +# Basic Auth + +By default, a Funnel server allows open access to its API endpoints, but it +can be configured to require basic password authentication. 
To enable this, +include users and passwords in your config file: + +```yaml +Server: + BasicAuth: + - User: admin + Password: someReallyComplexSecret + Admin: true + - User: funnel + Password: abc123 + + TaskAccess: OwnerOrAdmin +``` + +The `TaskAccess` property configures the visibility and access-mode for tasks: + +* `All` (default) - all tasks are visible to everyone +* `Owner` - tasks are visible to the users who created them +* `OwnerOrAdmin` - extends `Owner` by allowing Admin-users (`Admin: true`) + access everything + +As new tasks are created, the username behind the request is recorded as the +owner of the task. Depending on the `TaskAccess` property, if owner-based +acces-mode is enabled, the owner of the task is compared to username of current +request to decide if the user may see and interact with the task. + +If you are using BoltDB or Badger, the Funnel worker communicates to the server via gRPC +so you will also need to configure the RPC client. + +```yaml +RPCClient: + User: funnel + Password: abc123 +``` + +Make sure to properly protect the configuration file so that it's not readable +by everyone: + +```bash +$ chmod 600 funnel.config.yml +``` + +To use the password, set the `FUNNEL_SERVER_USER` and `FUNNEL_SERVER_PASSWORD` environment variables: +```bash +$ export FUNNEL_SERVER_USER=funnel +$ export FUNNEL_SERVER_PASSWORD=abc123 +$ funnel task list +``` diff --git a/docs/tools/funnel/docs/security/oauth2.md b/docs/tools/funnel/docs/security/oauth2.md new file mode 100644 index 0000000..4b4232d --- /dev/null +++ b/docs/tools/funnel/docs/security/oauth2.md @@ -0,0 +1,74 @@ +--- +title: OAuth2 +menu: + main: + parent: Security + weight: 10 +--- +# OAuth2 + +By default, a Funnel server allows open access to its API endpoints, but in +addition to Basic authentication it can also be configured to require a valid +JWT in the request. + +Funnel itself does not redirect users to perform the login. +It just validates that the presented token is issued by a trusted service +(specified in the YAML configuration file) and the token has not expired. +In addition, if the OIDC provides a token introspection endpoint (in its +configuration JSON), Funnel server also calls that endpoint to make sure the +token is still active (i.e., no token invalidation before expiring). + +Optionally, Funnel can also validate the scope and audience claims to contain +specific values. + +To enable JWT authentication, specify `OidcAuth` section in your config file: + +```yaml +Server: + OidcAuth: + # URL of the OIDC service configuration: + ServiceConfigURL: "https://my.oidc.service/.well-known/openid-configuration" + + # Client ID and secret are sent with the token introspection request + # (Basic authentication): + ClientId: your-client-id + ClientSecret: your-client-secret + + # Optional: if specified, this scope value must be in the token: + RequireScope: funnel-id + + # Optional: if specified, this audience value must be in the token: + RequireAudience: tes-api + + # The URL where OIDC should redirect after login (keep the path '/login') + RedirectURL: "http://localhost:8000/login" + + # List of OIDC subjects promoted to Admin status. 
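+  # Each entry is an OIDC subject (often an email address); the values below are placeholders.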
+ Admins: + - user.one@example.org + - user.two@example.org + + TaskAccess: OwnerOrAdmin +``` + +The `TaskAccess` property configures the visibility and access-mode for tasks: + +* `All` (default) - all tasks are visible to everyone +* `Owner` - tasks are visible to the users who created them +* `OwnerOrAdmin` - extends `Owner` by allowing Admin-users (defined under + `Admins`) access everything + +As new tasks are created, the username behind the request is recorded as the +owner of the task. Depending on the `TaskAccess` property, if owner-based +acces-mode is enabled, the owner of the task is compared to username of current +request to decide if the user may see and interact with the task. + +Make sure to properly protect the configuration file so that it's not readable +by everyone: + +```bash +$ chmod 600 funnel.config.yml +``` + +Note that the Funnel UI supports login through an OIDC service. However, OIDC +authentication is not supported at command-line. diff --git a/docs/tools/funnel/docs/storage.md b/docs/tools/funnel/docs/storage.md new file mode 100644 index 0000000..9297161 --- /dev/null +++ b/docs/tools/funnel/docs/storage.md @@ -0,0 +1,8 @@ +--- +title: Storage +menu: + main: + identifier: Storage + weight: -10 +--- +# Storage diff --git a/docs/tools/funnel/docs/storage/ftp.md b/docs/tools/funnel/docs/storage/ftp.md new file mode 100644 index 0000000..79b2439 --- /dev/null +++ b/docs/tools/funnel/docs/storage/ftp.md @@ -0,0 +1,38 @@ +--- +title: FTP +menu: + main: + parent: Storage +--- + +# FTP + +Funnel supports download and uploading files via FTP. + +Currently authentication credentials are take from the URL, e.g. `ftp://username:password@ftp.host.tld`. This will be improved soon to allow credentials to be added to the configuration file. + +The FTP storage client is enabled by default, but may be explicitly disabled in the +worker config: + +```yaml +FTPStorage: + Disabled: false +``` + +### Example task +```json +{ + "name": "Hello world", + "inputs": [{ + "url": "ftp://my.ftpserver.xyz/hello.txt", + "path": "/inputs/hello.txt" + }, { + "url": "ftp://user:mypassword123@my.ftpserver.xyz/hello.txt", + "path": "/inputs/hello.txt" + }], + "executors": [{ + "image": "alpine", + "command": ["cat", "/inputs/hello.txt"], + }] +} +``` diff --git a/docs/tools/funnel/docs/storage/google-storage.md b/docs/tools/funnel/docs/storage/google-storage.md new file mode 100644 index 0000000..d8fde4f --- /dev/null +++ b/docs/tools/funnel/docs/storage/google-storage.md @@ -0,0 +1,43 @@ +--- +title: Google Storage +menu: + main: + parent: Storage +--- + +# Google Storage + +Funnel supports using [Google Storage][gs] (GS) for file storage. + +The Google storage client is enabled by default, and will try to automatically +load credentials from the environment. Alternatively, you +may explicitly set the credentials in the worker config: + +```yaml +GoogleStorage: + Disabled: false + # Path to account credentials file. 
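+  # Optional. If left empty, credentials are discovered from the environment;
+  # otherwise point this at a service-account JSON key (path illustrative, e.g. /etc/funnel/gcs-sa.json).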
+ AccountFile: "" +``` + +### Example task +```json +{ + "name": "Hello world", + "inputs": [{ + "url": "gs://funnel-bucket/hello.txt", + "path": "/inputs/hello.txt" + }], + "outputs": [{ + "url": "gs://funnel-bucket/output.txt", + "path": "/outputs/hello-out.txt" + }], + "executors": [{ + "image": "alpine", + "command": ["cat", "/inputs/hello.txt"], + "stdout": "/outputs/hello-out.txt", + }] +} +``` + +[gs]: https://cloud.google.com/storage/ diff --git a/docs/tools/funnel/docs/storage/http.md b/docs/tools/funnel/docs/storage/http.md new file mode 100644 index 0000000..8192205 --- /dev/null +++ b/docs/tools/funnel/docs/storage/http.md @@ -0,0 +1,37 @@ +--- +title: HTTP(S) +menu: + main: + parent: Storage +--- + +# HTTP(S) + +Funnel supports downloading files from public URLs via GET requests. No authentication +mechanism is allowed. This backend can be used to fetch objects from cloud storage +providers exposed using presigned URLs. + +The HTTP storage client is enabled by default, but may be explicitly disabled in the +worker config: + +```yaml +HTTPStorage: + Disabled: false + # Timeout for http(s) GET requests. + Timeout: 30s +``` + +### Example task +```json +{ + "name": "Hello world", + "inputs": [{ + "url": "http://fakedomain.com/hello.txt", + "path": "/inputs/hello.txt" + }], + "executors": [{ + "image": "alpine", + "command": ["cat", "/inputs/hello.txt"], + }] +} +``` diff --git a/docs/tools/funnel/docs/storage/local.md b/docs/tools/funnel/docs/storage/local.md new file mode 100644 index 0000000..eb68669 --- /dev/null +++ b/docs/tools/funnel/docs/storage/local.md @@ -0,0 +1,63 @@ +--- +title: Local +menu: + main: + parent: Storage + weight: -10 +--- + +# Local + +Funnel supports using the local filesystem for file storage. + +Funnel limits which directories may be accessed, by default only allowing directories +under the current working directory of the Funnel worker. + +Config: +```yaml +LocalStorage: + # Whitelist of local directory paths which Funnel is allowed to access. + AllowedDirs: + - ./ + - /path/to/allowed/dir + - ...etc +``` + +### Example task + +Files must be absolute paths in `file:///path/to/file.txt` URL form. + +``` +{ + "name": "Hello world", + "inputs": [{ + "url": "file:///path/to/funnel-data/hello.txt", + "path": "/inputs/hello.txt" + }], + "outputs": [{ + "url": "file:///path/to/funnel-data/output.txt", + "path": "/outputs/hello-out.txt" + }], + "executors": [{ + "image": "alpine", + "command": ["cat", "/inputs/hello.txt"], + "stdout": "/outputs/hello-out.txt", + }] +} +``` + +### File hard linking behavior + +For efficiency, Funnel will attempt not to copy the input files, instead trying +create a hard link to the source file. In some cases this isn't possible. For example, +if the source file is on a network file system mount (e.g. NFS) but the Funnel worker's +working directory is on the local scratch disk, a hard link would cross a file system +boundary, which is not possible. In this case, Funnel will copy the file. + +### File ownership behavior + +One difficult area of files and Docker containers is file owner/group management. +If a Docker container runs as root, it's likely that the file will end up being owned +by root on the host system. In this case, some step (Funnel or another task) will +likely fail to access it. This is a tricky problem with no good solution yet. +See [issue 66](https://github.com/ohsu-comp-bio/funnel/issues/66). 
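+
+As a rough check of whether the hard-linking fast path described above can apply, compare the device IDs of an input file and the Funnel worker's working directory. The paths below are illustrative, and `stat -c` is the GNU/Linux form (macOS uses `stat -f`):
+
+```sh
+# If both lines report the same device ID, the paths share a filesystem and a
+# hard link is possible; if the IDs differ, Funnel falls back to copying the file.
+stat -c '%d  %n' /path/to/funnel-data/hello.txt ./funnel-work-dir
+```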
diff --git a/docs/tools/funnel/docs/storage/s3.md b/docs/tools/funnel/docs/storage/s3.md new file mode 100644 index 0000000..75a0271 --- /dev/null +++ b/docs/tools/funnel/docs/storage/s3.md @@ -0,0 +1,96 @@ +--- +title: S3 +menu: + main: + parent: Storage +--- + +# S3 + +## Amazon S3 + +Funnel supports using [AWS S3](https://aws.amazon.com/s3/) for file storage. + +The Amazon S3 storage client is enabled by default, and will try to automatically +load credentials from the environment. Alternatively, you +may explicitly set the credentials in the worker config: + +```yaml +AmazonS3: + Disabled: false + # The maximum number of times that a request will be retried for failures. + MaxRetries: 10 + Key: "" + Secret: "" +``` + +The Amazon S3 storage client also supports SSE-KMS and SSE-C configurations. + +For SSE-KMS as long as your credentials can access the KMS key used for the +given bucket, no special configuration is required. However, you can specifiy a +specific KMS key if desired: + +```yaml +AmazonS3: + SSE: + KMSKey: "1a03ce70-5f03-484e-8396-0e97de661b79" +``` + +For SSE-C: + +Generate a key file: + +```sh +openssl rand -out sse-c.key 32 +``` + +Then configure the storage client to use it: + +```yaml +AmazonS3: + SSE: + CustomerKeyFile: "./sse-c.key" +``` + +Note that this file will need to be available to all Funnel workers. + +## Other S3 API Providers + +Funnel also supports using non-Amazon S3 API providers ([Ceph][ceph], +[Cleversafe][cleversafe], [Minio][minio], etc.) for file storage. + +These other S3 storage clients are NOT enabled by default. You must configure them. + +This storage client also supports the [version 4 signing process](https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html). + +```yaml +GenericS3: + - Disabled: false + Endpoint: "" + Key: "" + Secret: "" +``` + +### Example task +```json +{ + "name": "Hello world", + "inputs": [{ + "url": "s3://funnel-bucket/hello.txt", + "path": "/inputs/hello.txt" + }], + "outputs": [{ + "url": "s3://funnel-bucket/output.txt", + "path": "/outputs/hello-out.txt" + }], + "executors": [{ + "image": "alpine", + "command": ["cat", "/inputs/hello.txt"], + "stdout": "/outputs/hello-out.txt" + }] +} +``` + +[ceph]: http://ceph.com/ +[cleversafe]: https://www.ibm.com/cloud/object-storage +[minio]: https://minio.io/ diff --git a/docs/tools/funnel/docs/storage/swift.md b/docs/tools/funnel/docs/storage/swift.md new file mode 100644 index 0000000..3de323a --- /dev/null +++ b/docs/tools/funnel/docs/storage/swift.md @@ -0,0 +1,53 @@ +--- +title: OpenStack Swift +menu: + main: + parent: Storage +--- + +# OpenStack Swift + +Funnel supports using [OpenStack Swift][swift] for file storage. + +The Swift storage client is enabled by default, and will try to automatically +load credentials from the environment. Alternatively, you +may explicitly set the credentials in the worker config: + +```yaml +Swift: + Disabled: false + UserName: "" + Password: "" + AuthURL: "" + TenantName: "" + TenantID: "" + RegionName: "" + # 500 MB + ChunkSizeBytes: 500000000 +``` + +### Example task +```json +{ + "name": "Hello world", + "inputs": [{ + "url": "swift://funnel-bucket/hello.txt", + "path": "/inputs/hello.txt" + }], + "outputs": [{ + "url": "swift://funnel-bucket/output.txt", + "path": "/outputs/hello-out.txt" + }], + "executors": [{ + "image": "alpine", + "command": ["cat", "/inputs/hello.txt"], + "stdout": "/outputs/hello-out.txt", + }] +} +``` + +### Known Issues: + +The config currently only supports OpenStack v2 auth. 
See [issue #336](https://github.com/ohsu-comp-bio/funnel/issues/336). + +[swift]: https://docs.openstack.org/swift/latest/ diff --git a/docs/tools/funnel/docs/tasks.md b/docs/tools/funnel/docs/tasks.md new file mode 100644 index 0000000..ed9d25f --- /dev/null +++ b/docs/tools/funnel/docs/tasks.md @@ -0,0 +1,494 @@ +--- +title: Tasks +menu: + main: + identifier: tasks + weight: -70 +--- + +# Tasks + +A task defines a unit of work: + +- metadata +- input files to download +- a sequence of Docker containers + commands to run, +- output files to upload +- state +- logs + +The example task below downloads a file named `hello.txt` from S3 and calls `cat hello.txt` using the [alpine][alpine] container. This task also writes the executor's stdout to a file, and uploads the stdout to s3. + +``` +{ + "name": "Hello world", + "inputs": [{ + # URL to download file from. + "url": "s3://funnel-bucket/hello.txt", + # Path to download file to. + "path": "/inputs/hello.txt" + }], + "outputs": [{ + # URL to upload file to. + "url": "s3://funnel-bucket/output.txt", + # Local path to upload file from. + "path": "/outputs/stdout" + }], + "executors": [{ + # Container image name. + "image": "alpine", + # Command to run (argv). + "command": ["cat", "/inputs/hello.txt"], + # Capture the stdout of the command to /outputs/stdout + "stdout": "/outputs/stdout" + }] +} +``` + +Tasks have multiple "executors"; containers and commands run in a sequence. +Funnel runs executors via Docker. + +Tasks also have state and logs: +``` +{ + "id": "b85khc2rl6qkqbhg8vig", + "state": "COMPLETE", + "name": "Hello world", + "inputs": [ + { + "url": "s3://funnel-bucket/hello.txt", + "path": "/inputs/hello.txt" + } + ], + "outputs": [ + { + "url": "s3://funnel-bucket/output.txt", + "path": "/outputs/stdout" + } + ], + "executors": [ + { + "image": "alpine", + "command": [ + "cat", + "/inputs/hello.txt" + ], + "stdout": "/outputs/stdout" + } + ], + "logs": [ + { + "logs": [ + { + "startTime": "2017-11-14T11:49:05.127885125-08:00", + "endTime": "2017-11-14T11:49:08.484461502-08:00", + "stdout": "Hello, Funnel!\n" + } + ], + "startTime": "2017-11-14T11:49:04.433593468-08:00", + "endTime": "2017-11-14T11:49:08.487707039-08:00" + } + ], + "creationTime": "2017-11-14T11:49:04.427163701-08:00" +} +``` + +There are logs for each task attempt and each executor. Notice that the stdout is +conveniently captured by `logs[0].logs[0].stdout`. + +### Task API + +The API lets you create, get, list, and cancel tasks. + +### Create +``` +POST /v1/tasks +{ + "name": "Hello world", + "inputs": [{ + "url": "s3://funnel-bucket/hello.txt", + "path": "/inputs/hello.txt" + }], + "outputs": [{ + "url": "s3://funnel-bucket/output.txt", + "path": "/outputs/stdout" + }], + "executors": [{ + "image": "alpine", + "command": ["cat", "/inputs/hello.txt"], + "stdout": "/outputs/stdout" + }] +} + + +# The response is a task ID: +b85khc2rl6qkqbhg8vig +``` + +### Get +``` +GET /v1/tasks/b85khc2rl6qkqbhg8vig + +{"id": "b85khc2rl6qkqbhg8vig", "state": "COMPLETE"} +``` + +By default, the minimal task view is returned which describes only the ID and state. 
+In order to get the original task with some basic logs, use the "BASIC" task view: +``` +GET /v1/tasks/b85khc2rl6qkqbhg8vig?view=BASIC +{ + "id": "b85khc2rl6qkqbhg8vig", + "state": "COMPLETE", + "name": "Hello world", + "inputs": [ + { + "url": "gs://funnel-bucket/hello.txt", + "path": "/inputs/hello.txt" + } + ], + "outputs": [ + { + "url": "s3://funnel-bucket/output.txt", + "path": "/outputs/stdout" + } + ], + "executors": [ + { + "image": "alpine", + "command": [ + "cat", + "/inputs/hello.txt" + ], + "stdout": "/outputs/stdout", + } + ], + "logs": [ + { + "logs": [ + { + "startTime": "2017-11-14T11:49:05.127885125-08:00", + "endTime": "2017-11-14T11:49:08.484461502-08:00", + } + ], + "startTime": "2017-11-14T11:49:04.433593468-08:00", + "endTime": "2017-11-14T11:49:08.487707039-08:00" + } + ], + "creationTime": "2017-11-14T11:49:04.427163701-08:00" +} +``` + +The "BASIC" doesn't include some fields such as stdout/err logs, because these fields may be potentially large. +In order to get everything, use the "FULL" view: +``` +GET /v1/tasks/b85khc2rl6qkqbhg8vig?view=FULL +{ + "id": "b85khc2rl6qkqbhg8vig", + "state": "COMPLETE", + "name": "Hello world", + "inputs": [ + { + "url": "gs://funnel-bucket/hello.txt", + "path": "/inputs/hello.txt" + } + ], + "executors": [ + { + "image": "alpine", + "command": [ + "cat", + "/inputs/hello.txt" + ], + "stdout": "/outputs/stdout", + } + ], + "logs": [ + { + "logs": [ + { + "startTime": "2017-11-14T11:49:05.127885125-08:00", + "endTime": "2017-11-14T11:49:08.484461502-08:00", + "stdout": "Hello, Funnel!\n" + } + ], + "startTime": "2017-11-14T11:49:04.433593468-08:00", + "endTime": "2017-11-14T11:49:08.487707039-08:00" + } + ], + "creationTime": "2017-11-14T11:49:04.427163701-08:00" +} +``` + +### List +``` +GET /v1/tasks +{ + "tasks": [ + { + "id": "b85l8tirl6qkqbhg8vj0", + "state": "COMPLETE" + }, + { + "id": "b85khc2rl6qkqbhg8vig", + "state": "COMPLETE" + }, + { + "id": "b85kgt2rl6qkpuptua70", + "state": "SYSTEM_ERROR" + }, + { + "id": "b857gnirl6qjfou61fh0", + "state": "SYSTEM_ERROR" + } + ] +} +``` + +List has the same task views as Get: MINIMAL, BASIC, and FULL. + +The task list is paginated: +``` +GET /v1/tasks?page_token=1h123h12j2h3k +{ + "next_page_token": "1n3n1j23k12n3k123", + "tasks": [ + { + "id": "b85l8tirl6qkqbhg8vj0", + "state": "COMPLETE" + }, + # ... more tasks here ... + ] +} +``` + +### Cancel + +Tasks cannot be modified by the user after creation, with one exception – they can be canceled. +``` +POST /v1/tasks/b85l8tirl6qkqbhg8vj0:cancel +``` + + +### Full task spec + +Here's a more detailed description of a task. +For a full, in-depth spec, read the TES standard's [task_execution.proto](https://github.com/ga4gh/task-execution-schemas/blob/master/task_execution.proto). + +``` +{ + # The task's ID. Set by the server. + # Output only. + "id": "1234567", + + # The task's state. Possible states: + # QUEUED + # INITILIZING + # RUNNING + # PAUSED + # COMPLETE + # EXECUTOR_ERROR + # SYSTEM_ERROR + # CANCELED + # + # Output only. + "state": "QUEUED", + + # Metadata + "name": "Task name.", + "description": "Task description.", + "tags": { + "custom-tag-1": "tag-value-1", + "custom-tag-2": "tag-value-2", + }, + + # Resource requests + "resources": { + # Number of CPU cores requested. + "cpuCores": 1, + + # RAM request, in gigabytes. + "ramGb": 1.0, + + # Disk space request, in gigabytes. + "diskGb": 100.0, + + # Request preemptible machines, + # e.g. preemptible VM in Google Cloud, an instance from the AWS Spot Market, etc. 
+ "preemptible": false, + + # Request that the task run in these compute zones. + "zones": ["zone1", "zone2"], + }, + + # Input files will be downloaded by the worker. + # This example uses s3, but Funnel supports multiple filesystems. + "inputs": [ + { + "name": "Input file.", + "description": "Input file description.", + + # URL to download file from. + "url": "s3://my-bucket/object/path/file.txt", + # Path to download file to. + "path": "/container/input.txt" + }, + { + "name": "Input directory.", + "description": "Directories are also supported.", + "url": "s3://my-bucket/my-data/", + "path": "/inputs/my-data/", + "type": "DIRECTORY" + }, + + # A task may include the file content directly in the task message. + # This is sometimes useful for small files such as scripts, + # which you want to include without talking directly to the filesystem. + { + "path": "/inputs/script.py", + "content": "import socket; print socket.gethostname()" + } + ], + + # Output files will be uploaded to storage by the worker. + "outputs": [ + { + "name": "Output file.", + "description": "Output file description.", + "url": "s3://my-bucket/output-data/results.txt", + "path": "/outputs/results.txt" + }, + { + "name": "Output directory.", + "description": "Directories are also supported.", + "url": "s3://my-bucket/output-data/output-dir/", + "path": "/outputs/data-dir/", + "type": "DIRECTORY" + } + ], + + # Executors define a sequence of containers + commands to run. + # Execution stop on the first non-zero exit code. + "executors": [ + { + # Container image name. + # Funnel supports running executor containers via Docker. + "image": "ubuntu", + + # Command arguments (argv). + # The first item is the executable to run. + "command": ["my-tool-1", "/container/input"], + + # Local file path to read stdin from. + "stdin": "/inputs/stdin.txt", + + # Local file path to write stdout to. + "stdout": "/container/output", + + # Local file path to write stderr to. + "stderr": "/container/stderr", + + # Set the working directory before executing the command. + "workdir": "/data/workdir", + + # Environment variables + "env": { + "ENV1": "value1", + "ENV2": "value2", + } + }, + + # Second executor runs after the first completes, on the same machine. + { + "image": "ubuntu", + "command": ["cat", "/container/input"], + "stdout": "/container/output", + "stderr": "/container/stderr", + "workdir": "/tmp" + } + ] + + # Date/time the task was created. + # Set the the server. + # Output only. + "creationTime": "2017-11-14T11:49:04.427163701-08:00" + + # Task logs. + # Output only. + # + # If there's a system error, the task may be attempted multiple times, + # so this field is a list of attempts. In most cases, there will be only + # one or zero entries here. + "logs": [ + + # Attempt start/end times, in RFC3339 format. + "startTime": "2017-11-14T11:49:04.433593468-08:00", + "endTime": "2017-11-14T11:49:08.487707039-08:00" + + # Arbitrary metadata set by Funnel. + "metadata": { + "hostname": "worker-1", + }, + + # Arbitrary system logs which Funnel thinks are useful to the user. + "systemLogs": [ + "task was assigned to worker 1", + "docker command: docker run -v /vol:/data alpine cmd arg1 arg2", + ], + + # Log of files uploaded to storage by the worker, + # including all files in directories, with file sizes. 
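+    # (Each entry records the destination URL, the path inside the container, and the uploaded size in bytes.)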
+ "outputs": [ + { + "url": "s3://my-bucket/output-data/results.txt", + "path": "/outputs/results.txt", + "sizeBytes": 123 + }, + { + "url": "s3://my-bucket/output-data/output-dir/file1.txt", + "path": "/outputs/data-dir/file1.txt", + "sizeBytes": 123 + }, + { + "url": "s3://my-bucket/output-data/output-dir/file2.txt", + "path": "/outputs/data-dir/file2.txt", + "sizeBytes": 123 + } + { + "url": "s3://my-bucket/output-data/output-dir/subdir/file3.txt", + "path": "/outputs/data-dir/subdir/file3.txt", + "sizeBytes": 123 + } + ], + + # Executor logs. One entry per executor. + "logs": [ + { + # Executor start/end time, in RFC3339 format. + "startTime": "2017-11-14T11:49:05.127885125-08:00", + "endTime": "2017-11-14T11:49:08.484461502-08:00", + + # Executor stdout/err. Only available in the FULL task view. + # + # There is a size limit for these fields, which is configurable + # and defaults to 10KB. If more than 10KB is generated, only the + # tail will be logged. If the full output is needed, the task + # may use Executor.stdout and an output to upload the full content + # to storage. + "stdout": "Hello, Funnel!", + "stderr": "", + + # Exit code + "exit_code": 0, + }, + { + "startTime": "2017-11-14T11:49:05.127885125-08:00", + "endTime": "2017-11-14T11:49:08.484461502-08:00", + "stdout": "Hello, Funnel!\n" + } + ], + } + ], +} +``` + +[alpine]: https://hub.docker.com/_/alpine/ diff --git a/docs/tools/funnel/download.md b/docs/tools/funnel/download.md new file mode 100644 index 0000000..6a150d9 --- /dev/null +++ b/docs/tools/funnel/download.md @@ -0,0 +1,35 @@ +--- +title: Download +menu: + main: + weight: -2000 +--- + +## Releases + +See the [Releases](https://github.com/ohsu-comp-bio/funnel/releases) page for release history. + + +--8<-- "docs/tools/funnel/_releases.md" + +## Homebrew + +```sh +brew tap ohsu-comp-bio/formula +brew install funnel@0.11 +``` + +## Build the lastest development version + +In order to build the latest code, run: +```shell +$ git clone https://github.com/ohsu-comp-bio/funnel.git +$ cd funnel +$ make +``` + +Funnel requires Go 1.21+. Check out the [development docs][dev] for more detail. + + +[dev]: ./docs/development/developers.md +[docker]: https://docker.io diff --git a/docs/tools/funnel/index.md b/docs/tools/funnel/index.md new file mode 100644 index 0000000..4c780e5 --- /dev/null +++ b/docs/tools/funnel/index.md @@ -0,0 +1,197 @@ +--- +title: Funnel +--- + +## Funnel Tool Documentation + +The Funnel tool is designed to streamline data processing workflows, enabling efficient data transformation and analysis. Key features include: + +- **S3 Integration**: Seamlessly add and manage files from Amazon S3. +- **Data Transformation**: Predefined pipelines for common data processing tasks. +- **Automation**: Schedule and automate repetitive data workflows. +- **Monitoring**: Track the status and performance of data jobs in real-time. +- **Workflow engine compatibile**: Compatible with Nextflow + +## Simple API +A task describes metadata, state, input/output files, resource requests, commands, and logs. + +The task API has four actions: create, get, list, and cancel. + +Funnel serves both HTTP/JSON and gRPC/Protobuf. + +##Standards based + +The Task API is developed via an open standard effort. + +## Workers +Given a task, Funnel will queue it, schedule it to a worker, and track its state and logs. + +A worker will download input files, run a sequence of Docker containers, upload output files, and emits events and logs along the way. 
+ +## Cross platform +We use Funnel on AWS, Google Cloud, OpenStack, and the good ol' university HPC cluster. + +## Adaptable + +A wide variety of options make Funnel easily adaptable: + + - BoltDB + - Elasticsearch + - MongoDB + - AWS Batch, S3, DynamoDB + - OpenStack Swift + - Google Cloud Storage, Datastore + - Kafka + - HPC support: HTCondor, Slurm, etc. + - and more + + --- + +# Define a task + +A task describes metadata, state, input/output files, resource requests, commands, and logs. + +For a full description of the task fields, see the task API docs and the the task schema. + + +``` +$ funnel examples hello-world +{ + "name": "Hello world", + "description": "Demonstrates the most basic echo task.", + "executors": [ + { + "image": "alpine", + "command": ["echo", "hello world"], + } + ] +} +``` + +--- + +# Start a Funnel server + +localhost:8000 is the HTTP API and web dashboard. +localhost:9090 is the gRPC API (for internal communication) + + +``` +$ funnel server run +server Server listening +httpPort 8000 +rpcAddress :9090 +``` + + +--- + +# Create a task + +The output is the task ID. + +This example uses the development server, which will run the task locally via Docker. + + +``` +$ funnel examples hello-world > hello-world.json +$ funnel task create hello-world.json +b8581farl6qjjnvdhqn0 +``` + +--- + +# Get the task + +The output is the task with state and logs. + +By default, the CLI returns the "full" task view, which includes all logs plus stdout/err content. + + +``` +$ funnel task get b8581farl6qjjnvdhqn0 +{ + "id": "b8581farl6qjjnvdhqn0", + "state": "COMPLETE", + "name": "Hello world", + "description": "Demonstrates the most basic echo task.", + "executors": [ + { + "image": "alpine", + "command": [ + "echo", + "hello world" + ], + } + ], + "logs": [ + { + "logs": [ + { + "startTime": "2017-11-13T21:35:57.548592769-08:00", + "endTime": "2017-11-13T21:36:01.871905687-08:00", + "stdout": "hello world\n" + } + ], + "startTime": "2017-11-13T21:35:57.547408797-08:00", + "endTime": "2017-11-13T21:36:01.87496482-08:00" + } + ], + "creationTime": "2017-11-13T21:35:57.543528992-08:00" +} +``` + +--- + +List the tasks + + +``` +$ funnel task list --view MINIMAL +{ + "tasks": [ + { + "id": "b8581farl6qjjnvdhqn0", + "state": "COMPLETE" + }, + ... + ] +} +``` + +--- + +# Quickly create tasks + +The "run" command makes it easy to quickly create a task. By default, commands are wrapped in "sh -c" and run in the "alpine" container. + +Use the "--print" flag to print the task instead of running it immediately. 
+ + +``` +$ funnel run 'md5sum $src' --in src=~/src.txt --print +{ + "name": "sh -c 'md5sum $src'", + "inputs": [ + { + "name": "src", + "url": "file:///Users/buchanae/src.txt", + "path": "/inputs/Users/buchanae/src.txt" + } + ], + "executors": [ + { + "image": "alpine", + "command": [ + "sh", + "-c", + "md5sum $src" + ], + "env": { + "src": "/inputs/Users/buchanae/src.txt" + } + } + ], +} + +``` diff --git a/docs/tools/git-drs/.nav.yml b/docs/tools/git-drs/.nav.yml new file mode 100644 index 0000000..0db3d0c --- /dev/null +++ b/docs/tools/git-drs/.nav.yml @@ -0,0 +1,5 @@ +title: Git-DRS +nav: + - Quick Start: quickstart.md + - Troubleshooting: troubleshooting.md + - Developer Guide: developer-guide.md diff --git a/docs/tools/git-drs/developer-guide.md b/docs/tools/git-drs/developer-guide.md new file mode 100644 index 0000000..749ed0f --- /dev/null +++ b/docs/tools/git-drs/developer-guide.md @@ -0,0 +1,101 @@ +# Git DRS — How It Works + +This document describes the internal architecture, pointer file format, and supported cloud backends for Git DRS. For the user-facing command reference and getting started guide, see the [Git DRS Quickstart](quickstart.md). + +## How It Works + +Git DRS leverages the same **clean / smudge filter** mechanism and **custom transfer agent** protocol used by [Git LFS](https://git-lfs.com/). If you're familiar with [how Git LFS works under the hood](https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md), the following diagram shows where Git DRS fits in: + +```mermaid +sequenceDiagram + actor User + participant WD as Working Directory + participant Git as Git (Index) + participant LFS as LFS Server + participant DRS as DRS Server + participant Cloud as Cloud Storage + + Note over User,Cloud: Push workflow + User->>WD: git add my-file.bam + WD->>Git: clean filter (replace content with pointer) + User->>Git: git commit + User->>Git: git push + Git->>LFS: upload file content + Git->>DRS: register object (pre-push hook) + DRS->>Cloud: record access URL + + Note over User,Cloud: Clone / pull workflow + User->>Git: git clone / git fetch + Git->>WD: checkout pointer files + User->>WD: git drs init + User->>DRS: git drs remote add ... + User->>LFS: git lfs pull + LFS->>WD: smudge filter (restore full content) + + Note over User,Cloud: Query workflow + User->>DRS: git drs query + DRS-->>User: access URLs + metadata +``` + +**The workflow step by step:** + +1. **`git lfs track "*.bam"`** — Registers file patterns in `.gitattributes`. See [Git LFS track](https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-track.adoc). +2. **`git add` / `git commit`** — Standard Git operations. The clean filter replaces file content with a small pointer file. +3. **`git push`** — Git LFS uploads objects to the LFS server. Git DRS hooks automatically register each object with the configured DRS server, making it discoverable by DRS ID. +4. **`git clone` / `git lfs pull`** — Git LFS downloads objects on demand. The smudge filter restores pointer files to their full content. See [Git LFS pull](https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-pull.adoc). +5. **`git drs query `** — Look up any registered object by its DRS ID to retrieve access URLs and metadata. + +### Hooks Integration Table + +| Hook/Integration | Command | Purpose | +|------------------|---------|---------| +| **Pre-commit Hook** | `git drs precommit` | Triggered automatically before each commit
Processes all staged LFS files
Creates DRS records for new files
Only processes files that don't already exist on the DRS server
Prepares metadata for later upload during push | +| **Custom Transfer (upload)** | `git drs transfer` | Handles upload operations during `git push`
Creates indexd record on DRS server
Uploads file to Gen3-registered S3 bucket
Updates DRS object with access URLs | +| **Custom Transfer (download)** | `git drs transfer` | Handles download operations during `git lfs pull`
Retrieves file metadata from DRS server
Downloads file from configured storage
Validates checksums | + + +### Protocol Communication + +Git LFS and Git DRS communicate via JSON messages. Git LFS uses custom transfers to communicate with Git DRS, passing information through JSON protocol: + +```json +{ + "event": "init", + "operation": "upload", + "remote": "origin", + "concurrent": 3, + "concurrenttransfers": 3 +} +``` + +Response handling and logging occurs in transfer clients to avoid interfering with Git LFS stdout expectations. + +For more details, see the [Git LFS Custom Transfer Protocol](https://github.com/git-lfs/git-lfs/blob/main/docs/custom-transfers.md) documentation. + +## Configuration System + +Git DRS stores configuration in Git's local config (`.git/config`). + +**Example Configuration:** + +```bash +$ git config --list | grep drs +lfs.standalonetransferagent=drs +lfs.customtransfer.drs.args=transfer +lfs.customtransfer.drs.concurrent=true +lfs.customtransfer.drs.path=git-drs +lfs.customtransfer.drs.default-remote=calypr-public +lfs.customtransfer.drs.remote.calypr-dev.type=gen3 +lfs.customtransfer.drs.remote.calypr-dev.endpoint=https://calypr-public.ohsu.edu +lfs.customtransfer.drs.remote.calypr-dev.project=program-project +lfs.customtransfer.drs.remote.calypr-dev.bucket=my-bucket + +``` + +## Further Reading + +- [Git DRS Quick Start](quickstart.md) -- User guide for getting started +- [Troubleshooting](troubleshooting.md) -- Common issues and solutions +- [Git LFS Custom Transfer Agents](https://github.com/git-lfs/git-lfs/blob/main/docs/custom-transfers.md) -- Understanding the transfer protocol +- [Git LFS Specification](https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md) -- Pointer file format details +- [Git Hooks Documentation](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks) -- Understanding Git hooks diff --git a/docs/tools/git-drs/index.md b/docs/tools/git-drs/index.md new file mode 100644 index 0000000..a32c2ae --- /dev/null +++ b/docs/tools/git-drs/index.md @@ -0,0 +1,69 @@ +--- +title: Git-DRS +--- + +# Git-DRS + +Git-DRS is a Git extension for managing large files in a Gen3 Data Commons using the **Data Repository Service (DRS)** content-addressable storage model. It essentially serves as a Git-LFS (Large File Storage) replacement tailored for Gen3. + +It allows you to: +- Track large files in your git repository without bloating it. +- Store the actual file contents in a Gen3 data commons (indexed via DRS). +- Seamlessly synchronize files between your local environment and the commons. + +## Installation + +Ensure `git-drs` is installed and in your PATH. + +## Initialization + +Initialize a repository to use Git-DRS. This sets up the necessary hooks and configuration. + +```bash +git-drs init +``` + +## Basic Workflow + +1. **Add a file**: Track a large file with Git-DRS. + ```bash + git-drs add + ``` + This replaces the large file with a small pointer file in your working directory. + +2. **Push**: Upload the tracked files to the Gen3 Commons. + ```bash + git-drs push + ``` + +3. **Fetch**: Download file contents (resolving pointer files) from the Commons. + ```bash + git-drs fetch + ``` + +## Command Reference + +### `init` +Initializes `git-drs` in the current git repository. Recommended to run at the root of the repo. + +### `add ` +Tracks a file using Git-DRS. The file content is moved to a local cache, and replaced with a pointer file containing its hash and size. + +### `push` +Uploads the contents of tracked files to the configured Gen3 Commons. 
This usually happens automatically during `git push` if hooks are configured, but can be run manually. + +### `fetch` +Downloads the contents of tracked files from the Gen3 Commons, replacing the local pointer files with the actual data. + +### `list` +Lists the files currently tracked by Git-DRS in the project. + +### `remote` +Manage remote DRS server configurations. + +```bash +git-drs remote add +git-drs remote list +git-drs remote set +git-drs remote remove +``` diff --git a/docs/tools/git-drs/quickstart.md b/docs/tools/git-drs/quickstart.md new file mode 100644 index 0000000..91b49cb --- /dev/null +++ b/docs/tools/git-drs/quickstart.md @@ -0,0 +1,419 @@ +# Git DRS — Quick Start + +Git DRS extends [Git LFS](https://git-lfs.com/) to register and retrieve large data files from DRS-enabled platforms while keeping the familiar Git workflow. Use **Git LFS** for file tracking, fetching, and local cache management. Use **Git DRS** to configure the DRS server connection and manage cloud-backed object references for your repository. + +!!! note "Relationship to Git LFS" + `git-drs` is built *on top of* Git LFS. It uses the same [clean and smudge filter](https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes) architecture, the same `.gitattributes` tracking patterns, and a compatible [pointer file format](https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md). If you already know [`git lfs track`](https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-track.adoc) and [`git lfs pull`](https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-pull.adoc), the `git drs` equivalents will feel natural. + +## Prerequisites + +Before installing Git DRS, you need **Git** and **Git LFS** installed and configured on your system. + +### Install Git + +Visit [https://git-scm.com](https://git-scm.com) to download and install Git for your operating system. + +### Install Git LFS + +=== "macOS" + **Install using Homebrew** + ```bash + brew install git-lfs + ``` + +=== "Linux" + **Install via Package Manager** + + === "Debian/Ubuntu" + ```bash + sudo apt-get install git-lfs + ``` + + === "RHEL/CentOS" + ```bash + sudo yum install git-lfs + ``` + + === "Fedora" + ```bash + sudo dnf install git-lfs + ``` + +=== "Windows" + **Download and Run Installer** + + Download the latest [Git LFS Windows installer](https://github.com/git-lfs/git-lfs/releases/latest) and follow the setup instructions. + +**Initialize Git LFS** + +Run the following command in your terminal to complete the setup: + +```bash +git lfs install --skip-smudge +``` + +!!! tip + The `--skip-smudge` option prevents automatic downloading of all LFS files during clone/checkout, giving you control over which files to download. + +For more details, see [Getting Started with Git LFS](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage) on GitHub Docs. + +## Install Git DRS + +Use the project installer after Git LFS is installed: + +```bash +/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/calypr/git-drs/refs/heads/main/install.sh)" -- $GIT_DRS_VERSION +``` + +### Update PATH + +Ensure git-drs is on your path: + +```bash +echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bash_profile +source ~/.bash_profile +``` + +## Download Gen3 API Credentials + +To use Git DRS, you need to configure it with API credentials downloaded from the [Profile page](https://calypr-public.ohsu.edu/Profile). + +![Gen3 Profile page](../../images/profile.png) + +1. 
Log into the Gen3 data commons at [https://calypr-public.ohsu.edu/](https://calypr-public.ohsu.edu/) +2. Navigate to your Profile page +3. Click "Create API Key" + +![Gen3 API Key](../../images/api-key.png) + +4. Download the JSON credentials file + +![Gen3 Credentials](../../images/credentials.png) + +5. Save it in a secure location (e.g., `~/.gen3/credentials.json`) + +!!! warning "Credential Expiration" + API credentials expire after 30 days. You'll need to download new credentials and refresh your Git DRS configuration regularly. + +## New Repository Setup + +If you're creating a new project or setting up a repository for the first time: + +### 1. Clone or Create Repository + +```bash +git clone https://github.com/your-org/your-data-repo.git +cd your-data-repo +``` + +Or create a new repository: + +```bash +mkdir MyNewCalyprProject +cd MyNewCalyprProject +git init +``` + +### 2. Initialize Git DRS + +```bash +git drs init +``` + +This configures Git hooks and prepares the repository for DRS-backed files — similar to running [`git lfs install`](https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-install.adoc) at the repo level. + +### 3. Get Project Details + +Contact your data coordinator at `support@calypr.org` for: + +- DRS server URL (e.g., `https://calypr-public.ohsu.edu`) +- Project ID (format: `-`) +- Bucket name + +### 4. Add Remote Configuration + +```bash +git drs remote add gen3 production \ + --cred /path/to/credentials.json \ + --url https://calypr-public.ohsu.edu \ + --project my-project \ + --bucket my-bucket +``` + +!!! note + Since this is your first remote, it automatically becomes the default. No need to run `git drs remote set`. + +### 5. Verify Configuration + +```bash +git drs remote list +``` + +Output: +``` +* production gen3 https://calypr-public.ohsu.edu +``` + +The `*` indicates this is the default remote. + +### Directory Structure + +An initialized project will look something like this: + +``` +/ +├── .gitattributes +├── .gitignore +├── META/ +│ ├── ResearchStudy.ndjson +│ ├── DocumentReference.ndjson +│ └── .ndjson +├── data/ +│ ├── file1.bam +│ └── file2.fastq.gz +``` + +## Track, Add, Commit, and Push + +### Track Large Files with Git LFS + +Use Git LFS to select which files should be stored as LFS objects. Git DRS works with the tracking patterns you configure via Git LFS: + +```bash +git lfs track "*.bam" +git add .gitattributes +git commit -m "Track BAM files with Git LFS" +``` + +For more details, see the [Git LFS tracking documentation](https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-track.adoc). + +### Add, Commit, and Push Data + +Once files are tracked with Git LFS, use standard Git commands to add and commit. During `git push`, Git LFS uploads large objects to the LFS server while **Git DRS automatically registers them with the configured DRS server** via its pre-push hook. + +```bash +# Add your file +git add myfile.bam + +# Verify LFS is tracking it +git lfs ls-files + +# Commit and push +git commit -m "Add data file" +git push +``` + +!!! note "What Happens Behind the Scenes" + The `git push` triggers Git LFS transfer hooks. Git DRS intercepts this flow to register each LFS object with your DRS server (e.g., gen3/indexd), making the file discoverable via DRS IDs. You don't need to run any extra commands. The process: + + 1. Git DRS creates DRS records for each tracked file + 2. Files are uploaded to the configured S3 bucket + 3. DRS URIs are registered in the Gen3 system + 4. 
Pointer files are committed to the repository + +For background on the Git LFS transfer flow, see the [Git LFS overview](https://git-lfs.com/) and the [Git LFS push documentation](https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-push.adoc). + +### Download Files + +Use Git LFS to download files on demand: + +```bash +# Download all files +git lfs pull + +# Download specific pattern +git lfs pull -I "*.bam" + +# Download specific directory +git lfs pull -I "data/**" +``` + +Refer to the [Git LFS pull documentation](https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-pull.adoc) for filters and options. + +### Check Status and Tracked Files + +To see which files are tracked and their status, rely on Git LFS tooling: + +```bash +git lfs ls-files +``` + +The [Git LFS ls-files documentation](https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-ls-files.adoc) explains the available flags and output format. + +## Clone an Existing Repository + +When you clone a repository that already uses Git DRS, the repo will contain small **pointer files** instead of full file content. You need to install Git DRS, initialize it in the clone, configure the DRS remote, and then pull file content. + +### Step 1 — Clone the Repository + +Clone as you normally would. Git LFS pointer files are checked out automatically, but large file content is **not** downloaded yet. + +```bash +git clone https://github.com/your-org/your-data-repo.git +cd your-data-repo +``` + +!!! tip "Skip LFS Downloads During Clone" + If you want to skip downloading *any* LFS content during clone (useful for large repos), use the `GIT_LFS_SKIP_SMUDGE` environment variable: + + ```bash + GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/your-org/your-data-repo.git + ``` + + See [`git lfs install --skip-smudge`](https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-install.adoc) for details. + +### Step 2 — Initialize Git DRS + +Run `git drs init` inside the cloned repo to configure the DRS hooks and filters: + +```bash +git drs init +``` + +### Step 3 — Configure the DRS Remote + +Set up the DRS server connection. Your team or project documentation should provide the server URL, credentials, project, and bucket: + +```bash +git drs remote add gen3 production \ + --cred /path/to/credentials.json \ + --url https://calypr-public.ohsu.edu \ + --project my-project \ + --bucket my-bucket +``` + +!!! note + This step is required even if the original repository author already configured a DRS remote — remote configurations are local to each clone and are not committed to Git. + +### Step 4 — Pull File Content + +Download the actual file content using Git LFS: + +```bash +# Pull all LFS-tracked files +git lfs pull + +# Or pull specific files by pattern +git lfs pull -I "*.bam" +``` + +Refer to the [Git LFS pull documentation](https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-pull.adoc) for filters and options. + +### Step 5 — Verify + +Confirm that pointer files have been replaced with full content and that DRS-tracked files are recognized: + +```bash +git lfs ls-files +``` + +A `*` next to a file indicates its content is present locally. A `-` means only the pointer is checked out. 
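+For example, the output might look like this (object IDs are illustrative; the file names match the directory structure shown earlier):
+
+```
+4665a5ea42 * data/file1.bam
+8d2c97e1f0 - data/file2.fastq.gz
+```
+
+Here `data/file1.bam` is present locally, while `data/file2.fastq.gz` is still a pointer and can be fetched with `git lfs pull -I "data/file2.fastq.gz"`.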
+ +### Quick Reference + +```bash +# Full clone workflow — copy and paste +git clone https://github.com/your-org/your-data-repo.git +cd your-data-repo +git drs init +git drs remote add gen3 production \ + --cred /path/to/credentials.json \ + --url https://calypr-public.ohsu.edu \ + --project my-project \ + --bucket my-bucket +git lfs pull +git lfs ls-files +``` + +## Managing Remotes + +### Add Multiple Remotes + +You can configure multiple DRS remotes for working with development, staging, and production servers: + +```bash +# Add staging remote +git drs remote add gen3 staging \ + --cred /path/to/staging-credentials.json \ + --url https://staging.calypr.ohsu.edu \ + --project staging-project \ + --bucket staging-bucket + +# View all remotes +git drs remote list +``` + +### Switch Default Remote + +```bash +# Switch to staging for testing +git drs remote set staging + +# Switch back to production +git drs remote set production + +# Verify change +git drs remote list +``` + +### Remove a Remote + +If a remote is no longer needed, remove it by name: + +```bash +git drs remote remove staging +``` + +After removal, confirm your remaining remotes: + +```bash +git drs remote list +``` + +!!! warning + If you remove the default remote, run `git drs remote set ` to pick a new default before pushing or fetching. + +### Cross-Remote Promotion + +Transfer DRS records from one remote to another (e.g., staging to production) without re-uploading files: + +```bash +# Fetch metadata from staging +git drs fetch staging + +# Push metadata to production (no file upload since files don't exist locally) +git drs push production +``` + +This is useful when files are already in the production bucket with matching SHA256 hashes. It can also be used to re-upload files given that the files are pulled to the repo first. 
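+A minimal sketch of that re-upload path, using the remote names configured above (adjust to your own setup):
+
+```bash
+# Fetch DRS records from staging, then pull file content locally so it can be re-uploaded
+git drs fetch staging
+git lfs pull
+
+# Register the records and upload the content to production
+git drs push production
+```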
+ +## Command Quick Reference + +| Action | Command | +|--------|---------| +| **Initialize** | `git drs init` | +| **Add remote** | `git drs remote add gen3 --cred...` | +| **View remotes** | `git drs remote list` | +| **Set default** | `git drs remote set ` | +| **Remove remote** | `git drs remote remove ` | +| **Track files** | `git lfs track "pattern"` | +| **Check tracked** | `git lfs ls-files` | +| **Add files** | `git add file.ext` | +| **Commit** | `git commit -m "message"` | +| **Push** | `git push` | +| **Download** | `git lfs pull -I "pattern"` | +| **Fetch from remote** | `git drs fetch [remote-name]` | +| **Push to remote** | `git drs push [remote-name]` | +| **Query DRS object** | `git drs query ` | +| **Check version** | `git drs version` | + +## Further Reading + +- [Troubleshooting](troubleshooting.md) — Common issues and solutions +- [Developer Guide](developer-guide.md) — Architecture, command reference, and internals +- [Git LFS Official Site](https://git-lfs.com/) +- [Git LFS Man Pages](https://github.com/git-lfs/git-lfs/tree/main/docs/man) — Complete command reference +- [Git LFS Specification](https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md) — Pointer file format and protocol +- [Git LFS Custom Transfer Agents](https://github.com/git-lfs/git-lfs/blob/main/docs/custom-transfers.md) — How Git DRS hooks into the LFS transfer flow +- [GitHub Docs: About Git Large File Storage](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage) +- [Git Attributes — Clean & Smudge Filters](https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes) diff --git a/docs/tools/git-drs/remove-files.md b/docs/tools/git-drs/remove-files.md new file mode 100644 index 0000000..ddb03ec --- /dev/null +++ b/docs/tools/git-drs/remove-files.md @@ -0,0 +1,84 @@ +--- +title: Removing Files +--- + +## 🗑️ Deleting Files and Updating Metadata + +When removing data files from your project, it's crucial to also update the manifest and associated metadata to maintain consistency. + +### 1. Remove File(s) Using `git rm` + +Use the `git rm` command to delete files and automatically update the manifest and metadata: + +```bash +git rm DATA/subject-123/vcf/sample1.vcf.gz +``` + +This command performs the following actions: + +- Removes the corresponding `entry` from `MANIFEST/`. + +!!! note + It will not: + + - Delete the specified `data` file. + - Update or remove related metadata in the `META/` directory. + +### 2. Review Changes + +After removing files, check the status of your project to see the staged changes: + +```bash +git status +``` + +This will display the files marked for deletion and any updates to the manifest. + +### 3. Update Metadata + +If you need to regenerate the metadata after file deletions, use the `forge meta init` command: + +```bash +forge meta init +``` + +!!! note + This command rebuilds the `META/` directory based on the current state of the repository, ensuring that your metadata accurately reflects the existing data files. + + If you have customized the metadata, you will need to manually remove the affected DocumentReference entries before running this command to avoid conflicts or inconsistencies. + +### 4. Commit Changes + +Once you've reviewed the changes, commit them to your local repository: + +```bash +git commit -m "Removed sample1.vcf.gz and updated associated metadata" +``` + +--- + +## 🚀 Pushing Updates to the Platform + +After committing your changes, push them to the CALYPR platform. + +### 1. 
Push Changes + +Use the `git push` command (which triggers the `git-drs` transfer hooks) to upload your changes: + +```bash +git push +``` + +If you need to perform metadata registration specifically, you can use `git drs push`. + +--- + +## 📌 Best Practices + +- Always use `git rm` to delete files to ensure that the Git state is properly updated. +- Use `forge meta init` to regenerate metadata when necessary, especially after significant changes to your data files. +- Regularly review your remote repository after pushing changes to confirm successful updates. + +--- + +By following these steps, you can maintain a consistent and accurate state across your data, manifest, and metadata in your CALYPR project. diff --git a/docs/tools/git-drs/troubleshooting.md b/docs/tools/git-drs/troubleshooting.md new file mode 100644 index 0000000..d256eea --- /dev/null +++ b/docs/tools/git-drs/troubleshooting.md @@ -0,0 +1,371 @@ +# Git DRS — Troubleshooting + +Common issues and solutions when working with Git DRS. + +## When to Use Which Tool + +Understanding when to use Git, Git LFS, or Git DRS commands: + +| Tool | Commands | When to Use | +|------|----------|-------------| +| **Git DRS** | `git drs init`
`git drs remote add`<br>`git drs remote list`<br>`git drs fetch`<br>`git drs push` | Repository and remote configuration<br>Setting up a new repository<br>Adding/managing DRS remotes<br>Refreshing expired credentials<br>Cross-remote promotion |
+| **Git LFS** | `git lfs track`<br>`git lfs ls-files`<br>`git lfs pull`<br>`git lfs untrack` | File tracking and management<br>Defining which files to track<br>Downloading specific files<br>Checking file localization status |
+| **Standard Git** | `git add`<br>`git commit`<br>`git push`<br>`git pull` | Version control operations<br>Normal development workflow<br>
Git DRS runs automatically in background | + +## Authentication Errors + +### Error: `Upload error: 403 Forbidden` or `401 Unauthorized` + +**Cause**: Expired or invalid credentials + +**Solution**: + +```bash +# Download new credentials from your data commons +# Then refresh them by re-adding the remote +git drs remote add gen3 production \ + --cred /path/to/new-credentials.json \ + --url https://calypr-public.ohsu.edu \ + --project my-project \ + --bucket my-bucket +``` + +**Prevention**: + +- Credentials expire after 30 days +- Set a reminder to refresh them regularly + +### Error: `Upload error: 503 Service Unavailable` + +**Cause**: DRS server is temporarily unavailable or credentials expired + +**Solutions**: + +1. Wait and retry the operation +2. Refresh credentials: + ```bash + git drs remote add gen3 production \ + --cred /path/to/credentials.json \ + --url https://calypr-public.ohsu.edu \ + --project my-project \ + --bucket my-bucket + ``` +3. If persistent, download new credentials from the data commons + +## Network Errors + +### Error: `net/http: TLS handshake timeout` + +**Cause**: Network connectivity issues + +**Solution**: + +- Simply retry the command +- These are usually temporary network issues + +### Error: Git push timeout during large file uploads + +**Cause**: Long-running operations timing out + +**Solution**: Add to `~/.ssh/config`: + +``` +Host github.com + TCPKeepAlive yes + ServerAliveInterval 30 +``` + +## File Tracking Issues + +### Files Not Being Tracked by LFS + +**Symptoms**: + +- Large files committed directly to Git +- `git lfs ls-files` doesn't show your files + +**Solution**: + +```bash +# Check what's currently tracked +git lfs track + +# Track your file type +git lfs track "*.bam" +git add .gitattributes + +# Remove from Git and re-add +git rm --cached large-file.bam +git add large-file.bam +git commit -m "Track large file with LFS" +``` + +### Error: `[404] Object does not exist on the server` + +**Symptoms**: + +- After clone, git pull fails + +**Solution**: + +```bash +# Confirm repo has complete configuration +git drs remote list + +# Initialize your git drs project +git drs init + +# Add remote configuration +git drs remote add gen3 production \ + --cred /path/to/credentials.json \ + --url https://calypr-public.ohsu.edu \ + --project my-project \ + --bucket my-bucket + +# Attempt git pull again +git lfs pull -I path/to/file +``` + +### Files Won't Download + +**Cause**: Files may not have been properly uploaded or DRS records missing + +**Solution**: + +```bash +# Check repository status +git drs remote list + +# Try pulling with verbose output +git lfs pull -I "problematic-file*" --verbose + +# Check logs +cat .git/drs/*.log +``` + +## Configuration Issues + +### Empty or Incomplete Configuration + +**Error**: `git drs remote list` shows empty or incomplete configuration + +**Cause**: Repository not properly initialized or no remotes configured + +**Solution**: + +```bash +# Initialize repository if needed +git drs init + +# Add Gen3 remote +git drs remote add gen3 production \ + --cred /path/to/credentials.json \ + --url https://calypr-public.ohsu.edu \ + --project my-project \ + --bucket my-bucket + +# Verify configuration +git drs remote list +``` + +### Configuration Exists but Commands Fail + +**Cause**: Mismatched configuration between global and local settings, or expired credentials + +**Solution**: + +```bash +# Check configuration +git drs remote list + +# Refresh credentials by re-adding the remote +git drs remote add gen3 
production \ + --cred /path/to/new-credentials.json \ + --url https://calypr-public.ohsu.edu \ + --project my-project \ + --bucket my-bucket +``` + +## Remote Configuration Issues + +### Error: `no default remote configured` + +**Cause**: Repository initialized but no remotes added yet + +**Solution**: + +```bash +# Add your first remote (automatically becomes default) +git drs remote add gen3 production \ + --cred /path/to/credentials.json \ + --url https://calypr-public.ohsu.edu \ + --project my-project \ + --bucket my-bucket +``` + +### Error: `default remote 'X' not found` + +**Cause**: Default remote was deleted or configuration is corrupted + +**Solution**: + +```bash +# List available remotes +git drs remote list + +# Set a different remote as default +git drs remote set staging + +# Or add a new remote +git drs remote add gen3 production \ + --cred /path/to/credentials.json \ + --url https://calypr-public.ohsu.edu \ + --project my-project \ + --bucket my-bucket +``` + +### Commands Using Wrong Remote + +**Cause**: Default remote is not the one you want to use + +**Solution**: + +```bash +# Check current default +git drs remote list + +# Option 1: Change default remote +git drs remote set production + +# Option 2: Specify remote for single command +git drs push staging +git drs fetch production +``` + +## Undoing Changes + +### Untrack LFS Files + +If you accidentally tracked the wrong files: + +```bash +# See current tracking +git lfs track + +# Remove incorrect pattern +git lfs untrack "wrong-dir/**" + +# Add correct pattern +git lfs track "correct-dir/**" + +# Stage the changes +git add .gitattributes +git commit -m "Fix LFS tracking patterns" +``` + +### Undo Git Add + +Remove files from staging area: + +```bash +# Check what's staged +git status + +# Unstage specific files +git restore --staged file1.bam file2.bam + +# Unstage all files +git restore --staged . +``` + +### Undo Last Commit + +To retry a commit with different files: + +```bash +# Undo last commit, keep files in working directory +git reset --soft HEAD~1 + +# Or undo and unstage files +git reset HEAD~1 + +# Or completely undo commit and changes (BE CAREFUL!) +git reset --hard HEAD~1 +``` + +### Remove Files from LFS History + +If you committed large files directly to Git by mistake: + +```bash +# Remove from Git history (use carefully!) +git filter-branch --tree-filter 'rm -f large-file.dat' HEAD + +# Then track properly with LFS +git lfs track "*.dat" +git add .gitattributes +git add large-file.dat +git commit -m "Track large file with LFS" +``` + +## Diagnostic Commands + +### Check System Status + +```bash +# Git DRS version and help +git-drs version +git-drs --help + +# Configuration +git drs remote list + +# Repository status +git status +git lfs ls-files +``` + +### Test Connectivity + +```bash +# Test basic Git operations +git lfs pull --dry-run + +# Test DRS configuration +git drs remote list +``` + +### Log Analysis + +When reporting issues, include: + +```bash +# System information +git-drs version +git lfs version +git --version + +# Configuration +git drs remote list +``` + +## Prevention Best Practices + +1. **Refresh credentials regularly** -- Credentials expire after 30 days. Set a calendar reminder to download and configure new credentials before they expire. + +2. **Test in small batches** -- Don't commit hundreds of files at once. Start with a few files to ensure your configuration works correctly. + +3. 
**Verify tracking** -- Always check `git lfs ls-files` after adding files to ensure they're being tracked by LFS. + +4. **Use .gitignore** -- Prevent accidental commits of temporary files, build artifacts, and other files that shouldn't be in the repository. + +5. **Monitor repository size** -- Keep an eye on `.git` directory size. If it grows unexpectedly, you may have committed large files directly to Git instead of through LFS. + +## Getting Help + +For issues not covered in this guide: + +- Check the [Git DRS Quick Start](quickstart.md) for setup instructions +- Review the [Developer Guide](developer-guide.md) for advanced usage +- Consult the [Git LFS FAQ](https://github.com/git-lfs/git-lfs/wiki/FAQ) +- See [GitHub's Git LFS documentation](https://docs.github.com/en/repositories/working-with-files/managing-large-files) diff --git a/docs/tools/grip/clients.md b/docs/tools/grip/clients.md new file mode 100644 index 0000000..986755e --- /dev/null +++ b/docs/tools/grip/clients.md @@ -0,0 +1,141 @@ +--- +title: Client Library +menu: + main: + identifier: clients + weight: 25 +--- + + +# Getting Started + +GRIP has an API for making graph queries using structured data. Queries are defined using a series of step [operations](./queries/index.md). + +## Install the Python Client + +Available on [PyPI](https://pypi.org/project/gripql/). + +``` +pip install gripql +``` + +Or install the latest development version: + +``` +pip install "git+https://github.com/bmeg/grip.git#subdirectory=gripql/python" +``` + + +## Using the Python Client + +Let's go through the features currently supported in the python client. + +First, import the client and create a connection to an GRIP server: + +```python +import gripql +G = gripql.Connection("https://bmeg.io").graph("bmeg") +``` + +Some GRIP servers may require authorizaiton to access its API endpoints. The client can be configured to pass +authorization headers in its requests. + +```python +import gripql + +# Basic Auth Header - {'Authorization': 'Basic dGVzdDpwYXNzd29yZA=='} +G = gripql.Connection("https://bmeg.io", user="test", password="password").graph("bmeg") +# + +# Bearer Token - {'Authorization': 'Bearer iamnotarealtoken'} +G = gripql.Connection("https://bmeg.io", token="iamnotarealtoken").graph("bmeg") + +# OAuth2 / Custom - {"OauthEmail": "fake.user@gmail.com", "OauthAccessToken": "iamnotarealtoken", "OauthExpires": 1551985931} +G = gripql.Connection("https://bmeg.io", credential_file="~/.grip_token.json").graph("bmeg") +``` + +Now that we have a connection to a graph instance, we can use this to make all of our queries. + +One of the first things you probably want to do is find some vertex out of all of the vertexes available in the system. In order to do this, we need to know something about the vertex we are looking for. To start, let's see if we can find a specific gene: + +```python +result = G.V().hasLabel("Gene").has(gripql.eq("symbol", "TP53")).execute() +print(result) +``` + +A couple things about this first and simplest query. We start with `O`, our grip client instance connected to the "bmeg" graph, and create a new query with ``. This query is now being constructed. You can chain along as many operations as you want, and nothing will actually get sent to the server until you print the results. 
+ +Once we make this query, we get a result: + +```python +[ + { + u'_id': u'ENSG00000141510', + u'_label': u'Gene' + u'end': 7687550, + u'description': u'tumor protein p53 [Source:HGNC Symbol%3BAcc:HGNC:11998]', + u'symbol': u'TP53', + u'start': 7661779, + u'seqId': u'17', + u'strand': u'-', + u'id': u'ENSG00000141510', + u'chromosome': u'17' + } +] +``` + +This represents the vertex we queried for above. All vertexes in the system will have a similar structure, basically: + +* _\_id_: This represents the global identifier for this vertex. In order to draw edges between different vertexes from different data sets we need an identifier that can be constructed from available data. Often, the `_id` will be the field that you query on as a starting point for a traversal. +* _\_label_: The label represents the type of the vertex. All vertexes with a given label will share many property keys and edge labels, and form a logical group within the system. + +The data on a query result can be accessed as properties on the result object; for example `result[0].data.symbol` would return: + +```python +u'TP53' +``` + +You can also do a `has` query with a list of items using `gripql.within([...])` (other conditions exist, see the `Conditions` section below): + +```python +result = G.V().hasLabel("Gene").has(gripql.within("symbol", ["TP53", "BRCA1"])).render({"_id": "_id", "symbol":"symbol"}).execute() +print(result) +``` + +This returns both Gene vertexes: + +``` +[ + {u'symbol': u'TP53', u'_id': u'ENSG00000141510'}, + {u'symbol': u'BRCA1', u'_id': u'ENSG00000012048'} +] +``` + +Once you are on a vertex, you can travel through that vertex's edges to find the vertexes it is connected to. Sometimes you don't even need to go all the way to the next vertex, the information on the edge between them may be sufficient. + +Edges in the graph are directional, so there are both incoming and outgoing edges from each vertex, leading to other vertexes in the graph. Edges also have a _label_, which distinguishes the kind of connections different vertexes can have with one another. + +Starting with gene TP53, and see what kind of other vertexes it is connected to. + +```python +result = G.V().hasLabel("Gene").has(gripql.eq("symbol", "TP53")).in_("TranscriptFor")render({"id": "_id", "label":"_label"}).execute() +print(result) +``` + +Here we have introduced a couple of new steps. The first is `.in_()`. This starts from wherever you are in the graph at the moment and travels out along all the incoming edges. +Additionally, we have provided `TranscriptFor` as an argument to `.in_()`. This limits the returned vertices to only those connected to the `Gene` verticies by edges labeled `TranscriptFor`. + + +``` +[ + {u'_label': u'Transcript', u'_id': u'ENST00000413465'}, + {u'_label': u'Transcript', u'_id': u'ENST00000604348'}, + ... +] +``` + +View a list of all available query operations [here](./queries/index.md). + +### Using the command line + +Grip command line syntax is defined at gripql/javascript/gripql.js diff --git a/docs/tools/grip/commands/create.md b/docs/tools/grip/commands/create.md new file mode 100644 index 0000000..bde4aa8 --- /dev/null +++ b/docs/tools/grip/commands/create.md @@ -0,0 +1,27 @@ + +--- +title: create + +menu: + main: + parent: commands + weight: 2 +--- + +# `create` + +## Usage + +```bash +gripql-cli create --host +``` + +- ``: The name of the graph to be created (required). +- `--host `: The URL of the GripQL server (default is "localhost:8202"). 
+ +## Example + +```bash +gripql-cli create my_new_graph --host myserver.com:8202 +``` + diff --git a/docs/tools/grip/commands/delete.md b/docs/tools/grip/commands/delete.md new file mode 100644 index 0000000..e4deeab --- /dev/null +++ b/docs/tools/grip/commands/delete.md @@ -0,0 +1,42 @@ +--- +title: delete +menu: + main: + parent: commands + weight: 3 +--- + +# `delete` Command + +## Usage + +```bash +gripql-cli delete --host --file --edges --vertices +``` + +### Options + +- ``: Name of the graph (required) +- `--host `: GripQL server URL (default: "localhost:8202") +- `--file `: Path to a JSON file containing data to delete +- `--edges `: Comma-separated list of edge IDs to delete (ignored if JSON file is provided) +- `--vertices `: Comma-separated list of vertex IDs to delete (ignored if JSON file is provided) + +## Example + +```bash +gripql-cli delete my_graph --host myserver.com:8202 --edges edge1,edge2 --vertices vertex3,vertex4 +``` + +## JSON File Format + +JSON file format for data to be deleted: + +```json +{ + "graph": "graph_name", + "edges": ["list of edge ids"], + "vertices": ["list of vertex ids"] +} +``` + diff --git a/docs/tools/grip/commands/drop.md b/docs/tools/grip/commands/drop.md new file mode 100644 index 0000000..c43aa0b --- /dev/null +++ b/docs/tools/grip/commands/drop.md @@ -0,0 +1,14 @@ +--- +title: drop + +menu: + main: + parent: commands + weight: 4 +--- + +``` +grip drop +``` + +Deletes a graph. diff --git a/docs/tools/grip/commands/er.md b/docs/tools/grip/commands/er.md new file mode 100644 index 0000000..28d34ee --- /dev/null +++ b/docs/tools/grip/commands/er.md @@ -0,0 +1,49 @@ +--- +title: er + +menu: + main: + parent: commands + weight: 6 +--- + +``` +grip er +``` + +The *External Resource* system allows GRIP to plug into existing data systems and +integrate them into queriable graphs. The `grip er` sub command acts as a client +to the external resource plugin proxies, issues command and displays the results. +This is often useful for debugging external resources before making them part of +an actual graph. + + +List collections provided by external resource +``` +grip er list +``` + +Get info about a collection +``` +grip er info +``` + +List ids from a collection +``` +grip er ids +``` + +List rows from a collection +``` +grip er rows +``` + +List rows with field match +``` +grip get +``` + +List rows with field match +``` +grip er query +``` diff --git a/docs/tools/grip/commands/list.md b/docs/tools/grip/commands/list.md new file mode 100644 index 0000000..f018b6a --- /dev/null +++ b/docs/tools/grip/commands/list.md @@ -0,0 +1,40 @@ +--- +title: list + +menu: + main: + parent: commands + weight: 3 +--- + +The `list tables` command is used to display all available tables in the grip server. Each table is represented by its source, name, fields, and link map. Here's a breakdown of how to use this command: + +- **Usage:** `gripql list tables` +- **Short Description:** List all available tables in the grip server. +- **Long Description:** This command connects to the grip server and retrieves information about all available tables. It then prints each table's source, name, fields, and link map to the console. +- **Arguments:** None +- **Flags:** + - `--host`: The URL of the grip server (default: "localhost:8202") + +## `gripql list graphs` Command Documentation + +The `list graphs` command is used to display all available graphs in the grip server. 
Here's a breakdown of how to use this command: + +- **Usage:** `gripql list graphs` +- **Short Description:** List all available graphs in the grip server. +- **Long Description:** This command connects to the grip server and retrieves information about all available graphs. It then prints each graph's name to the console. +- **Arguments:** None +- **Flags:** + - `--host`: The URL of the grip server (default: "localhost:8202") + +## `gripql list labels` Command Documentation + +The `list labels` command is used to display all available vertex and edge labels in a specific graph. Here's a breakdown of how to use this command: + +- **Usage:** `gripql list labels ` +- **Short Description:** List the vertex and edge labels in a specific graph. +- **Long Description:** This command takes one argument, the name of the graph, and connects to the grip server. It retrieves information about all available vertex and edge labels in that graph and prints them to the console in JSON format. +- **Arguments:** + - ``: The name of the graph to list labels for. +- **Flags:** + - `--host`: The URL of the grip server (default: "localhost:8202") \ No newline at end of file diff --git a/docs/tools/grip/commands/mongoload.md b/docs/tools/grip/commands/mongoload.md new file mode 100644 index 0000000..e9372d7 --- /dev/null +++ b/docs/tools/grip/commands/mongoload.md @@ -0,0 +1,12 @@ +--- +title: mongoload + +menu: + main: + parent: commands + weight: 4 +--- + +``` +grip mongoload +``` diff --git a/docs/tools/grip/commands/query.md b/docs/tools/grip/commands/query.md new file mode 100644 index 0000000..bb4f7ec --- /dev/null +++ b/docs/tools/grip/commands/query.md @@ -0,0 +1,20 @@ +--- +title: query + +menu: + main: + parent: commands + weight: 2 +--- + +``` +grip query +``` + +Run a query on a graph. + +Examples +```bash +grip query pc12 'V().hasLabel("Pathway").count()' +``` + diff --git a/docs/tools/grip/commands/server.md b/docs/tools/grip/commands/server.md new file mode 100644 index 0000000..fb7e82a --- /dev/null +++ b/docs/tools/grip/commands/server.md @@ -0,0 +1,53 @@ +--- +title: server +menu: + main: + parent: commands + weight: 1 +--- + +# `server` +The server command starts up a graph server and waits for incoming requests. + +## Default Configuration +If invoked with no arguments or config files, GRIP will start up in embedded mode, using a Badger based graph driver. + +## Networking +By default the GRIP server operates on 2 ports, `8201` is the HTTP based interface. Port `8202` is a GRPC based interface. Python, R and Javascript clients are designed to connect to the HTTP interface on `8201`. The `grip` command will often use port `8202` in order to complete operations. For example if you call `grip list graphs` it will contact port `8202`, rather then using the HTTP port. This means that if you are working with a server that is behind a firewall, and only the HTTP port is available, then the grip command line program will not be able to issue commands, even if the server is visible to client libraries. + +## CLI Usage +The `server` command can take several flags for configuration: +- `--config` or `-c` - Specifies a YAML config file with server settings. This overwrites all other settings. Defaults to "" (empty string). +- `--http-port` - Sets the port used by the HTTP interface. Defaults to "8201". +- `--rpc-port` - Sets the port used by the GRPC interface. Defaults to "8202". +- `--read-only` - Start server in read-only mode. Defaults to false. 
+- `--log-level` or `--log-format` - Set logging level and format, respectively. Defaults are "info" for log level and "text" for format. +- `--log-requests` - Log all requests. Defaults to false. +- `--verbose` - Sets the log level to debug if true. +- `--plugins` or `-p` - Specifies a directory with GRIPPER plugins to load. If not specified, no plugins will be loaded by default. +- `--driver` or `-d` - Specifies the default driver for graph storage. Defaults to "badger". Other possible options are: "pebble", "mongo", "grids", and "sqlite". +- `--endpoint` or `-w` - Load a web endpoint plugin. Use multiple times to load multiple plugins. The format is key=value where key is the plugin name and value is the configuration string for the plugin. +- `--endpoint-config` or `-l` - Configure a loaded web endpoint plugin. Use multiple times to configure multiple plugins. The format is key=value where key is in the form 'pluginname:key' and value is the configuration value for that key. +- `--er` or `-e` - Set GRIPPER source addresses. This flag can be used multiple times to specify multiple addresses. Defaults to an empty map. + +## Examples + +```bash +# Load server with a specific config file +grip server --config /path/to/your_config.yaml + +# Set the HTTP port to 9001 +grip server --http-port 9001 + +# Start in read-only mode +grip server --read-only + +# Enable verbose logging (sets log level to debug) +grip server --verbose + +# Load a web endpoint plugin named 'foo' with configuration string 'config=value' +grip server --endpoint foo=config=value + +# Configure the loaded 'foo' web endpoint plugin, setting its key 'key1' to value 'val1' +grip server --endpoint-config foo:key1=val1 +``` diff --git a/docs/tools/grip/databases.md b/docs/tools/grip/databases.md new file mode 100644 index 0000000..3ad6b01 --- /dev/null +++ b/docs/tools/grip/databases.md @@ -0,0 +1,120 @@ +--- +title: Database Configuration +menu: + main: + identifier: Databases + weight: 20 +--- + + +# Embedded Key Value Stores + +GRIP supports storing vertices and edges in a variety of key-value stores including: + + * [Pebble](https://github.com/cockroachdb/pebble) + * [Badger](https://github.com/dgraph-io/badger) + * [BoltDB](https://github.com/boltdb/bolt) + * [LevelDB](https://github.com/syndtr/goleveldb) + +Config: + +```yaml +Default: kv + +Driver: + kv: + Badger: grip.db +``` + +---- + +# MongoDB + +GRIP supports storing vertices and edges in [MongoDB][mongo]. + +Config: + +```yaml +Default: mongo + +Drivers: + mongo: + MongoDB: + URL: "mongodb://localhost:27000" + DBName: "gripdb" + Username: "" + Password: "" + UseCorePipeline: False + BatchSize: 0 +``` + +[mongo]: https://www.mongodb.com/ + +`UseCorePipeline` - Default is to use Mongo pipeline API to do graph traversals. +By enabling `UseCorePipeline`, GRIP will do the traversal logic itself, only using +Mongo for graph storage. + +`BatchSize` - For core engine operations, GRIP dispatches element lookups in +batches to minimize query overhead. If missing from config file (which defaults to 0) +the engine will default to 1000. + +---- + + +# GRIDS + +This is an indevelopment high performance graph storage system. + +Config: + +```yaml +Default: db + +Drivers: + db: + Grids: grip-grids.db + +``` + +---- + +# PostgreSQL + +GRIP supports storing vertices and edges in [PostgreSQL][psql]. 
+ +Config: + +```yaml +Default: psql + +Drivers: + psql: + PSQL: + Host: localhost + Port: 15432 + User: "" + Password: "" + DBName: "grip" + SSLMode: disable +``` + +[psql]: https://www.postgresql.org/ + +--- + +# SQLite + +GRIP supports storing vertices and edges in [SQLite] + +Config: + +```yaml +Default: sqlite + +Drivers: + sqlite: + Sqlite: + DBName: tester/sqliteDB +``` + +[psql]: https://sqlite.org/ diff --git a/docs/tools/grip/developer/architecture.d2 b/docs/tools/grip/developer/architecture.d2 new file mode 100644 index 0000000..ecbd726 --- /dev/null +++ b/docs/tools/grip/developer/architecture.d2 @@ -0,0 +1,90 @@ + + +gripql-python: "gripql/python" { + text: |md +# gripql + +Python library +| +} + +gripql-python -> gripql.http + +grip-client : "cmd/" { + graph { + create + drop + stream + list + schema + } + + data { + kvload + load + dump + mongoload + query + delete + } + + config { + mapping + plugin + info + } + + jobs { + job + } +} + +grip-client -> gripql.grpc + +gripql : "gripql/" { + + text: |md +Protobuf defined code +| + grpc + grpc-gateway + + http -> grpc-gateway + grpc-gateway -> grpc : protobuf via network + + http -> grpc-dgw +} + + +gripql.grpc -> server +gripql.grpc-dgw -> server + +server : "server/" { + +} + +server -> pipeline + +pipeline { + gripql-parser + compiler +} + +gdbi { + mongo + mongo-core + pebble +} + +pipeline.compiler -> gdbi + +server -> jobs + +jobs { + store + search + drivers : { + opensearch + flat file + } +} \ No newline at end of file diff --git a/docs/tools/grip/graphql/graph_schemas.md b/docs/tools/grip/graphql/graph_schemas.md new file mode 100644 index 0000000..f68bff4 --- /dev/null +++ b/docs/tools/grip/graphql/graph_schemas.md @@ -0,0 +1,37 @@ +--- +title: Graph Schemas +menu: + main: + parent: graphql + weight: 30 +--- + +# Graph Schemas + +Most GRIP based graphs are not required to have a strict schema. However, GraphQL requires +a graph schema as part of it's API. To utilize the GraphQL endpoint, there must be a +Graph Schema provided to be used by the GRIP engine to determine how to render a GraphQL endpoint. +Graph schemas are themselves an instance of a graph. As such, they can be traversed like any other graph. +The schemas are automatically added to the database following the naming pattern. `{graph-name}__schema__` + +## Get the Schema of a Graph + +The schema of a graph can be accessed via a GET request to `/v1/graph/{graph-name}/schema` + +Alternatively, you can use the grip CLI. `grip schema get {graph-name}` + +## Post a graph schema + +A schema can be attached to an existing graph via a POST request to `/v1/graph/{graph-name}/schema` + +Alternatively, you can use the grip CLI. `grip schema post [graph_name] --jsonSchema {file}` + +Schemas must be loaded as a json file in JSON schema format. see [jsonschema](https://json-schema.org/) spec for more details + +## Raw bulk loading + +Once a schema is attached to a graph, raw json records can be loaded directly to grip without having to be in native grip vertex/edge format. +Schema validation is enforced when using this POST `/v1/rawJson` method. + +A grip CLI alternative is also available with `grip jsonload [ndjson_file_path] [graph_name]` +See https://github.com/bmeg/grip/blob/develop/conformance/tests/ot_bulk_raw.py for a full example using gripql python package. 
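+A minimal end-to-end sketch using the CLI commands above, for a graph named `example` (file names are hypothetical):
+
+```bash
+# Attach a JSON Schema to the graph "example"
+grip schema post example --jsonSchema schema.json
+
+# Bulk-load raw NDJSON records; they are validated against the attached schema
+grip jsonload records.ndjson example
+
+# Confirm the schema is attached
+grip schema get example
+```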
diff --git a/docs/tools/grip/graphql/graphql.md b/docs/tools/grip/graphql/graphql.md new file mode 100644 index 0000000..1de2f34 --- /dev/null +++ b/docs/tools/grip/graphql/graphql.md @@ -0,0 +1,29 @@ +--- +title: GraphQL +menu: + main: + parent: graphql + weight: 25 +--- + +# GraphQL + +Grip graphql tools are defined as go standard library plugins and are located at https://github.com/bmeg/grip-graphql. +A schema based approach was used for defining read plugins. + +## Json Schema + +grip also supports using jsonschema with hypermedia extensions. Given an existing graph called TEST + +``` +./grip schema post TEST --jsonSchema schema.json +``` + +This schema will attach to the TEST graph, and subsequent calls to the bulkAddRaw method with raw Json +as defined by the attached the jsonschema will load directly into grip. + +see conformance/tests/ot_bulk_raw.py for an example + +## Future work + +In the future, autogenerated json schema may be added back to grip to continue to support graphql queries. Currently there is not support for graphql in base Grip without using the plugin repos specified above. diff --git a/docs/tools/grip/gripper/graphmodel.md b/docs/tools/grip/gripper/graphmodel.md new file mode 100644 index 0000000..cd678cf --- /dev/null +++ b/docs/tools/grip/gripper/graphmodel.md @@ -0,0 +1,255 @@ +--- +title: Graph Model + +menu: + main: + parent: gripper + weight: 3 +--- + +# GRIPPER + +GRIP Plugable External Resources + +## Graph Model + +The graph model describes how GRIP will access multiple gripper servers. The mapping +of these data resources is done using a graph. The `vertices` represent how each vertex +type will be mapped, and the `edges` describe how edges will be created. The `_id` +of each vertex represents the prefix domain of all vertices that can be found in that +source. + +The `sources` referenced by the graph are provided to GRIP at run time, each named resource is a +different GRIPPER plugin that abstracts an external resource. +The `vertices` section describes how different collections +found in these sources will be turned into Vertex found in the graph. Finally, the +`edges` section describes the different kinds of rules that can be used build the +edges in the graph. + +Edges can be built from two rules `fieldToField` and `edgeTable`. In `fieldToField`, +a field value found in one vertex can be used to look up matching destination vertices +by using an indexed field found in another collection that has been mapped to a vertex. +For `edgeTable` connections, there is a single collection that represents a connection between +two other collections that have been mapped to vertices. + +## Runtime External Resource Config + +External resources are passed to GRIP as command line options. For the command line: + +``` +grip server config.yaml --er tableServer=localhost:50051 --er pfb=localhost:50052 +``` + +`tableServer` is a ER plugin that serves table data (see `gripper/test-graph`) +while `pfb` parses PFB based files (see https://github.com/bmeg/grip_pfb ) + +The `config.yaml` is + +``` +Default: badger + +Drivers: + badger: + Badger: grip-badger.db + + swapi-driver: + Gripper: + ConfigFile: ./swapi.yaml + Graph: swapi + +``` + +This runs with a default `badger` based driver, but also provides a GRIPPER based +graph from the `swapi` mapping (see example graph map below). 
+ +## Example graph map + +``` +vertices: + - _id: "Character:" + _label: Character + source: tableServer + collection: Character + + - _id: "Planet:" + _label: Planet + collection: Planet + source: tableServer + + - _id: "Film:" + _label: Film + collection: Film + source: tableServer + + - _id: "Species:" + _label: Species + source: tableServer + collection: Species + + - _id: "Starship:" + _label: Starship + source: tableServer + collection: Starship + + - _id: "Vehicle:" + _label: Vehicle + source: tableServer + collection: Vehicle + +edges: + - _id: "homeworld" + _from: "Character:" + _to: "Planet:" + _label: homeworld + fieldToField: + fromField: $.homeworld + toField: $.id + + - _id: species + _from: "Character:" + _to: "Species:" + _label: species + fieldToField: + fromField: $.species + toField: $.id + + - _id: people + _from: "Species:" + _to: "Character:" + _label: people + edgeTable: + source: tableServer + collection: speciesCharacter + fromField: $.from + toField: $.to + + - _id: residents + _from: "Planet:" + _to: "Character:" + _label: residents + edgeTable: + source: tableServer + collection: planetCharacter + fromField: $.from + toField: $.to + + - _id: filmVehicles + _from: "Film:" + _to: "Vehicle:" + _label: "vehicles" + edgeTable: + source: tableServer + collection: filmVehicles + fromField: "$.from" + toField: "$.to" + + - _id: vehicleFilms + _to: "Film:" + _from: "Vehicle:" + _label: "films" + edgeTable: + source: tableServer + collection: filmVehicles + toField: "$.from" + fromField: "$.to" + + - _id: filmStarships + _from: "Film:" + _to: "Starship:" + _label: "starships" + edgeTable: + source: tableServer + collection: filmStarships + fromField: "$.from" + toField: "$.to" + + - _id: starshipFilms + _to: "Film:" + _from: "Starship:" + _label: "films" + edgeTable: + source: tableServer + collection: filmStarships + toField: "$.from" + fromField: "$.to" + + - _id: filmPlanets + _from: "Film:" + _to: "Planet:" + _label: "planets" + edgeTable: + source: tableServer + collection: filmPlanets + fromField: "$.from" + toField: "$.to" + + - _id: planetFilms + _to: "Film:" + _from: "Planet:" + _label: "films" + edgeTable: + source: tableServer + collection: filmPlanets + toField: "$.from" + fromField: "$.to" + + - _id: filmSpecies + _from: "Film:" + _to: "Species:" + _label: "species" + edgeTable: + source: tableServer + collection: filmSpecies + fromField: "$.from" + toField: "$.to" + + - _id: speciesFilms + _to: "Film:" + _from: "Species:" + _label: "films" + edgeTable: + source: tableServer + collection: filmSpecies + toField: "$.from" + fromField: "$.to" + + - _id: filmCharacters + _from: "Film:" + _to: "Character:" + _label: characters + edgeTable: + source: tableServer + collection: filmCharacters + fromField: "$.from" + toField: "$.to" + + - _id: characterFilms + _from: "Character:" + _to: "Film:" + _label: films + edgeTable: + source: tableServer + collection: filmCharacters + toField: "$.from" + fromField: "$.to" + + - _id: characterStarships + _from: "Character:" + _to: "Starship:" + _label: "starships" + edgeTable: + source: tableServer + collection: characterStarships + fromField: "$.from" + toField: "$.to" + + - _id: starshipCharacters + _to: "Character:" + _from: "Starship:" + _label: "pilots" + edgeTable: + source: tableServer + collection: characterStarships + toField: "$.from" + fromField: "$.to" +``` diff --git a/docs/tools/grip/gripper/gripper.md b/docs/tools/grip/gripper/gripper.md new file mode 100644 index 0000000..954f583 --- /dev/null +++ 
b/docs/tools/grip/gripper/gripper.md @@ -0,0 +1,22 @@ +--- +title: Intro + +menu: + main: + parent: gripper + weight: 1 +--- + +# GRIPPER +## GRIP Plugin External Resources + +GRIP Plugin External Resources (GRIPPERs) are GRIP drivers that take external +resources and allow GRIP to access them are part of a unified graph. +To integrate new resources into the graph, you +first deploy griper proxies that plug into the external resources. They are unique +and configured to access specific resources. These provide a view into external +resources as a series of document collections. For example, an SQL gripper would +plug into an SQL server and provide the tables as a set of collections with each +every row a document. A gripper is written as a gRPC server. + +![GIPPER Architecture](../../../images/gripper_architecture.png) diff --git a/docs/tools/grip/gripper/proxy.md b/docs/tools/grip/gripper/proxy.md new file mode 100644 index 0000000..e232715 --- /dev/null +++ b/docs/tools/grip/gripper/proxy.md @@ -0,0 +1,50 @@ +--- +title: External Resource Proxies + +menu: + main: + parent: gripper + weight: 2 +--- + +# GRIPPER + +## GRIPPER proxy + +With the external resources normalized to a single data model, the graph model +describes how to connect the set of collections into a graph model. Each GRIPPER +is required to provide a GRPC interface that allows access to collections stored +in the resource. + +The required functions include: + +``` +rpc GetCollections(Empty) returns (stream Collection); +``` +`GetCollections` returns a list of all of the Collections accessible via this server. + +``` +rpc GetCollectionInfo(Collection) returns (CollectionInfo); +``` +`GetCollectionInfo` provides information, such as the list of indexed fields, in a collection. + +``` +rpc GetIDs(Collection) returns (stream RowID); +``` +`GetIDs` returns a stream of all of the IDs found in a collection. + +``` +rpc GetRows(Collection) returns (stream Row); +``` +`GetRows` returns a stream of all of the rows in a collection. + +``` +rpc GetRowsByID(stream RowRequest) returns (stream Row); +``` +`GetRowsByID` accepts a stream of row requests, each one requesting a single row +by it's id, and then returns a stream of results. + +``` +rpc GetRowsByField(FieldRequest) returns (stream Row); +``` +`GetRowsByField` searches a collection, looking for values found in an indexed field. diff --git a/docs/tools/grip/index.md b/docs/tools/grip/index.md new file mode 100644 index 0000000..b89a232 --- /dev/null +++ b/docs/tools/grip/index.md @@ -0,0 +1,42 @@ +--- +title: GRIP +--- + +GRIP (Graph Resource Integration Platform) is a powerful framework for building and managing distributed data processing systems. Key features include: + +- **Distributed Computing**: Scalable processing across multiple nodes. +- **Database Integration**: Built-in support for MongoDB, PostgreSQL, and SQL databases. +- **API Endpoints**: RESTful APIs for managing data workflows and monitoring. +- **Flexible Query Language**: GRIPQL for complex data queries and transformations. +- **Job Management**: Schedule, monitor, and manage data processing jobs in real-time. 
+ + + +``` +# Start server +$ grip server --config grip.yml + +# List all graphs +$ grip list + +# Create a graph +$ grip create example + +# Drop a graph +$ grip drop example + +# Load data into a graph +$ grip load example --edge edges.txt --vertex vertices.txt + +# Query a graph +$ grip query example 'V().hasLabel("users")' + +#Get vertex/edge counts for a graph +$ grip info example + +# Get the schema for a graph +$ grip schema get example + +# Dump vertices/edges from a graph +$ grip dump example --vertex +``` \ No newline at end of file diff --git a/docs/tools/grip/jobs_api.md b/docs/tools/grip/jobs_api.md new file mode 100644 index 0000000..7589015 --- /dev/null +++ b/docs/tools/grip/jobs_api.md @@ -0,0 +1,63 @@ +--- +title: Jobs API +menu: + main: + identifier: Jobs + weight: 40 +--- + +# Jobs API + +Not all queries return instantaneously, additionally some queries elements are used +repeatedly. The query Jobs API provides a mechanism to submit graph traversals +that will be evaluated asynchronously and can be retrieved at a later time. + + +### Submitting a job + +``` +job = G.V().hasLabel("Planet").out().submit() +``` + +### Getting job status +``` +jinfo = G.getJob(job["id"]) +``` + +Example job info: +```json +{ + "id": "job-326392951", + "graph": "test_graph_qd7rs7", + "state": "COMPLETE", + "count": "12", + "query": [{"v": []}, {"hasLabel": ["Planet"]}, {"as": "a"}, {"out": []}], + "timestamp": "2021-03-30T23:12:01-07:00" +} +``` + +### Reading job results +``` +for row in G.readJob(job["id"]): + print(row) +``` + +### Search for jobs + +Find jobs that match the prefix of the current request (example should find job from G.V().hasLabel("Planet").out()) + +``` +jobs = G.V().hasLabel("Planet").out().out().count().searchJobs() +``` + +If there are multiple jobs that match the prefix of the search, all of them will be returned. It will be a client side +job to decide which of the jobs to use as a starting point. This can either be the job with the longest matching prefix, or +the most recent job. Note, that if the underlying database has changed since the job was run, adding additional steps to the +traversal may produce inaccurate results. + +Once `job` has been selected from the returned list you can use these existing results and continue the traversal. + +``` +for res in G.resume(job["id"]).out().count(): + print(res) +``` diff --git a/docs/tools/grip/queries/aggregation.md b/docs/tools/grip/queries/aggregation.md new file mode 100644 index 0000000..0012a81 --- /dev/null +++ b/docs/tools/grip/queries/aggregation.md @@ -0,0 +1,84 @@ +--- +title: Aggregation +menu: + main: + parent: Queries + weight: 6 +--- + +# Aggregation + +These methods provide a powerful way to analyze and summarize data in your GripQL graph database. They allow you to perform various types of aggregations, including term frequency, histograms, percentiles, and more. By combining these with other traversal functions like `has`, `hasLabel`, etc., you can create complex queries that extract specific insights from your data. + +## `.aggregate([aggregations])` +Groups and summarizes data from the graph. It allows you to perform calculations on vertex or edge properties. The following aggregation types are available: + +## Aggregation Types +### `.gripql.term(name, field, size)` +Return top n terms and their counts for a field. 
+```python +G.V().hasLabel("Person").aggregate(gripql.term("top-names", "name", 10)) +``` +Counts `name` occurrences across `Person` vertices and returns the 10 most frequent `name` values. + +### `.gripql.histogram(name, field, interval)` +Return binned counts for a field. +```python +G.V().hasLabel("Person").aggregate(gripql.histogram("age-hist", "age", 5)) +``` +Creates a histogram of `age` values with bins of width 5 across `Person` vertices. + +### `.gripql.percentile(name, field, percents=[])` +Return percentiles for a field. +```python +G.V().hasLabel("Person").aggregate(gripql.percentile("age-percentiles", "age", [25, 50, 75])) +``` +Calculates the 25th, 50th, and 75th percentiles for `age` values across `Person` vertices. + +### `.gripql.field("fields", "$")` +Returns all of the fields found in the data structure. Use `$` to get a listing of all fields found at the root level of the `data` property of vertices or edges. + +--- + +## `.count()` +Returns the total number of elements in the traversal. +```python +G.V().hasLabel("Person").count() +``` +This query returns the total number of vertices with the label "Person". + +--- + +## `.distinct([fields])` +Filters the traversal to return only unique elements. If `fields` are provided, uniqueness is determined by the combination of values in those fields; otherwise, the `_id` is used. +```python +G.V().hasLabel("Person").distinct(["name", "age"]) +``` +This query returns only unique "Person" vertices, where uniqueness is determined by the combination of "name" and "age" values. + +--- + +## `.sort([fields])` +Sort the output using the field values. You can sort in ascending or descending order by providing `descending=True` as an argument to `sort()` method. +```python +G.V().hasLabel("Person").sort("age") +``` +This query sorts "Person" vertices based on their age in ascending order. + +## `.limit(n)` +Limits the number of results returned by your query. +```python +G.V().hasLabel("Person").limit(10) +``` +This query limits the results to the first 10 "Person" vertices found. + +--- + +## `.skip(n)` +Offsets the results returned by your query. +```python +G.V().hasLabel("Person").skip(5) +``` +This query skips the first 5 "Person" vertices and returns the rest. + + diff --git a/docs/tools/grip/queries/filtering.md b/docs/tools/grip/queries/filtering.md new file mode 100644 index 0000000..2baac70 --- /dev/null +++ b/docs/tools/grip/queries/filtering.md @@ -0,0 +1,151 @@ +--- +title: Filtering +menu: + main: + parent: Queries + weight: 4 +--- + +# Filtering in GripQL + +GripQL provides powerful filtering capabilities using the .has() method and various condition functions. +Here's a comprehensive guide:.has()The .has() method is used to filter elements (vertices or edges) based on specified conditions. + +Conditions are functions provided by the gripql module that define the filtering criteria. + +## Comparison Operators + +### `gripql.eq(variable, value)` +Equal to (==) + +``` +G.V().has(gripql.eq("symbol", "TP53")) +# Returns vertices where the 'symbol' property is equal to 'TP53'. +``` + +### `gripql.neq(variable, value)` +Not equal to (!=) + +``` +G.V().has(gripql.neq("symbol", "TP53")) +# Returns vertices where the 'symbol' property is not equal to 'TP53'. +``` + +### `gripql.gt(variable, value)` +Greater than (>) + +``` +G.V().has(gripql.gt("age", 45)) +# Returns vertices where the 'age' property is greater than 45. 
+``` + +### `gripql.lt(variable, value)` +Less than (<) +``` +G.V().has(gripql.lt("age", 45)) +# Returns vertices where the 'age' property is less than 45. +``` + +### `gripql.gte(variable, value)` +Greater than or equal to (>=) +``` +G.V().has(gripql.gte("age", 45)) +# Returns vertices where the 'age' property is greater than or equal to 45. +``` + +### `gripql.lte(variable, value)` +Less than or equal to (<=) + +``` +G.V().has(gripql.lte("age", 45)) +# Returns vertices where the 'age' property is less than or equal to 45. +``` + +--- + +## Range Operators + +### `gripql.inside(variable, [lower_bound, upper_bound])` +lower_bound < variable < upper_bound (exclusive) + +``` +G.V().has(gripql.inside("age", [30, 45])) +# Returns vertices where the 'age' property is greater than 30 and less than 45. +``` + +### `gripql.outside(variable, [lower_bound, upper_bound])` +variable < lower_bound OR variable > upper_bound + +``` +G.V().has(gripql.outside("age", [30, 45])) +# Returns vertices where the 'age' property is less than 30 or greater than 45. +``` + +### `gripql.between(variable, [lower_bound, upper_bound])` +lower_bound <= variable < upper_bound + +``` +G.V().has(gripql.between("age", [30, 45])) +# Returns vertices where the 'age' property is greater than or equal to 30 and less than 45. +``` + +--- + +## Set Membership Operators + +### `gripql.within(variable, values)` +variable is in values + +``` +G.V().has(gripql.within("symbol", ["TP53", "BRCA1"])) +# Returns vertices where the 'symbol' property is either 'TP53' or 'BRCA1'. +``` + +### `gripql.without(variable, values)` +variable is not in values + +``` +G.V().has(gripql.without("symbol", ["TP53", "BRCA1"])) +# Returns vertices where the 'symbol' property is neither 'TP53' nor 'BRCA1'. +``` + +--- + +## String/Array Containment + +### `gripql.contains(variable, value)` +The variable (which is typically a list/array) contains value. + +``` +G.V().has(gripql.contains("groups", "group1")) +# Returns vertices where the 'groups' property (which is a list) contains the value "group1". +# Example: {"groups": ["group1", "group2", "group3"]} would match. +``` + +--- + +## Logical Operators + +### `gripql.and_([condition1, condition2, ...])` +Logical AND; all conditions must be true. + +``` +G.V().has(gripql.and_([gripql.lte("age", 45), gripql.gte("age", 35)])) +# Returns vertices where the 'age' property is less than or equal to 45 AND greater than or equal to 35. +``` + +### `gripql.or_([condition1, condition2, ...])` +Logical OR; at least one condition must be true. + +``` +G.V().has(gripql.or_([gripql.eq("symbol", "TP53"), gripql.eq("symbol", "BRCA1")])) +# Returns vertices where the 'symbol' property is either 'TP53' OR 'BRCA1'. +``` + +### `gripql.not_(condition)` +Logical NOT; negates the condition + +``` +G.V().has(gripql.not_(gripql.eq("symbol", "TP53"))) +# Returns vertices where the 'symbol' property is NOT equal to 'TP53'. +``` diff --git a/docs/tools/grip/queries/index.md b/docs/tools/grip/queries/index.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/tools/grip/queries/iterations.md b/docs/tools/grip/queries/iterations.md new file mode 100644 index 0000000..5af26c9 --- /dev/null +++ b/docs/tools/grip/queries/iterations.md @@ -0,0 +1,61 @@ +--- +title: Iteration +menu: + main: + parent: Queries + weight: 16 +--- + +# Iteration Commands + +A common operation in graph search is the ability to iteratively repeat a search pattern. 
For example, a 'friend of a friend' search may become a 'friend of a friend of a friend' search. In the GripQL language, cycles, iterations, and conditional operations are encoded using a 'mark' and 'jump' based interface. This operation is similar to using a 'goto' statement in traditional programming languages. While more primitive than the repeat mechanisms seen in Gremlin, this pattern allows for much simpler query compilation and implementation.
+
+However, due to security concerns regarding potential denial of service attacks that could be created with the use of 'mark' and 'jump', these operations are restricted in most accounts. The server enforces this by rejecting, without executing them, any queries from unauthorized users that use these commands. In future upgrades, a proposed security feature will also allow the server to track the total number of iterations a traveler has made in a cycle and provide a hard cutoff. For example, a user could submit code with a maximum of 5 iterations.
+
+## Operation Commands
+### `.mark(name)`
+Mark a named segment in the stream processor that can receive jumps. This command is used to label sections of the query operation list that can accept travelers from the `jump` command.
+
+**Parameters:**
+- `name` (str): The name given to the marked segment.
+
+### `.jump(dest, condition, emit)`
+If the condition is true, send the traveler to the named mark. If `emit` is True, also send a copy down the processing chain. If `condition` is None, always do the jump. This command is used to move travelers from one marked segment to another based on a specified condition.
+
+**Parameters:**
+- `dest` (str): The name of the destination mark segment. Travelers are moved to this point when their position matches the `condition` parameter.
+- `condition` (_expr_ or None): An expression that determines if the traveler should jump. If it evaluates to True, the traveler jumps to the specified destination. If None, the traveler always jumps to the specified destination.
+- `emit` (bool): Determines whether a copy of the traveler is emitted down the processing chain after jumping. If False, only the original traveler is processed.
+
+### `.set(field, value)`
+Set values within the traveler's memory. These values can be used to store cycle counts. This command sets a field in the traveler's memory to a specified value.
+
+**Parameters:**
+- `field` (str): The name of the field to set.
+- `value` (_expr_): The value to set for the specified field. This can be any valid GripQL expression that resolves to a scalar value.
+
+### `.increment(field, value)`
+Increment a field by a specified value. This command increments a field in the traveler's memory by a specified amount.
+
+**Parameters:**
+- `field` (str): The name of the field to increment.
+- `value` (_expr_): The amount to increment the specified field by. This can be any valid GripQL expression that resolves to an integer value.
+
+## Example Queries
+The following examples demonstrate how to use these commands in a query:
+
+```python
+q = G.V("Character:1").set("count", 0).as_("start").mark("a").out().increment("$start.count")
+q = q.has(gripql.lt("$start.count", 2))
+q = q.jump("a", None, True)
+```
+This query starts from a vertex with the ID "Character:1". It sets a field named "count" to 0 and annotates this vertex as "start". Then it marks this position in the operation list for future reference. The `out` command moves travelers along the outgoing edges of their current positions, incrementing the "count" field each time.
If the count is less than 2, the traveler jumps back to the marked location, effectively creating a loop.
+
+```python
+q = G.V("Character:1").set("count", 0).as_("start").mark("a").out().increment("$start.count")
+q = q.has(gripql.lt("$start.count", 2))
+q = q.jump("a", None, False)
+```
+This query is similar to the previous one, but in this case, the traveler only jumps back without emitting a copy down the processing chain. The result is that only one vertex will be included in the output, even though there are multiple iterations due to the jump command.
+
+In both examples, the use of the `mark` and `jump` commands creates an iterative pattern within the query operation list, effectively creating a 'friend of a friend' search that can repeat as many times as desired. These patterns are crucial for complex graph traversals in GripQL.
diff --git a/docs/tools/grip/queries/jobs.md b/docs/tools/grip/queries/jobs.md
new file mode 100644
index 0000000..49e132f
--- /dev/null
+++ b/docs/tools/grip/queries/jobs.md
@@ -0,0 +1,26 @@
+
+
+
+## .submit()
+Post the traversal as an asynchronous job and get a job ID.
+
+Example: Submit a query to be processed in the background
+
+```python
+job_id = G.V('vertexID').hasLabel('Vertex').submit()
+print(job_id) # print job ID
+```
+---
+
+## .searchJobs()
+Find jobs that match this query and get their status and results if available.
+
+Example: Search for jobs with the specified query and print their statuses and results
+
+```python
+for result in G.V('vertexID').hasLabel('Vertex').searchJobs():
+    print(result['status']) # print job status
+    if 'results' in result:
+        print(result['results']) # print job results
+```
+---
diff --git a/docs/tools/grip/queries/jsonpath.md b/docs/tools/grip/queries/jsonpath.md
new file mode 100644
index 0000000..2f26511
--- /dev/null
+++ b/docs/tools/grip/queries/jsonpath.md
@@ -0,0 +1,84 @@
+---
+title: Referencing Fields
+menu:
+  main:
+    parent: Queries
+    weight: 2
+---
+
+# Referencing Vertex/Edge Properties
+
+Several operations (where, fields, render, etc.) reference properties of the vertices/edges during the traversal.
+GRIP uses a variation on JSONPath syntax as described in http://goessner.net/articles/ to reference fields during traversals.
+
+The following query:
+
+```
+O.V(["ENSG00000012048"]).as_("gene").out("variant")
+```
+
+Starts at the vertex `ENSG00000012048`:
+
+```json
+{
+  "_id": "ENSG00000012048",
+  "_label": "gene",
+  "symbol": {
+    "ensembl": "ENSG00000012048",
+    "hgnc": 1100,
+    "entrez": 672,
+    "hugo": "BRCA1"
+  },
+  "transcripts": ["ENST00000471181.7", "ENST00000357654.8", "ENST00000493795.5"]
+}
+```
+
+marks it as "gene", and traverses the graph to:
+
+```json
+{
+  "_id": "NM_007294.3:c.4963_4981delTGGCCTGACCCCAGAAG",
+  "_label": "variant",
+  "type": "deletion",
+  "publications": [
+    {
+      "pmid": 29480828,
+      "doi": "10.1097/MD.0000000000009380"
+    },
+    {
+      "pmid": 23666017,
+      "doi": "10.1097/IGC.0b013e31829527bd"
+    }
+  ]
+}
+```
+
+Below is a table of fields and the values they would reference in subsequent traversal operations.
+ +| jsonpath | result | +| :------------------------- | :------------------- | +| _id | "NM_007294.3:c.4963_4981delTGGCCTGACCCCAGAAG" | +| _label | "variant" | +| type | "deletion" | +| publications[0].pmid | 29480828 | +| publications[:].pmid | [29480828, 23666017] | +| publications.pmid | [29480828, 23666017] | +| $gene.symbol.hugo | "BRCA1" | +| $gene.transcripts[0] | "ENST00000471181.7" | + + +## Usage Example: + +``` +O.V(["ENSG00000012048"]).as_("gene").out("variant").render({"variant_id": "_id", "variant_type": "type", "gene_id": "$gene._id"}) +``` + +returns + +``` +{ + "variant_id": "NM_007294.3:c.4963_4981delTGGCCTGACCCCAGAAG", + "variant_type": "deletion", + "gene_id": "ENSG00000012048" +} +``` diff --git a/docs/tools/grip/queries/output.md b/docs/tools/grip/queries/output.md new file mode 100644 index 0000000..964e776 --- /dev/null +++ b/docs/tools/grip/queries/output.md @@ -0,0 +1,74 @@ +--- +title: Output Control +menu: + main: + parent: Queries + weight: 10 +--- + +--- + +# Output control + +## `.limit(count)` +Limit number of total output rows +```python +G.V().limit(5) +``` +--- +## `.skip(count)` +Start return after offset + +Example: +```python +G.V().skip(10).limit(5) + +``` +This query skips the first 10 vertices and then returns the next 5. +--- +## `.range(start, stop)` +Selects a subset of the results based on their index. `start` is inclusive, and `stop` is exclusive. +Example: +```python +G.V().range(5, 10) +``` +--- +## `.fields([fields])` +Specifies which fields of a vertex or edge to include or exclude in the output. By default, `_id`, `_label`, `_from`, and `_to` are included. + +If `fields` is empty, all properties are excluded. +If `fields` contains field names, only those properties are included. +If `fields` contains field names prefixed with `-`, those properties are excluded, and all others are included. + +Examples: + +Include only the 'symbol' property: +```python +G.V("vertex1").fields(["symbol"]) +``` + +Exclude the 'symbol' property: +```python +G.V("vertex1").fields(["-symbol"]) +``` +Exclude all properties: +```python +G.V("vertex1").fields([]) +``` + +--- + +## `.render(template)` + +Transforms the current selection into an arbitrary data structure defined by the `template`. The `template` is a string that can include placeholders for vertex/edge properties. + +Example: +```python +G.V("vertex1").render( {"node_info" : {"id": "$._id", "label": "$._label"}, "data" : {"whatToExpect": "$.climate"}} ) +``` + +Assuming `vertex1` has `_id`, `_label`, and `symbol` properties, this would return a JSON object with those fields. + +```json +{"node_info" : {"id" :"Planet:2", "label":"Planet"}, "data":{"whatToExpect":"arid"} } +``` diff --git a/docs/tools/grip/queries/record_transforms.md b/docs/tools/grip/queries/record_transforms.md new file mode 100644 index 0000000..bf3d589 --- /dev/null +++ b/docs/tools/grip/queries/record_transforms.md @@ -0,0 +1,131 @@ +--- +title: Record Transforms +menu: + main: + parent: Queries + weight: 5 +--- + + +# Record Manipulation + +## `.unwind(fields)` +Expands an array-valued field into multiple rows, one for each element in the array. 
+Example: + +Graph +```python +{"vertex" : {"_id":"1", "_label":"Thing", "stuff" : ["1", "2", "3"]}} +``` + +Query +```python +G.V("1").unwind("stuff") +``` + +Result +```json +{"_id":"1", "_label":"Thing", "stuff" : "1"} +{"_id":"1", "_label":"Thing", "stuff" : "2"} +{"_id":"1", "_label":"Thing", "stuff" : "3"} +``` + +## `.group({"dest":"field"})` +Collect all travelers that are on the same element while aggregating specific fields + +For the example: +```python +G.V().hasLabel("Planet").as_("planet").out("residents").as_("character").select("planet").group( {"people" : "$character.name"} ) +``` + +All of the travelers that start on the same planet go out to residents, collect them using the `as_` and then returning to the origin + +using the `select` statement. The group statement aggrigates the `name` fields from the character nodes that were visited and collects them + +into a list named `people` that is added to the current planet node. + +Output: +```json +{ + "vertex": { + "_id": "Planet:2", + "_label": "Planet", + "climate": "temperate", + "diameter": 12500, + "gravity": null, + "name": "Alderaan", + "orbital_period": 364, + "people": [ + "Leia Organa", + "Raymus Antilles" + ], + "population": 2000000000, + "rotation_period": 24, + "surface_water": 40, + "system": { + "created": "2014-12-10T11:35:48.479000Z", + "edited": "2014-12-20T20:58:18.420000Z" + }, + "terrain": [ + "grasslands", + "mountains" + ], + "url": "https://swapi.co/api/planets/2/" + } +} +{ + "vertex": { + "_id": "Planet:1", + "_label": "Planet", + "climate": "arid", + "diameter": 10465, + "gravity": null, + "name": "Tatooine", + "orbital_period": 304, + "people": [ + "Luke Skywalker", + "C-3PO", + "Darth Vader", + "Owen Lars", + "Beru Whitesun lars", + "R5-D4", + "Biggs Darklighter" + ], + "population": 200000, + "rotation_period": 23, + "surface_water": 1, + "system": { + "created": "2014-12-09T13:50:49.641000Z", + "edited": "2014-12-21T20:48:04.175778Z" + }, + "terrain": [ + "desert" + ], + "url": "https://swapi.co/api/planets/1/" + } +} +``` + +## `.pivot(id, key, value)` + +Aggregate fields across multiple records into a single record using a pivot operations. A pivot is +an operation where a two column matrix, with one columns for keys and another column for values, is +transformed so that the keys are used to name the columns and the values are put in those columns. 
+ +So the stream of vertices: + +``` +{"_id":"observation_a1", "_label":"Observation", "subject":"Alice", "key":"age", "value":36} +{"_id":"observation_a2", "_label":"Observation", "subject":"Alice", "key":"sex", "value":"Female"} +{"_id":"observation_a3", "_label":"Observation", "subject":"Alice", "key":"blood_pressure", "value":"111/78"} +{"_id":"observation_b1", "_label":"Observation", "subject":"Bob", "key":"age", "value":42} +{"_id":"observation_b2", "_label":"Observation", "subject":"Bob", "key":"sex", "value":"Male"} +{"_id":"observation_b3", "_label":"Observation", "subject":"Bob", "key":"blood_pressure", "value":"120/80"} +``` + +with `.pivot("subject", "key", "value")` will produce: + +``` +{"_id":"Alice", "age":36, "sex":"Female", "blood_pressure":"111/78"} +{"_id":"Bob", "age":42, "sex":"Male", "blood_pressure":"120/80"} +``` diff --git a/docs/tools/grip/queries/traversal_start.md b/docs/tools/grip/queries/traversal_start.md new file mode 100644 index 0000000..6a4fd1d --- /dev/null +++ b/docs/tools/grip/queries/traversal_start.md @@ -0,0 +1,30 @@ + +--- +title: Start a Traversal +menu: + main: + parent: Queries + weight: 1 +--- + +# Start a Traversal + +All traversal based queries must start with a `V()` command, starting the travalers on the vertices of the graph. + +## `.V([ids])` +Start query from Vertex + +```python +G.V() +``` + +Returns all vertices in graph + +```python +G.V(["vertex1"]) +``` + +Returns: +```json +{"_id" : "vertex1", "_label":"TestVertex"} +``` diff --git a/docs/tools/grip/queries/traverse_graph.md b/docs/tools/grip/queries/traverse_graph.md new file mode 100644 index 0000000..568f27d --- /dev/null +++ b/docs/tools/grip/queries/traverse_graph.md @@ -0,0 +1,76 @@ +--- +title: Traverse the Graph +menu: + main: + parent: Queries + weight: 3 +--- + +# Traverse the graph +To move travelers between different elements of the graph, the traversal commands `in_` and `out` move along the edges, respecting the directionality. The `out` commands follow `_from` to `_to`, while the `in_` command follows `_to` to `_from`. + +## `.in_(), inV()` +Following incoming edges. Optional argument is the edge label (or list of labels) that should be followed. If no argument is provided, all incoming edges. + +```python +G.V().in_(label=['edgeLabel1', 'edgeLabel2']) +``` +--- + +## `.out(), .outV()` +Following outgoing edges. Optional argument is the edge label (or list of labels) that should be followed. If no argument is provided, all outgoing edges. + +```python +G.V().out(label='edgeLabel') +``` +--- + +## `.both(), .bothV()` +Following all edges (both in and out). Optional argument is the edge label (or list of labels) that should be followed. If no argument is provided, all edges. + +```python +G.V().outE().both(label='edgeLabel') +``` +--- + +## `.inE()` +Following incoming edges, but return the edge as the next element. This can be used to inspect edge properties. Optional argument is the edge label (or list of labels) that should be followed. To return back to a vertex, use `.in_` or `.out` + +```python +G.V().inE(label='edgeLabel') +``` +--- + +## `.outE()` +Following outgoing edges, but return the edge as the next element. This can be used to inspect edge properties. Optional argument is the edge label (or list of labels) that should be followed. To return back to a vertex, use `.in_` or `.out` + +```python +G.V().outE(label='edgeLabel') +``` +--- + +## `.bothE()` +Following all edges, but return the edge as the next element. 
This can be used to inspect edge properties. Optional argument is the edge label (or list of labels) that should be followed. To return to a vertex, use `.in_` or `.out`
+
+```python
+G.V().bothE(label='edgeLabel')
+```
+---
+
+# AS and SELECT
+
+The `as_` and `select` commands allow a traveler to mark a step in the traversal and return to it at a later step.
+
+## `.as_(name)`
+Store current row for future reference
+
+```python
+G.V().as_("a").out().as_("b")
+```
+
+## `.select(name)`
+Move traveler to previously marked position
+
+```python
+G.V().mark("a").out().mark("b").select("a")
+```
diff --git a/docs/tools/grip/security/basic.md b/docs/tools/grip/security/basic.md
new file mode 100644
index 0000000..4bf232e
--- /dev/null
+++ b/docs/tools/grip/security/basic.md
@@ -0,0 +1,60 @@
+---
+title: Basic Auth
+
+menu:
+  main:
+    parent: Security
+    weight: 1
+---
+
+# Basic Auth
+
+By default, a GRIP server allows open access to its API endpoints, but it
+can be configured to require basic password authentication. To enable this,
+include users and passwords in your config file:
+
+```yaml
+Server:
+  BasicAuth:
+    - User: testuser
+      Password: abc123
+```
+
+Make sure to properly protect the configuration file so that it's not readable
+by everyone:
+
+```bash
+$ chmod 600 grip.config.yml
+```
+
+To use the password, set the `GRIP_USER` and `GRIP_PASSWORD` environment variables:
+```bash
+$ export GRIP_USER=testuser
+$ export GRIP_PASSWORD=abc123
+$ grip list
+```
+
+## Using the Python Client

+Some GRIP servers may require authorization to access their API endpoints. The client can be configured to pass
+authorization headers in its requests:
+
+```python
+import gripql
+
+# Basic Auth Header - {'Authorization': 'Basic dGVzdDpwYXNzd29yZA=='}
+G = gripql.Connection("https://bmeg.io", user="test", password="password").graph("bmeg")
+```
+
+Although GRIP only supports basic password authentication, some servers may be protected by an nginx or apache
+server.
The python client can be configured to handle these cases as well: + +```python +import gripql + +# Bearer Token - {'Authorization': 'Bearer iamnotarealtoken'} +G = gripql.Connection("https://bmeg.io", token="iamnotarealtoken").graph("bmeg") + +# OAuth2 / Custom - {"OauthEmail": "fake.user@gmail.com", "OauthAccessToken": "iamnotarealtoken", "OauthExpires": 1551985931} +G = gripql.Connection("https://bmeg.io", credential_file="~/.grip_token.json").graph("bmeg") +``` diff --git a/docs/tools/grip/tutorials/amazon.md b/docs/tools/grip/tutorials/amazon.md new file mode 100644 index 0000000..f215d7c --- /dev/null +++ b/docs/tools/grip/tutorials/amazon.md @@ -0,0 +1,75 @@ +--- +title: Amazon Purchase Network + +menu: + main: + parent: Tutorials + weight: 1 +--- + +# Explore Amazon Product Co-Purchasing Network Metadata + +Download the data + +``` +curl -O http://snap.stanford.edu/data/bigdata/amazon/amazon-meta.txt.gz +``` + +Convert the data into vertices and edges + +``` +python $GOPATH/src/github.com/bmeg/grip/example/amazon_convert.py amazon-meta.txt.gz amazon.data +``` + +Turn on grip and create a graph called 'amazon' + +``` +grip server & ; sleep 1 ; grip create amazon +``` + +Load the vertices/edges into the graph + +``` +grip load amazon --edge amazon.data.edge --vertex amazon.data.vertex +``` + +Query the graph + +_command line client_ + +``` +grip query amazon 'V().hasLabel("Video").out()' +``` + +The full command syntax and command list can be found at grip/gripql/javascript/gripql.js + +_python client_ + +Initialize a virtual environment and install gripql python package + +``` +python -m venv venv ; source venv/bin/activate +pip install -e gripql/python +``` + +Example code + +```python +import gripql + +conn = gripql.Connection("http://localhost:8201") + +g = conn.graph("amazon") + +# Count the Vertices +print("Total vertices: ", g.V().count().execute()) +# Count the Edges +print("Total edges: ", g.V().outE().count().execute()) + +# Try simple travesral +print("Edges connected to 'B00000I06U' vertex: %s" %g.V("B00000I06U").outE().execute()) + +# Find every Book that is similar to a DVD +for result in g.V().has(gripql.eq("group", "Book")).as_("a").out("similar").has(gripql.eq("group", "DVD")).as_("b").select("a"): + print(result) +``` diff --git a/docs/tools/grip/tutorials/pathway-commons.md b/docs/tools/grip/tutorials/pathway-commons.md new file mode 100644 index 0000000..d0d2308 --- /dev/null +++ b/docs/tools/grip/tutorials/pathway-commons.md @@ -0,0 +1,11 @@ + + +Get Pathway Commons release +``` +curl -O http://www.pathwaycommons.org/archives/PC2/v10/PathwayCommons10.All.BIOPAX.owl.gz +``` + +Convert to Property Graph +``` +grip rdf --dump --gzip pc PathwayCommons10.All.BIOPAX.owl.gz -m "http://pathwaycommons.org/pc2/#=pc:" -m "http://www.biopax.org/release/biopax-level3.owl#=biopax:" +``` diff --git a/docs/tools/grip/tutorials/tcga-rna.md b/docs/tools/grip/tutorials/tcga-rna.md new file mode 100644 index 0000000..3098295 --- /dev/null +++ b/docs/tools/grip/tutorials/tcga-rna.md @@ -0,0 +1,133 @@ +--- +title: TCGA RNA Expression + +menu: + main: + parent: Tutorials + weight: 2 +--- + +### Explore TCGA RNA Expression Data + +Create the graph + +``` +grip create tcga-rna +``` + +Get the data + +``` +curl -O http://download.cbioportal.org/gbm_tcga_pub2013.tar.gz +tar xvzf gbm_tcga_pub2013.tar.gz +``` + +Load clinical data + +``` +./example/load_matrix.py tcga-rna gbm_tcga_pub2013/data_clinical.txt --row-label 'Donor' +``` + +Load RNASeq data + +``` +./example/load_matrix.py tcga-rna 
gbm_tcga_pub2013/data_RNA_Seq_v2_expression_median.txt -t --index-col 1 --row-label RNASeq --row-prefix "RNA:" --exclude RNA:Hugo_Symbol +``` + +Connect RNASeq data to Clinical data + +``` +./example/load_matrix.py tcga-rna gbm_tcga_pub2013/data_RNA_Seq_v2_expression_median.txt -t --index-col 1 --no-vertex --edge 'RNA:{_id}' rna +``` + +Connect Clinical data to subtypes + +``` +./example/load_matrix.py tcga-rna gbm_tcga_pub2013/data_clinical.txt --no-vertex -e "{EXPRESSION_SUBTYPE}" subtype --dst-vertex "{EXPRESSION_SUBTYPE}" Subtype +``` + +Load Hugo Symbol to EntrezID translation table from RNA matrix annotations + +``` +./example/load_matrix.py tcga-rna gbm_tcga_pub2013/data_RNA_Seq_v2_expression_median.txt --column-include Entrez_Gene_Id --row-label Gene +``` + +Load Mutation Information + +``` +./example/load_matrix.py tcga-rna gbm_tcga_pub2013/data_mutations_extended.txt --skiprows 1 --index-col -1 --regex Matched_Norm_Sample_Barcode '\-\d\d$' '' --edge '{Matched_Norm_Sample_Barcode}' variantIn --edge '{Hugo_Symbol}' effectsGene --column-exclude ma_func.impact ma_fi.score MA_FI.score MA_Func.Impact MA:link.MSA MA:FImpact MA:protein.change MA:link.var MA:FIS MA:link.PDB --row-label Variant +``` + +Load Proneural samples into a matrix + +```python +import pandas +import gripql + +conn = gripql.Connection("http://localhost:8201") +g = conn.graph("tcga-rna") +genes = {} +for k, v in g.V().hasLabel("Gene").render(["_id", "Hugo_Symbol"]): + genes[k] = v +data = {} +for row in g.V("Proneural").in_().out("rna").render(["_id", "_data"]): + data[row[0]] = row[1] +samples = pandas.DataFrame(data).rename(genes).transpose().fillna(0.0) +``` + +# Matrix Load project + +``` +usage: load_matrix.py [-h] [--sep SEP] [--server SERVER] + [--row-label ROW_LABEL] [--row-prefix ROW_PREFIX] [-t] + [--index-col INDEX_COL] [--connect] + [--col-label COL_LABEL] [--col-prefix COL_PREFIX] + [--edge-label EDGE_LABEL] [--edge-prop EDGE_PROP] + [--columns [COLUMNS [COLUMNS ...]]] + [--column-include COLUMN_INCLUDE] [--no-vertex] + [-e EDGE EDGE] [--dst-vertex DST_VERTEX DST_VERTEX] + [-x EXCLUDE] [-d] + db input + +positional arguments: + db Destination Graph + input Input File + +optional arguments: + -h, --help show this help message and exit + --sep SEP TSV delimiter + --server SERVER Server Address + --row-label ROW_LABEL + Vertex Label used when loading rows + --row-prefix ROW_PREFIX + Prefix added to row vertex id + -t, --transpose Transpose matrix + --index-col INDEX_COL + Column number to use as index (and id for vertex + load) + --connect Switch to 'fully connected mode' and load matrix cell + values on edges between row and column names + --col-label COL_LABEL + Column vertex label in 'connect' mode + --col-prefix COL_PREFIX + Prefix added to col vertex id in 'connect' mode + --edge-label EDGE_LABEL + Edge label for edges in 'connect' mode + --edge-prop EDGE_PROP + Property name for storing value when in 'connect' mode + --columns [COLUMNS [COLUMNS ...]] + Rename columns in TSV + --column-include COLUMN_INCLUDE + List subset of columns to use from TSV + --no-vertex Do not load row as vertex + -e EDGE EDGE, --edge EDGE EDGE + Create an edge the connected the current row vertex + args: + --dst-vertex DST_VERTEX DST_VERTEX + Create a destination vertex, args: + + -x EXCLUDE, --exclude EXCLUDE + Exclude row id + -d Run in debug mode. 
Print actions and make no changes + +``` diff --git a/docs/tools/index.md b/docs/tools/index.md new file mode 100644 index 0000000..195f28d --- /dev/null +++ b/docs/tools/index.md @@ -0,0 +1,34 @@ +# CALYPR Tools Ecosystem + +The CALYPR platform provides a suite of powerful, open-source tools designed to handle every stage of the genomic data lifecycle—from ingestion and versioning to distributed analysis and graph-based discovery. + +--- + +### [Git-DRS](git-drs/index.md) +**The Version Control Layer.** +Git-DRS is a specialized extension for Git that manages massive genomic datasets using the GA4GH Data Repository Service (DRS) standard. It allows researchers to track, version, and share petabyte-scale files as easily as code, replacing heavy binaries with lightweight pointer files that resolve to immutable cloud objects. + +### [Funnel](funnel/index.md) +**The Compute Layer.** +Funnel is a distributed task execution engine that implements the GA4GH Task Execution Service (TES) API. It provides a standardized way to run Docker-based analysis pipelines across diverse environments—including Kubernetes, AWS, and Google Cloud—ensuring that your workflows are portable and independent of the underlying infrastructure. + +### [GRIP](grip/index.md) +**The Discovery Layer.** +GRIP (Graph Resource Integration Platform) is a high-performance graph database and query engine designed for complex biological data. It enables analysts to integrate heterogeneous datasets into a unified knowledge graph and perform sophisticated queries that reveal deep relational insights across multi-omic cohorts. + + +--- + +## Choosing the Right Tool + +| If you want to... | Use this tool | +| --- | --- | +| Version and share large genomic files | **Git-DRS** | +| Run batch analysis or Nextflow pipelines | **Funnel** | +| Query complex relationships between datasets | **GRIP** | +| Access Gen3 data from the command line | **Data Client** | + +--- + +!!! tip "Getting Started" + If you are new to the platform, we recommend starting with the [Quick Start Guide](../calypr/quick-start.md) to install the necessary binaries and set up your first workspace. diff --git a/docs/tools/sifter/.nav.yml b/docs/tools/sifter/.nav.yml new file mode 100644 index 0000000..67eeab2 --- /dev/null +++ b/docs/tools/sifter/.nav.yml @@ -0,0 +1,9 @@ +title: Sifter +nav: + - index.md + - docs/example.md + - docs/schema.md + - docs/config.md + - docs/inputs + - docs/transforms + - docs/outputs diff --git a/docs/tools/sifter/assets/sifter_example.png b/docs/tools/sifter/assets/sifter_example.png new file mode 100644 index 0000000..284e0dd Binary files /dev/null and b/docs/tools/sifter/assets/sifter_example.png differ diff --git a/docs/tools/sifter/docs/.nav.yml b/docs/tools/sifter/docs/.nav.yml new file mode 100644 index 0000000..1d7fa65 --- /dev/null +++ b/docs/tools/sifter/docs/.nav.yml @@ -0,0 +1,11 @@ + +title: Sifter Documentation + +nav: + - index.md + - example.md + - schema.md + - config.md + - inputs + - transforms + - outputs \ No newline at end of file diff --git a/docs/tools/sifter/docs/config.md b/docs/tools/sifter/docs/config.md new file mode 100644 index 0000000..38ab63d --- /dev/null +++ b/docs/tools/sifter/docs/config.md @@ -0,0 +1,34 @@ +--- +title: Paramaters +--- + +## Paramaters Variables + +Playbooks can be parameterized. They are defined in the `params` section of the playbook YAML file. 
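+
+Once declared, a parameter is referenced elsewhere in the playbook using the Mustache-style template syntax, e.g. `{{params.variableName}}`. The sketch below illustrates this; the `zipcode` parameter name and paths are illustrative, borrowed from the example pipeline documentation:
+
+```yaml
+params:
+  zipcode:
+    type: File
+    default: ../data/ZIP-COUNTY-FIPS_2017-06.csv
+
+inputs:
+  zipcode:
+    table:
+      path: "{{params.zipcode}}"
+      sep: ","
+```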
+ +### Configuration Syntax +```yaml +params: + variableName: + type: File # one of: File, Path, String, Number + default: "path/to/default" +``` + +### Supported Types +- `File`: Represents a file path +- `Dir`: Represents a directory path + +### Example Configuration +```yaml +params: + inputDir: + type: Dir + default: "/data/input" + outputDir: + type: Dir + default: "/data/output" + schemaFile: + type: File + default: "/config/schema.json" +``` + diff --git a/docs/tools/sifter/docs/developers/source_mapping.md b/docs/tools/sifter/docs/developers/source_mapping.md new file mode 100644 index 0000000..335e52f --- /dev/null +++ b/docs/tools/sifter/docs/developers/source_mapping.md @@ -0,0 +1,48 @@ +# SIFTER Project Documentation to Source Code Mapping + +## Inputs + +| Documentation File | Source Code File | +|-------------------|------------------| +| docs/docs/inputs/avro.md | extractors/avro_load.go | +| docs/docs/inputs/embedded.md | extractors/embedded.go | +| docs/docs/inputs/glob.md | extractors/glob_load.go | +| docs/docs/inputs/json.md | extractors/json_load.go | +| docs/docs/inputs/plugin.md | extractors/plugin_load.go | +| docs/docs/inputs/sqldump.md | extractors/sqldump_step.go | +| docs/docs/inputs/sqlite.md | extractors/sqlite_load.go | +| docs/docs/inputs/table.md | extractors/tabular_load.go | +| docs/docs/inputs/xml.md | extractors/xml_step.go | + +## Transforms + +| Documentation File | Source Code File | +|-------------------|------------------| +| docs/docs/transforms/accumulate.md | transform/accumulate.go | +| docs/docs/transforms/clean.md | transform/clean.go | +| docs/docs/transforms/debug.md | transform/debug.go | +| docs/docs/transforms/distinct.md | transform/distinct.go | +| docs/docs/transforms/fieldParse.md | transform/field_parse.go | +| docs/docs/transforms/fieldProcess.md | transform/field_process.go | +| docs/docs/transforms/fieldType.md | transform/field_type.go | +| docs/docs/transforms/filter.md | transform/filter.go | +| docs/docs/transforms/flatmap.md | transform/flat_map.go | +| docs/docs/transforms/from.md | transform/from.go | +| docs/docs/transforms/hash.md | transform/hash.go | +| docs/docs/transforms/lookup.md | transform/lookup.go | +| docs/docs/transforms/map.md | transform/mapping.go | +| docs/docs/transforms/objectValidate.md | transform/object_validate.go | +| docs/docs/transforms/plugin.md | transform/plugin.go | +| docs/docs/transforms/project.md | transform/project.go | +| docs/docs/transforms/reduce.md | transform/reduce.go | +| docs/docs/transforms/regexReplace.md | transform/regex.go | +| docs/docs/transforms/split.md | transform/split.go | +| docs/docs/transforms/uuid.md | transform/uuid.go | + +## Outputs + +| Documentation File | Source Code File | +|-------------------|------------------| +| docs/docs/outputs/graphBuild.md | playbook/output_graph.go | +| docs/docs/outputs/json.md | playbook/output_json.go | +| docs/docs/outputs/tableWrite.md | playbook/output_table.go | \ No newline at end of file diff --git a/docs/tools/sifter/docs/example.md b/docs/tools/sifter/docs/example.md new file mode 100644 index 0000000..3b0e03f --- /dev/null +++ b/docs/tools/sifter/docs/example.md @@ -0,0 +1,196 @@ +--- +render_macros: false +--- + +# Example Pipeline +Our first task will be to convert a ZIP code TSV into a set of county level +entries. 
+
+The input file looks like:
+
+```csv
+ZIP,COUNTYNAME,STATE,STCOUNTYFP,CLASSFP
+36003,Autauga County,AL,01001,H1
+36006,Autauga County,AL,01001,H1
+36067,Autauga County,AL,01001,H1
+36066,Autauga County,AL,01001,H1
+36703,Autauga County,AL,01001,H1
+36701,Autauga County,AL,01001,H1
+36091,Autauga County,AL,01001,H1
+```
+
+First is the header of the pipeline. This declares the
+unique name of the pipeline and its output directory.
+
+```yaml
+name: zipcode_map
+outdir: ./
+docs: Converts zipcode TSV into graph elements
+```
+
+Next, the parameters are declared. In this case the only parameter is the path to the
+zipcode TSV. There is a default value, so the pipeline can be invoked without passing in
+any parameters. However, to apply this pipeline to a new input file, the
+input parameter `zipcode` could be used to define the source file.
+Path and File parameters can be relative to the directory that the playbook file is in.
+
+```yaml
+params:
+  schema:
+    type: path
+    default: ../covid19_datadictionary/gdcdictionary/schemas/
+  zipcode:
+    type: path
+    default: ../data/ZIP-COUNTY-FIPS_2017-06.csv
+```
+
+The `inputs` section declares data input sources. In this pipeline, there is
+only one input, which is to run the table loader.
+```yaml
+inputs:
+  zipcode:
+    table:
+      path: "{{params.zipcode}}"
+      sep: ","
+```
+
+The table loader operates on the input file that was originally passed in, as
+declared in the `inputs` stanza. SIFTER string parsing is based on the Mustache template system;
+to access the string that was passed in, the template is `{{params.zipcode}}`.
+The separator in the input file is a `,`, so that is also passed in as a
+parameter to the extractor.
+
+
+The `table` extractor opens up the TSV and generates one message for
+every row in the file. It uses the header of the file to map the column values
+into a dictionary. The first row would produce the message:
+
+```json
+{
+  "ZIP" : "36003",
+  "COUNTYNAME" : "Autauga County",
+  "STATE" : "AL",
+  "STCOUNTYFP" : "01001",
+  "CLASSFP" : "H1"
+}
+```
+
+The stream of messages is then passed into the steps listed in the `transform`
+section of the tableLoad extractor.
+
+For the current transform, we want to produce a single entry per `STCOUNTYFP`;
+however, the file has a line per `ZIP`. We need to run a `reduce` transform
+that collects rows together using a field key, which in this case is `"{{row.STCOUNTYFP}}"`,
+and then runs a function `merge` that takes two messages, merges them together,
+and produces a single output message.
+
+The two messages:
+
+```json
+{ "ZIP" : "36003", "COUNTYNAME" : "Autauga County", "STATE" : "AL", "STCOUNTYFP" : "01001", "CLASSFP" : "H1"}
+{ "ZIP" : "36006", "COUNTYNAME" : "Autauga County", "STATE" : "AL", "STCOUNTYFP" : "01001", "CLASSFP" : "H1"}
+```
+
+Would be merged into the message:
+
+```json
+{ "ZIP" : ["36003", "36006"], "COUNTYNAME" : "Autauga County", "STATE" : "AL", "STCOUNTYFP" : "01001", "CLASSFP" : "H1"}
+```
+
+The `reduce` transform step uses a block of python code to describe the function.
+The `method` field names the function, in this case `merge`, which will be used
+as the reduce function.
+
+```yaml
+  zipReduce:
+    - from: zipcode
+    - reduce:
+        field: STCOUNTYFP
+        method: merge
+        python: >
+          def merge(x,y):
+            a = x.get('zipcodes', []) + [x['ZIP']]
+            b = y.get('zipcodes', []) + [y['ZIP']]
+            x['zipcodes'] = a + b
+            return x
+```
+
+The original messages produced by the loader have all of the information required
+by the `summary_location` object type as described by the JSON schema that was linked
+to in the header stanza. However, the data is all under the wrong field names.
+To remap the data, we use a `project` transformation that uses the template engine
+to project data into new fields in the message. The template engine has the current
+message data in the value `row`. So the value
+`FIPS:{{row.STCOUNTYFP}}` is mapped into the field `id`.
+
+```yaml
+    - project:
+        mapping:
+          id: "FIPS:{{row.STCOUNTYFP}}"
+          province_state: "{{row.STATE}}"
+          summary_locations: "{{row.STCOUNTYFP}}"
+          county: "{{row.COUNTYNAME}}"
+          submitter_id: "{{row.STCOUNTYFP}}"
+          type: summary_location
+          projects: []
+```
+
+Using this projection, the message:
+
+```json
+{
+  "ZIP" : ["36003", "36006"],
+  "COUNTYNAME" : "Autauga County",
+  "STATE" : "AL",
+  "STCOUNTYFP" : "01001",
+  "CLASSFP" : "H1"
+}
+```
+
+would become
+
+```json
+{
+  "id" : "FIPS:01001",
+  "province_state" : "AL",
+  "summary_locations" : "01001",
+  "county" : "Autauga County",
+  "submitter_id" : "01001",
+  "type" : "summary_location",
+  "projects" : [],
+  "ZIP" : ["36003", "36006"],
+  "COUNTYNAME" : "Autauga County",
+  "STATE" : "AL",
+  "STCOUNTYFP" : "01001",
+  "CLASSFP" : "H1"
+}
+```
+
+Now that the data has been remapped, we pass the data into the `objectValidate`
+step, which will open the schema directory, find the class titled `summary_location`, check the
+message to make sure it matches, and then output it.
+
+```yaml
+    - objectValidate:
+        title: summary_location
+        schema: "{{params.schema}}"
+```
+
+
+Finally, the `outputs` section creates an output table with two columns connecting
+`ZIP` values to `STCOUNTYFP` values. The `STCOUNTYFP` is a county level FIPS
+code, used by the census office. A single FIPS code may contain many ZIP codes,
+and we can use this table later for mapping ids when loading the data into a database.
+ +```yaml +outputs: + zip2fips: + tableWrite: + from: zipReduce + path: zip2fips.tsv + columns: + - ZIP + - STCOUNTYFP +``` diff --git a/docs/tools/sifter/docs/inputs/avro.md b/docs/tools/sifter/docs/inputs/avro.md new file mode 100644 index 0000000..570afbb --- /dev/null +++ b/docs/tools/sifter/docs/inputs/avro.md @@ -0,0 +1,17 @@ +--- +title: avro +menu: + main: + parent: inputs + weight: 100 +--- + +# avro +Load an AvroFile + +## Parameters + +| name | Description | +| --- | --- | +| path | Path to input file | + diff --git a/docs/tools/sifter/docs/inputs/embedded.md b/docs/tools/sifter/docs/inputs/embedded.md new file mode 100644 index 0000000..2d2c717 --- /dev/null +++ b/docs/tools/sifter/docs/inputs/embedded.md @@ -0,0 +1,21 @@ +--- +title: embedded +menu: + main: + parent: inputs + weight: 100 +--- + +# embedded +Load data from embedded structure + + +## Example + +```yaml +inputs: + data: + embedded: + - { "name" : "Alice", "age": 28 } + - { "name" : "Bob", "age": 27 } +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/inputs/glob.md b/docs/tools/sifter/docs/inputs/glob.md new file mode 100644 index 0000000..1bc4c01 --- /dev/null +++ b/docs/tools/sifter/docs/inputs/glob.md @@ -0,0 +1,30 @@ +--- +title: glob +render_macros: false +--- + +# glob +Scan files using `*` based glob statement and open all files +as input. + +## Parameters + +| Name | Description | +|-------|--------| +| storeFilename | Store value of filename in parameter each row | +| input | Path of avro object file to transform | +| xml | xmlLoad configutation | +| table | Run transform pipeline on a TSV or CSV | +| json | Run a transform pipeline on a multi line json file | +| avro | Load data from avro file | + +## Example + +```yaml +inputs: + pubmedRead: + glob: + path: "{{params.baseline}}/*.xml.gz" + xml: {} + +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/inputs/index.md b/docs/tools/sifter/docs/inputs/index.md new file mode 100644 index 0000000..5b16e97 --- /dev/null +++ b/docs/tools/sifter/docs/inputs/index.md @@ -0,0 +1,53 @@ +--- +title: Inputs +render_macros: false +--- + +Every playbook has a section of **input loaders** – components that read raw data (files, APIs, databases, etc.) and convert it into Python objects for downstream steps. +An *input* can accept user‑supplied values passed by the **params** section. + +## Common input types + +* `table` – extracts data from tabular files (TSV/CSV) +* `avro` – loads an Avro file (see `docs/docs/inputs/avro.md`) +* `json`, `csv`, `sql`, etc. + +## Example – `table` + +The `table` loader is a good starting point because it demonstrates the typical parameter set required by most inputs. 
See the full specification in `docs/docs/inputs/table.md`: + +```yaml +params: + gafFile: + type: File + value: ../../source/go/goa_human.gaf.gz + +inputs: + gafLoad: + tableLoad: + path: "{{params.gafFile}}" + columns: + - db + - id + - symbol + - qualifier + - goID + - reference + - evidenceCode + - from + - aspect + - name + - synonym + - objectType + - taxon + - date + - assignedBy + - extension + - geneProduct +``` + +When you run the playbook you can override any of these parameters, e.g.: + +```bash +sifter run gatplaybook.yaml --param gafFile=/tmp/mydata.tsv +``` diff --git a/docs/tools/sifter/docs/inputs/json.md b/docs/tools/sifter/docs/inputs/json.md new file mode 100644 index 0000000..58f4f9b --- /dev/null +++ b/docs/tools/sifter/docs/inputs/json.md @@ -0,0 +1,24 @@ +--- +title: json +render_macros: false +--- + +# json +Load data from a JSON file. Default behavior expects a single dictionary per line. Each line is a seperate entry. The `multiline` parameter reads all of the lines of the files and returns a single object. + +## Parameters + +| name | Description | +| --- | --- | +| path | Path of JSON file to transform | +| multiline | Load file as a single multiline JSON object | + + +## Example + +```yaml +inputs: + caseData: + json: + path: "{{params.casesJSON}}" +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/inputs/plugin.md b/docs/tools/sifter/docs/inputs/plugin.md new file mode 100644 index 0000000..9037595 --- /dev/null +++ b/docs/tools/sifter/docs/inputs/plugin.md @@ -0,0 +1,71 @@ +--- +title: input plugin +render_macros: false +--- + +# plugin +Run user program for customized data extraction. + +## Example + +```yaml +inputs: + oboData: + plugin: + commandLine: ../../util/obo_reader.py {{params.oboFile}} +``` + +The plugin program is expected to output JSON messages, one per line, to STDOUT that will then +be passed to the transform pipelines. + +## Example Plugin +The `obo_reader.py` plugin, it reads a OBO file, such as the kind the describe the GeneOntology, and emits the +records as single line JSON messages. +```python + #!/usr/bin/env python + +import re +import sys +import json + +re_section = re.compile(r'^\[(.*)\]') +re_field = re.compile(r'^(\w+): (.*)$') + +def obo_parse(handle): + rec = None + for line in handle: + res = re_section.search(line) + if res: + if rec is not None: + yield rec + rec = None + if res.group(1) == "Term": + rec = {"type": res.group(1)} + else: + if rec is not None: + res = re_field.search(line) + if res: + key = res.group(1) + val = res.group(2) + val = re.split(" ! | \(|\)", val) + val = ":".join(val[0:3]) + if key in rec: + rec[key].append(val) + else: + rec[key] = [val] + + if rec is not None: + yield rec + + +def unquote(s): + res = re.search(r'"(.*)"', s) + if res: + return res.group(1) + return s + + +with open(sys.argv[1]) as handle: + for rec in obo_parse(handle): + print(json.dumps(rec)) +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/inputs/sqldump.md b/docs/tools/sifter/docs/inputs/sqldump.md new file mode 100644 index 0000000..5fdbd9a --- /dev/null +++ b/docs/tools/sifter/docs/inputs/sqldump.md @@ -0,0 +1,31 @@ +--- +title: sqldump +render_macros: false +--- + +# sqlDump +Scan file produced produced from sqldump. 
+ +## Parameters + +| Name | Type | Description | +|-------|---|--------| +| path | string | Path to the SQL dump file | +| tables | []string | Names of tables to read out | + +## Example + +```yaml +inputs: + database: + sqldumpLoad: + path: "{{params.sql}}" + tables: + - cells + - cell_tissues + - dose_responses + - drugs + - drug_annots + - experiments + - profiles +``` diff --git a/docs/tools/sifter/docs/inputs/sqlite.md b/docs/tools/sifter/docs/inputs/sqlite.md new file mode 100644 index 0000000..7cd96e6 --- /dev/null +++ b/docs/tools/sifter/docs/inputs/sqlite.md @@ -0,0 +1,27 @@ +--- +title: sqlite +render_macros: false +--- + +# sqlite + +Extract data from an sqlite file + +## Parameters + +| Name | Type | Description | +|-------|---|--------| +| path | string | Path to the SQLite file | +| query | string | SQL select statement based input | + +## Example + +```yaml + +inputs: + sqlQuery: + sqliteLoad: + path: "{{params.sqlite}}" + query: "select * from drug_mechanism as a LEFT JOIN MECHANISM_REFS as b on a.MEC_ID=b.MEC_ID LEFT JOIN TARGET_COMPONENTS as c on a.TID=c.TID LEFT JOIN COMPONENT_SEQUENCES as d on c.COMPONENT_ID=d.COMPONENT_ID LEFT JOIN MOLECULE_DICTIONARY as e on a.MOLREGNO=e.MOLREGNO" + +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/inputs/table.md b/docs/tools/sifter/docs/inputs/table.md new file mode 100644 index 0000000..a931eb5 --- /dev/null +++ b/docs/tools/sifter/docs/inputs/table.md @@ -0,0 +1,53 @@ +--- +title: table +render_macros: false +--- + +# table + +Extract data from tabular file, includiong TSV and CSV files. + +## Parameters + +| Name | Type | Description | +|-------|---|--------| +| path | string | File to be transformed | +| rowSkip | int | Number of header rows to skip | +| columns | []string | Manually set names of columns | +| extraColumns | string | Columns beyond originally declared columns will be placed in this array | +| sep | string | Separator \\t for TSVs or , for CSVs | + + +## Example + +```yaml + +params: + gafFile: + default: ../../source/go/goa_human.gaf.gz + type: File + +inputs: + gafLoad: + tableLoad: + path: "{{params.gafFile}}" + columns: + - db + - id + - symbol + - qualifier + - goID + - reference + - evidenceCode + - from + - aspect + - name + - synonym + - objectType + - taxon + - date + - assignedBy + - extension + - geneProduct + +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/inputs/xml.md b/docs/tools/sifter/docs/inputs/xml.md new file mode 100644 index 0000000..41aeb0e --- /dev/null +++ b/docs/tools/sifter/docs/inputs/xml.md @@ -0,0 +1,22 @@ +--- +title: xml +render_macros: false +--- + +# xml +Load an XML file + +## Parameters + +| name | Description | +| --- | --- | +| path | Path to input file | + +## Example + +```yaml +inputs: + loader: + xmlLoad: + path: "{{params.xmlPath}}" +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/outputs/graphBuild.md b/docs/tools/sifter/docs/outputs/graphBuild.md new file mode 100644 index 0000000..f71ade6 --- /dev/null +++ b/docs/tools/sifter/docs/outputs/graphBuild.md @@ -0,0 +1,16 @@ +--- +title: graphBuild +render_macros: false +--- + +# Output: graphBuild + +Build graph elements from JSON objects using the JSON Schema graph extensions. 
+ + +# example +```yaml + - graphBuild: + schema: "{{params.allelesSchema}}" + title: Allele +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/outputs/json.md b/docs/tools/sifter/docs/outputs/json.md new file mode 100644 index 0000000..d6e93e2 --- /dev/null +++ b/docs/tools/sifter/docs/outputs/json.md @@ -0,0 +1,26 @@ +--- +title: json +menu: + main: + parent: transforms + weight: 100 +--- + +# Output: json + +Send data to output file. The naming of the file is `outdir`/`path` + +## Parameters + +| name | Type | Description | +| --- | --- | --- | +| path | string | Path to output file | + +## example + +```yaml +output: + outfile: + json: + path: protein_compound_association.ndjson +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/outputs/tableWrite.md b/docs/tools/sifter/docs/outputs/tableWrite.md new file mode 100644 index 0000000..7a39648 --- /dev/null +++ b/docs/tools/sifter/docs/outputs/tableWrite.md @@ -0,0 +1,7 @@ +--- +title: tableWrite +menu: + main: + parent: transforms + weight: 100 +--- diff --git a/docs/tools/sifter/docs/schema.md b/docs/tools/sifter/docs/schema.md new file mode 100644 index 0000000..b4951ba --- /dev/null +++ b/docs/tools/sifter/docs/schema.md @@ -0,0 +1,149 @@ +--- +title: Schema +--- + +# Sifter Playbook Schema + +This document provides a comprehensive description of the Sifter Playbook format, its input methods (extractors), and its transformation steps. + +## Playbook Structure + +A Playbook is a YAML file that defines an ETL pipeline. + +| Field | Type | Description | +| :--- | :--- | :--- | +| `class` | string | Should be `sifter`. | +| `name` | string | Unique name of the playbook. | +| `docs` | string | Documentation string for the playbook. | +| `outdir` | string | Default output directory for emitted files. | +| `params` | map | Configuration variables with optional defaults and types (`File`, `Dir`). | +| `inputs` | map | Named extractor definitions. | +| `outputs` | map | Named outputs definitions. | +| `pipelines` | map | Named transformation pipelines (arrays of steps). | + +--- + +## Parameters (`params`) + + +Parameters allow playbooks to be parameterized. They are defined in the `params` section of the playbook YAML file. + +### Params Syntax +```yaml +params: + variableName: + type: File # or Dir + default: "path/to/default" +``` + +### Supported Types +- `File`: Represents a file path +- `Dir`: Represents a directory path + +```yaml +params: + inputDir: + type: Dir + default: "./data/input" + outputDir: + type: Dir + default: "./data/output" + schemaFile: + type: File + default: "./config/schema.json" +``` + + +## Input Methods (Extractors) + +Extractors produce a stream of messages from various sources. + +### `table` +Loads data from a delimited file (TSV/CSV). +- `path`: Path to the file. +- `rowSkip`: Number of header rows to skip. +- `columns`: Optional list of column names. +- `extraColumns`: Field name to store any columns beyond the declared ones. +- `sep`: Separator (default `\t` for TSVs, `,` for CSVs). + +### `json` +Loads data from a JSON file (standard or line-delimited). +- `path`: Path to the file. +- `multiline`: Load file as a single multiline JSON object. + +### `avro` +Loads data from an Avro object file. +- `path`: Path to the file. + +### `xml` +Loads and parses XML data. +- `path`: Path to the file. +- `level`: Depth level to start breaking XML into discrete messages. + +### `sqlite` +Loads data from a SQLite database. +- `path`: Path to the database file. 
+- `query`: SQL SELECT statement. + +### `transpose` +Loads a TSV and transposes it (making rows from columns). +- `input`: Path to the file. +- `rowSkip`: Rows to skip. +- `sep`: Separator. +- `useDB`: Use a temporary disk database for large transpositions. + +### `plugin` (Extractor) +Runs an external command that produces JSON messages to stdout. +- `commandLine`: The command to execute. + +### `embedded` (Extractor) +Load data from embedded structure. +- No parameters required. + +### `glob` (Extractor) +Scan files using `*` based glob statement and open all files as input. +- `path`: Path of avro object file to transform. +- `storeFilename`: Store value of filename in parameter each row. +- `xml`: xmlLoad data. +- `table`: Run transform pipeline on a TSV or CSV. +- `json`: Run a transform pipeline on a multi line json file. +- `avro`: Load data from avro file. + +--- + +## Transformation Steps + +Transformation pipelines are arrays of steps. Each step can be one of the following: + +### Core Processing +- `from`: Start a pipeline from a named input or another pipeline. +- `emit`: Write messages to a JSON file. Fields: `name`, `useName` (bool). +- `objectValidate`: Validate messages against a JSON schema. Fields: `title`, `schema` (directory), `uri`. +- `debug`: Print message contents to stdout. Fields: `label`, `format`. +- `plugin` (Transform): Pipe messages through an external script via stdin/stdout. Fields: `commandLine`. + +### Mapping and Projection +- `project`: Map templates into new fields. Fields: `mapping` (key-template pairs), `rename` (simple rename). +- `map`: Apply a Python/GPython function to each record. Fields: `method` (function name), `python` (code string), `gpython` (path or code). +- `flatMap`: Similar to `map`, but flattens list responses into multiple messages. +- `fieldParse`: Parse a string field (e.g. `key1=val1;key2=val2`) into individual keys. Fields: `field`, `sep`. +- `fieldType`: Cast fields to specific types (`int`, `float`, `list`). Represented as a map of `fieldName: type`. + +### Filtering and Cleaning +- `filter`: Drop messages based on criteria. Fields: `field`, `value`, `match`, `check` (`exists`/`hasValue`/`not`), or `python`/`gpython` code. +- `clean`: Remove fields. Fields: `fields` (list of kept fields), `removeEmpty` (bool), `storeExtra` (target field for extras). +- `dropNull`: Remove fields with `null` values from a message. +- `distinct`: Only emit messages with a unique value once. Field: `value` (template). + +### Grouping and Lookups +- `reduce`: Merge messages sharing a key. Fields: `field` (key), `method`, `python`/`gpython`, `init` (initial data). +- `accumulate`: Group all messages sharing a key into a list. Fields: `field` (key), `dest` (target list field). +- `lookup`: Join data from external files (TSV/JSON). Fields: `tsv`, `json`, `replace`, `lookup`, `copy` (mapping of fields to copy). +- `intervalIntersect`: Match genomic intervals. Fields: `match` (CHR), `start`, `end`, `field` (dest), `json` (source file). + +### Specialized +- `hash`: Generate a hash of a field. Fields: `field` (dest), `value` (template), `method` (`md5`, `sha1`, `sha256`). +- `uuid`: Generate a UUID. Fields: `field`, `value` (seed), `namespace`. +- `graphBuild`: Convert messages into graph vertices and edges using schema definitions. Fields: `schema`, `title`. +- `tableWrite`: Write specific fields to a delimited output file. Fields: `output`, `columns`, `sep`, `header`, `skipColumnHeader`. 
diff --git a/docs/tools/sifter/docs/transforms/accumulate.md b/docs/tools/sifter/docs/transforms/accumulate.md
new file mode 100644
index 0000000..3d7439e
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/accumulate.md
@@ -0,0 +1,26 @@
+---
+title: accumulate
+menu:
+  main:
+    parent: transforms
+    weight: 100
+---
+
+# accumulate
+
+Gather sequential rows into a single record, based on a matching field
+
+## Parameters
+
+| name | Type | Description |
+| --- | --- | --- |
+| field | string (field path) | Field used to match rows |
+| dest | string | Field to store accumulated records |
+
+## Example
+
+```yaml
+  - accumulate:
+      field: model_id
+      dest: rows
+```
diff --git a/docs/tools/sifter/docs/transforms/clean.md b/docs/tools/sifter/docs/transforms/clean.md
new file mode 100644
index 0000000..fcff2c0
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/clean.md
@@ -0,0 +1,28 @@
+---
+title: clean
+menu:
+  main:
+    parent: transforms
+    weight: 100
+---
+
+# clean
+
+Remove fields that don't appear in the designated list.
+
+## Parameters
+
+| name | Type | Description |
+| --- | --- | --- |
+| fields | [] string | Fields to keep |
+| removeEmpty | bool | Fields with empty values will also be removed |
+| storeExtra | string | Field name to store removed fields |
+
+## Example
+
+```yaml
+  - clean:
+      fields:
+        - id
+        - synonyms
+```
\ No newline at end of file
diff --git a/docs/tools/sifter/docs/transforms/debug.md b/docs/tools/sifter/docs/transforms/debug.md
new file mode 100644
index 0000000..e8479aa
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/debug.md
@@ -0,0 +1,21 @@
+---
+title: debug
+---
+
+# debug
+
+Print a copy of the stream to the log output
+
+## Parameters
+
+| name | Type | Description |
+| --- | --- | --- |
+| label | string | Label for log output |
+| format | bool | Use multiline spaced output |
+
+
+# Example
+
+```yaml
+  - debug: {}
+```
\ No newline at end of file
diff --git a/docs/tools/sifter/docs/transforms/distinct.md b/docs/tools/sifter/docs/transforms/distinct.md
new file mode 100644
index 0000000..b1d287a
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/distinct.md
@@ -0,0 +1,20 @@
+---
+title: distinct
+render_macros: false
+---
+
+# distinct
+Using a templated value, allow only the first record for each distinct key
+
+## Parameters
+
+| name | Type | Description |
+| --- | --- | --- |
+| value | string | Key used for distinct value |
+
+## Example
+
+```yaml
+  - distinct:
+      value: "{{row.key}}"
+```
\ No newline at end of file
diff --git a/docs/tools/sifter/docs/transforms/fieldParse.md b/docs/tools/sifter/docs/transforms/fieldParse.md
new file mode 100644
index 0000000..4be41ff
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/fieldParse.md
@@ -0,0 +1,26 @@
+---
+title: fieldParse
+menu:
+  main:
+    parent: transforms
+    weight: 100
+---
+
+# fieldParse
+
+Parse a string field (e.g. `key1=val1;key2=val2`) into individual keys.
+ +## Parameters + +| Name | Type | Description | +| --- | --- | --- | +| field | string | The field containing the string to be parsed | +| sep | string | Separator character used to split the string | + +## Example + +```yaml + - fieldParse: + field: attributes + sep: ";" +``` diff --git a/docs/tools/sifter/docs/transforms/fieldProcess.md b/docs/tools/sifter/docs/transforms/fieldProcess.md new file mode 100644 index 0000000..7ed2bae --- /dev/null +++ b/docs/tools/sifter/docs/transforms/fieldProcess.md @@ -0,0 +1,29 @@ +--- +title: fieldProcess +render_macros: false +--- + + +# fieldProcess + +Create stream of objects based on the contents of a field. If the selected field is an array +each of the items in the array will become an independent row. + +## Parameters + +| name | Type | Description | +| --- | --- | --- | +| field | string | Name of field to be processed | +| mapping | map[string]string | Project templated values into child element | +| itemField | string | If processing an array of non-dict elements, create a dict as `{itemField:element}` | + + +## example + +```yaml + - fieldProcess: + field: portions + mapping: + sample: "{{row.sample_id}}" + project_id: "{{row.project_id}}" +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/transforms/fieldType.md b/docs/tools/sifter/docs/transforms/fieldType.md new file mode 100644 index 0000000..6b520bb --- /dev/null +++ b/docs/tools/sifter/docs/transforms/fieldType.md @@ -0,0 +1,27 @@ +--- +title: fieldType +menu: + main: + parent: transforms + weight: 100 +--- + +# fieldType + +Set field to specific type, ie cast as float or integer + + + +# example +```yaml + + - fieldType: + t_depth: int + t_ref_count: int + t_alt_count: int + n_depth: int + n_ref_count: int + n_alt_count: int + start: int + +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/transforms/filter.md b/docs/tools/sifter/docs/transforms/filter.md new file mode 100644 index 0000000..f767893 --- /dev/null +++ b/docs/tools/sifter/docs/transforms/filter.md @@ -0,0 +1,40 @@ +--- +title: filter +menu: + main: + parent: transforms + weight: 100 +--- + +# filter + +Filter rows in stream using a number of different methods + +## Parameters + +| name | Type | Description | +| --- | --- | --- | +| field | string (field path) | Field used to match rows | +| value | string (template string) | Template string to match against | +| match | string | String to match against | +| check | string | How to check value, 'exists' or 'hasValue' | +| method | string | Method name | +| python | string | Python code string | +| gpython | string | Python code string run using (https://github.com/go-python/gpython) | + +## Example + +Field based match +```yaml + - filter: + field: table + match: source_statistics +``` + + +Check based match +```yaml + - filter: + field: uniprot + check: hasValue +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/transforms/flatmap.md b/docs/tools/sifter/docs/transforms/flatmap.md new file mode 100644 index 0000000..6de0ed3 --- /dev/null +++ b/docs/tools/sifter/docs/transforms/flatmap.md @@ -0,0 +1,43 @@ +--- +title: flatMap +render_macros: false +--- + +# flatMap + +Flatten an array field into separate messages, each containing a single element of the array. + +## Parameters + +| Parameter | Type | Description | +|-----------|--------|------------| +| `field` | string | Path to the array field to be flattened (e.g., `{{row.samples}}`). 
| +| `dest` | string | Optional name of the field to store the flattened element (defaults to the same field name). | +| `keep` | bool | If `true`, keep the original array alongside the flattened messages. | + +## Example + +```yaml +- flatMap: + field: "{{row.samples}}" + dest: sample +``` + +Given an input message: + +```json +{ "id": "P001", "samples": ["S1", "S2", "S3"] } +``` + +The step emits three messages: + +```json +{ "id": "P001", "sample": "S1" } +{ "id": "P001", "sample": "S2" } +{ "id": "P001", "sample": "S3" } +``` + +## See also + +- [filter](filter.md) – conditionally emit messages. +- [map](map.md) – apply a function to each flattened message. diff --git a/docs/tools/sifter/docs/transforms/from.md b/docs/tools/sifter/docs/transforms/from.md new file mode 100644 index 0000000..da38940 --- /dev/null +++ b/docs/tools/sifter/docs/transforms/from.md @@ -0,0 +1,25 @@ +--- +title: from +menu: + main: + parent: transforms + weight: 100 +--- + +# from + +Start a pipeline from a named input or another pipeline. + +## Parameters + +| Name | Type | Description | +| --- | --- | --- | +| source | string | Name of the input or pipeline to start from | + +## Example + +```yaml +pipelines: + profileProcess: + - from: profileReader +``` \ No newline at end of file diff --git a/docs/tools/sifter/docs/transforms/hash.md b/docs/tools/sifter/docs/transforms/hash.md new file mode 100644 index 0000000..9c036ef --- /dev/null +++ b/docs/tools/sifter/docs/transforms/hash.md @@ -0,0 +1,24 @@ +--- +title: hash +render_macros: false +--- + +# hash + + +# Parameters + +| name | Type | Description | +| --- | --- | --- | +| field | string | Field to store hash value | +| value | string | Templated string of value to be hashed | +| method | string | Hashing method: sha1/sha256/md5 | + +# example + +```yaml + - hash: + value: "{{row.contents}}" + field: contents-sha1 + method: sha1 +``` diff --git a/docs/tools/sifter/docs/transforms/lookup.md b/docs/tools/sifter/docs/transforms/lookup.md new file mode 100644 index 0000000..08cb802 --- /dev/null +++ b/docs/tools/sifter/docs/transforms/lookup.md @@ -0,0 +1,81 @@ +--- +title: lookup +render_macros: false +--- + +# lookup +Using key from current row, get values from a reference source + +## Parameters + +| name | Type | Description | +| --- | --- | --- | +| replace | string (field path) | Field to replace | +| lookup | string (template string) | Key to use for looking up data | +| copy | map[string]string | Copy values from record that was found by lookup. The Key/Value record uses the Key as the destination field and copies the field from the retrieved records using the field named in Value | +| tsv | TSVTable | TSV translation table file | +| json | JSONTable | JSON data file | +| table | LookupTable | Inline lookup table | +| pipeline | PipelineLookup | Use output of a pipeline as a lookup table | + +## Example + +### JSON file based lookup + +The JSON file defined by `params.doseResponseFile` is opened and loaded into memory, using the `experiment_id` field as a primary key. + +```yaml + - lookup: + json: + input: "{{params.doseResponseFile}}" + key: experiment_id + lookup: "{{row.experiment_id}}" + copy: + curve: curve +``` + + +### Pipeline output lookup + +Prepare a table in the pipelines `tableGen`. Then in `recordProcess` use that table, indexed by the field `primary_key` and lookup the value `{{row.table_id}}` to copy in the contents of the `other_data` field from the table and add it to the row as `my_data`. 
+
+```yaml
+
+pipelines:
+
+  tableGen:
+    - from: dataFile
+    # some set of transforms to prepare the data
+    # records look like { "primary_key" : "bob", "other_data": "red" }
+
+  recordProcess:
+    - from: recordFile
+    - lookup:
+        pipeline:
+          from: tableGen
+          key: primary_key
+        lookup: "{{row.table_id}}"
+        copy:
+          my_data: other_data
+
+```
+
+#### Example data:
+tableGen
+```yaml
+{ "primary_key" : "bob", "other_data": "red" }
+{ "primary_key" : "alice", "other_data": "blue" }
+```
+
+recordProcess input
+```yaml
+{"id" : "record_1", "table_id":"alice" }
+{"id" : "record_2", "table_id":"bob" }
+```
+
+recordProcess output
+```yaml
+{"id" : "record_1", "table_id":"alice", "my_data" : "blue" }
+{"id" : "record_2", "table_id":"bob", "my_data" : "red" }
+```
+
diff --git a/docs/tools/sifter/docs/transforms/map.md b/docs/tools/sifter/docs/transforms/map.md
new file mode 100644
index 0000000..6f7ed3d
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/map.md
@@ -0,0 +1,40 @@
+---
+title: map
+menu:
+  main:
+    parent: transforms
+    weight: 100
+---
+
+# map
+
+Run a function on every row
+
+## Parameters
+
+| name | Description |
+| --- | --- |
+| method | Name of function to call |
+| python | Python code to be run |
+| gpython | Python code to be run using GPython |
+
+## Example
+
+```yaml
+  - map:
+      method: response
+      gpython: |
+        def response(x):
+          s = sorted(x["curve"].items(), key=lambda x: float(x[0]))
+          x['dose_um'] = []
+          x['response'] = []
+          for d, r in s:
+            try:
+              dn = float(d)
+              rn = float(r)
+              x['dose_um'].append(dn)
+              x['response'].append(rn)
+            except ValueError:
+              pass
+          return x
+```
\ No newline at end of file
diff --git a/docs/tools/sifter/docs/transforms/objectValidate.md b/docs/tools/sifter/docs/transforms/objectValidate.md
new file mode 100644
index 0000000..c34d45b
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/objectValidate.md
@@ -0,0 +1,23 @@
+---
+title: objectValidate
+render_macros: false
+---
+
+# objectValidate
+
+Use a JSON schema to validate row contents
+
+# parameters
+
+| name | Type | Description |
+| --- | --- | --- |
+| title | string | Title of object to use for validation |
+| schema | string | Path to JSON schema definition |
+
+# example
+
+```yaml
+  - objectValidate:
+      title: Aliquot
+      schema: "{{params.schema}}"
+```
\ No newline at end of file
diff --git a/docs/tools/sifter/docs/transforms/plugin.md b/docs/tools/sifter/docs/transforms/plugin.md
new file mode 100644
index 0000000..8de6b4a
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/plugin.md
@@ -0,0 +1,55 @@
+---
+title: transform plugin
+menu:
+  main:
+    parent: transforms
+    weight: 100
+---
+
+# plugin
+
+Invoke an external program for data processing
+
+## Parameters
+
+| name | Description |
+| --- | --- |
+| commandLine | Command line program to be called |
+
+The command line program can be written in any language. Sifter and the
+plugin communicate via NDJSON: Sifter streams the input to the program via
+STDIN and the plugin returns results via STDOUT. Any logging or additional
+output must be sent to STDERR, or it will interrupt the stream of messages.
+The command line is executed using the base directory of the
+Sifter file as the working directory.
+
+## Example
+
+```yaml
+  - plugin:
+      commandLine: "../../util/calc_fingerprint.py"
+```
+
+In this case, the plugin code is
+
+```python
+#!/usr/bin/env python
+
+import sys
+import json
+from rdkit import Chem
+from rdkit.Chem import AllChem
+
+for line in sys.stdin:
+    row = json.loads(line)
+    if "canonical_smiles" in row:
+        smiles = row["canonical_smiles"]
+        m = Chem.MolFromSmiles(smiles)
+        try:
+            fp = AllChem.GetMorganFingerprintAsBitVect(m, radius=2)
+            fingerprint = list(fp)
+            row["morgan_fingerprint_2"] = fingerprint
+        except Exception:
+            pass
+    # echo every row back to STDOUT, with or without a fingerprint
+    print(json.dumps(row))
+```
diff --git a/docs/tools/sifter/docs/transforms/project.md b/docs/tools/sifter/docs/transforms/project.md
new file mode 100644
index 0000000..6ad7cc9
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/project.md
@@ -0,0 +1,26 @@
+---
+title: project
+render_macros: false
+---
+
+# project
+
+Populate the row with templated values
+
+
+# parameters
+
+| name | Type | Description |
+| --- | --- | --- |
+| mapping | map[string]any | New fields to be generated from template |
+| rename | map[string]string | Rename field (no template engine) |
+
+
+# Example
+
+```yaml
+  - project:
+      mapping:
+        type: sample
+        id: "{{row.sample_id}}"
+```
\ No newline at end of file
diff --git a/docs/tools/sifter/docs/transforms/reduce.md b/docs/tools/sifter/docs/transforms/reduce.md
new file mode 100644
index 0000000..77d18bc
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/reduce.md
@@ -0,0 +1,35 @@
+---
+title: reduce
+menu:
+  main:
+    parent: transforms
+    weight: 100
+---
+
+# reduce
+
+Using a key from the rows, reduce matched records into a single entry
+
+## Parameters
+
+| name | Type | Description |
+| --- | --- | --- |
+| field | string (field path) | Field used to match rows |
+| method | string | Method name |
+| python | string | Python code string |
+| gpython | string | Python code string run using (https://github.com/go-python/gpython) |
+| init | map[string]any | Data to use for first reduce |
+
+## Example
+
+```yaml
+  - reduce:
+      field: dataset_name
+      method: merge
+      init: { "compounds" : [] }
+      gpython: |
+
+        def merge(x, y):
+          x["compounds"] = list(set(y["compounds"] + x["compounds"]))
+          return x
+```
\ No newline at end of file
diff --git a/docs/tools/sifter/docs/transforms/regexReplace.md b/docs/tools/sifter/docs/transforms/regexReplace.md
new file mode 100644
index 0000000..51057e1
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/regexReplace.md
@@ -0,0 +1,7 @@
+---
+title: regexReplace
+menu:
+  main:
+    parent: transforms
+    weight: 100
+---
diff --git a/docs/tools/sifter/docs/transforms/split.md b/docs/tools/sifter/docs/transforms/split.md
new file mode 100644
index 0000000..ffcebf4
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/split.md
@@ -0,0 +1,25 @@
+---
+title: split
+menu:
+  main:
+    parent: transforms
+    weight: 100
+---
+
+# split
+
+Split a field using string `sep`
+## Parameters
+
+| name | Type | Description |
+| --- | --- | --- |
+| field | string | Field to be split |
+| sep | string | String to use for splitting |
+
+## Example
+
+```yaml
+  - split:
+      field: methods
+      sep: ";"
+```
\ No newline at end of file
diff --git a/docs/tools/sifter/docs/transforms/uuid.md b/docs/tools/sifter/docs/transforms/uuid.md
new file mode 100644
index 0000000..204c20f
--- /dev/null
+++ b/docs/tools/sifter/docs/transforms/uuid.md
@@ -0,0 +1,24 @@
+---
+title: uuid
+render_macros: false
+---
+
+# uuid
+
+Generate a UUID for a field.
+
+## Parameters
+
+| Name | Type | Description |
+| --- | --- | --- |
+| field | string | Destination field name for the UUID |
+| value | string | Seed value used to generate the UUID |
+| namespace | string | UUID namespace (optional) |
+
+## Example
+
+```yaml
+  - uuid:
+      field: id
+      value: "{{row.name}}"
+```
\ No newline at end of file
diff --git a/docs/tools/sifter/index.md b/docs/tools/sifter/index.md
new file mode 100644
index 0000000..30ca242
--- /dev/null
+++ b/docs/tools/sifter/index.md
@@ -0,0 +1,191 @@
+---
+title: Sifter
+render_macros: false
+---
+
+
+# Sifter
+
+Sifter is a stream-based processing engine. It comes with a number of
+file extractors that operate as inputs to its pipelines. The pipeline engine
+connects several processing steps together into a directed acyclic graph that
+is processed in parallel.
+
+Example Message:
+
+```json
+{
+  "firstName" : "bob",
+  "age" : "25",
+  "friends" : [ "Max", "Alex"]
+}
+```
+
+Once a stream of messages is produced, it can be run through a transform
+pipeline. A transform pipeline is an array of transform steps; each transform
+step represents a different way to alter the data. The array of transforms links
+together into a pipe that makes multiple alterations to messages as they are
+passed along. There are a number of different transform step types that can
+be used in a transform pipeline; these include:
+
+ - Projection: creating new fields using a templating engine driven by existing values
+ - Filtering: removing messages
+ - Programmatic transformation: altering messages using an embedded Python interpreter
+ - Table based field translation
+ - Outputting the message as a JSON Schema checked object
+
+
+# Pipeline File
+
+A Sifter pipeline file is in YAML format and describes an entire processing pipeline.
+It is composed of the following sections: `params`, `inputs`, `pipelines`, `outputs`. In addition,
+for tracking, the file also includes `name` and `class` entries.
+
+```yaml
+
+class: sifter
+name: