mpc-bioinformatics/autoQuaC

autoQuaC — automated QC pipeline orchestration

autoQuaC Watcher monitors one or more input folders for new mass-spectrometry files (e.g., *.raw), waits until candidates are stable (unchanged size across consecutive scans), copies them into a temporary working directory, generates a mcquac.json params file from a template, and launches Nextflow using the Docker profile. On completion, it delivers results to the configured output folder and prevents reprocessing.

In short: drop → process → deliver → don’t reprocess.


Features

  • Watcher for any number of input folders (supports glob patterns like *std.raw).
  • Robust candidate check: only files that remain unchanged (same size in ≥2 consecutive scans) are picked up.
  • Automatic job creation:
    • copies candidates to tmp/<hash>/input/
    • generates mcquac.json from a template
    • injects FASTA & spike-in paths and replaces placeholders
  • Nextflow runner (Docker profile) with status markers (.ready, .working, .finish, .error) and Nextflow logs.
  • Post-processing & delivery:
    • on success: selects the best *.hdf5 from tmp/<hash>/output/** and writes <output>/<SRC_STEM>.hdf5
    • on failure: if no .hdf5 exists, writes <output>/<SRC_STEM>.error.log (copied from .nextflow.log)
    • updates ignore.txt to prevent reprocessing
  • Optional SMB mounts with version fallback (tries SMB 3.1.1 → 3.0 → 2.1).
  • Per-io_pair McQuaC template: each io_pairs[*] entry can optionally define mcquac_template (default: mcquac.json).
  • Robust SMB handling: mount checks and automatic remount/retry attempts are performed before scanning inputs and before delivering results to the output share.
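
The candidate check described above (same size in at least two consecutive scans) can be sketched as follows. This is only an illustration under stated assumptions, not the code in src/search.py: the class name StabilityTracker and its method names are hypothetical.

```python
from pathlib import Path


class StabilityTracker:
    """Hypothetical helper: a file counts as stable once its size repeats."""

    def __init__(self, required_scans: int = 2) -> None:
        self.required_scans = required_scans
        # path -> (last seen size, consecutive scans with that size)
        self._seen: dict[str, tuple[int, int]] = {}

    def scan(self, folder: Path, pattern: str) -> list[Path]:
        """Scan the folder once and return the files that are stable now."""
        stable: list[Path] = []
        current: dict[str, tuple[int, int]] = {}
        for path in sorted(folder.glob(pattern)):
            if not path.is_file():
                continue
            size = path.stat().st_size
            last_size, count = self._seen.get(str(path), (-1, 0))
            # Restart the counter whenever the size changed since the last scan.
            count = count + 1 if size == last_size else 1
            current[str(path)] = (size, count)
            if count >= self.required_scans:
                stable.append(path)
        # Files that disappeared are dropped from the tracking table.
        self._seen = current
        return stable
```

Calling scan() once per polling interval means a file that is still being written never becomes a candidate, because its size keeps resetting the counter.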

Requirements

  • Debian/Ubuntu-like system (WSL works as well).
  • Docker Engine + Docker Compose plugin.
  • Java ≥ 11 (required by Nextflow).
  • Python 3.10+ (uses modern type annotations like X | Y).
  • For SMB/network shares: cifs-utils (for mount.cifs).

The included setup.sh can help install prerequisites, download Nextflow into the project, clone the McQuaC repository, and generate/update config/app.json.


Quick start

  1. Prepare the repo (if not done yet):
    chmod +x ./setup.sh
    ./setup.sh
    
    

The script can install Docker/Java (if missing), download Nextflow locally, clone McQuaC, and create/update config/app.json with proper paths.

  2. Adjust configuration:

    • Review config/app.json (see below), especially mcquac_path, nextflow_bin, io_pairs, and optional mounts.
    • Place your FASTA under config/fasta/ (top level, e.g., human.fasta).
    • Place your spike-in table under config/spike/ (top level, *.csv).
    • (Optional) Add additional templates under config/ (e.g., mcquac_dda.json) and select them per io_pair.
  3. Start the watcher:

    python3 main.py
    # or make it executable: ./main.py
  4. Check results:

    • Logs: tmp/<hash>/logs/nextflow-YYYYMMDD-HHMMSS.log (and .nextflow.log in tmp/<hash>/)
    • On success: <output>/<SRC_STEM>.hdf5
    • On error (no .hdf5 produced): <output>/<SRC_STEM>.error.log

Configuration

config/app.json

Minimal example:

{
  "interval_minutes": 6,
  "default_pattern": "*std.raw",
  "mcquac_path": "/path/to/McQuaC/main.nf",
  "nextflow_bin": "/path/to/nextflow",
  "mounts": [],
  "continue_on_mount_error": false,
  "unmount_on_exit": true,
  "io_pairs": [
    {
      "input": "/data/in",
      "output": "/data/out",
      "pattern": "*std.raw",
      "mcquac_template": "mcquac_dda.json"
    }
  ]
}

Fields:

  • interval_minutes — watcher polling interval in minutes (internally converted to seconds).

  • default_pattern — glob used when an io_pair doesn’t specify its own pattern.

  • mcquac_path — path to the McQuaC main.nf.

  • nextflow_bin (optional) — path to the Nextflow binary. Resolution order: $NEXTFLOW_BIN → app.json:nextflow_bin → local ./nextflow → PATH.

  • mounts (optional) — SMB share definitions (see below).

  • continue_on_mount_error (optional, bool) — continue even if a mount fails.

  • unmount_on_exit (optional, bool) — unmount shares when shutting down.

  • io_pairs — list of objects: { input, output, pattern?, mcquac_template? }

    • input (string) — source folder to scan for new files
    • output (string) — target folder where final results are written
    • pattern (string, optional) — glob pattern to filter input files
    • mcquac_template (string, optional) — filename of a McQuaC template located in ./config/ (default: mcquac.json)
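
How the per-pair defaults listed above might be applied when reading config/app.json is sketched below. The names IoPair, parse_io_pairs and poll_interval_seconds are hypothetical, and the "*.raw" fallback used when default_pattern is missing is an assumption; the real validation lives in src/load_config.py and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoPair:
    """Hypothetical, fully resolved view of one io_pairs entry."""
    input: str
    output: str
    pattern: str
    mcquac_template: str


def parse_io_pairs(config: dict) -> list[IoPair]:
    """Fill in pattern and mcquac_template defaults for every io_pair."""
    # Assumption: "*.raw" is used if default_pattern itself is absent.
    default_pattern = config.get("default_pattern", "*.raw")
    pairs = []
    for entry in config.get("io_pairs", []):
        pairs.append(IoPair(
            input=entry["input"],
            output=entry["output"],
            pattern=entry.get("pattern") or default_pattern,
            mcquac_template=entry.get("mcquac_template") or "mcquac.json",
        ))
    return pairs


def poll_interval_seconds(config: dict) -> float:
    """interval_minutes is given in minutes; sleeping needs seconds."""
    return float(config["interval_minutes"]) * 60.0
```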

SMB / network shares (mounts)

Example with domain and extra options:

{
  "mounts": [
    {
      "name": "archive",
      "host": "192.168.10.20",
      "share": "Archiv",
      "mountpoint": "/mnt/archive",
      "username": "user123",
      "password": "SECRET",
      "domain": "ACME",
      "vers": null,
      "file_mode": "0664",
      "dir_mode": "0775",
      "extra_opts": ["noserverino"]
    }
  ]
}

Notes:

  • Mounting requires root privileges (run the watcher with sudo) and cifs-utils.
  • The mounter checks mountpoint health and tries SMB versions 3.1.1 → 3.0 → 2.1.
  • Security: passwords are stored in clear text in config/app.json. Restrict access to the config/ folder accordingly.
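
The version fallback can be pictured as building one mount command per SMB dialect and trying them in order. The sketch below only constructs the command lines and does not run them. It assumes that "vers": null means "try all versions" and that a non-null value pins a single version; the function name build_mount_commands is hypothetical. The password is deliberately left out of the option string, since mount.cifs can also read it from a credentials file (credentials= option).

```python
SMB_VERSIONS = ("3.1.1", "3.0", "2.1")  # fallback order from the notes above


def build_mount_commands(mount: dict) -> list[list[str]]:
    """Return one `mount -t cifs` argv per SMB version to try, in order."""
    # Assumption: a null/absent "vers" enables the fallback chain.
    versions = [mount["vers"]] if mount.get("vers") else list(SMB_VERSIONS)
    source = f"//{mount['host']}/{mount['share']}"
    commands = []
    for vers in versions:
        opts = [f"username={mount['username']}"]
        if mount.get("domain"):
            opts.append(f"domain={mount['domain']}")
        opts.append(f"vers={vers}")
        if mount.get("file_mode"):
            opts.append(f"file_mode={mount['file_mode']}")
        if mount.get("dir_mode"):
            opts.append(f"dir_mode={mount['dir_mode']}")
        opts.extend(mount.get("extra_opts") or [])
        commands.append(
            ["mount", "-t", "cifs", source, mount["mountpoint"], "-o", ",".join(opts)]
        )
    return commands
```

A caller would run each command in turn (for example with subprocess.run) and stop at the first one that exits with status 0.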

Input/output pairs (io_pairs)

Each pair defines a watched folder (input) and a final target (output).

  • The target folder maintains an ignore.txt file. After a successful run, the source base filename is appended there to prevent re-processing.

    • Remove a line from ignore.txt to force a re-run for that file.
  • Each watcher thread also keeps a thread-local ignore file under tmp/ to avoid duplicates while the process is running.
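
A minimal sketch of the ignore.txt bookkeeping, assuming one exact base filename per line (the real matching may also support patterns). The function names load_ignored and mark_processed are hypothetical.

```python
from pathlib import Path


def load_ignored(ignore_file: Path) -> set[str]:
    """Read ignore.txt into a set of names; a missing file means nothing is ignored."""
    if not ignore_file.exists():
        return set()
    lines = ignore_file.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_processed(ignore_file: Path, source: Path) -> None:
    """Append the source's base filename once, so it is skipped on later scans."""
    if source.name in load_ignored(ignore_file):
        return
    ignore_file.parent.mkdir(parents=True, exist_ok=True)
    with ignore_file.open("a", encoding="utf-8") as fh:
        fh.write(source.name + "\n")
```

With this layout, deleting a line from ignore.txt is enough to make the corresponding file eligible again on the next scan.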


Per-io_pair McQuaC template (mcquac_template)

Each io_pair can optionally specify its own McQuaC configuration template via mcquac_template, for example:

"mcquac_template": "mcquac_dda.json"

Rules:

  • The template file must exist under ./config/<template>, where <template> is the full filename including the .json extension (example: ./config/mcquac_dda.json).
  • If mcquac_template is not set, ./config/mcquac.json is used by default.

Templates are plain JSON files and may contain placeholders (see below).


FASTA & spike-in

Put files at the top level of these folders:

config/
├── fasta/   # *.fasta (the "newest" file is used automatically)
└── spike/   # *.csv   (same rule)

During job creation:

  • the newest *.fasta is injected as main_fasta_file
  • the newest *.csv is injected as main_spike_file

Supported placeholders inside the template (strings are replaced automatically):

  • %%%INPUT%%% / %%%OUTPUT%%% (template → job params)
  • %%%FASTA%%% (and legacy variants) → resolved FASTA path
  • %%%SPIKE%%% (and legacy variants) → resolved spike-in CSV path
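
Placeholder replacement can be sketched as a recursive walk over the parsed template that substitutes inside every string value. This is an illustration rather than the code in src/job_creater.py; the function name fill_placeholders and the key example_input_key are hypothetical, and legacy placeholder variants are not covered.

```python
def fill_placeholders(node, values: dict[str, str]):
    """Return a copy of a parsed JSON template with placeholders substituted."""
    if isinstance(node, str):
        for placeholder, value in values.items():
            node = node.replace(placeholder, value)
        return node
    if isinstance(node, list):
        return [fill_placeholders(item, values) for item in node]
    if isinstance(node, dict):
        return {key: fill_placeholders(val, values) for key, val in node.items()}
    # Numbers, booleans and null pass through unchanged.
    return node


template = {
    "example_input_key": "%%%INPUT%%%",  # hypothetical key name
    "main_fasta_file": "%%%FASTA%%%",
    "main_spike_file": "%%%SPIKE%%%",
}
params = fill_placeholders(template, {
    "%%%INPUT%%%": "/tmp/job/input",
    "%%%FASTA%%%": "/cfg/fasta/human.fasta",
    "%%%SPIKE%%%": "/cfg/spike/spikes.csv",
})
```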

How it works

  1. Watch — For each io_pair, a thread scans its input folder on an interval (interval_minutes). A file becomes a candidate only if it appears with the same size in at least two consecutive scans and is not matched by either ignore file (thread-local tmp/... and output/ignore.txt).

  2. Copy & job — Each candidate is copied to tmp/<hash>/input/. For each hash, the system creates:

    • mcquac.json (generated from ./config/<template>, where <template> is the filename from io_pair.mcquac_template, or mcquac.json by default; FASTA/spike-in paths are injected and placeholders are replaced)
    • info.json (metadata: paths, source, watch context)
    • .ready (signal for the runner)
  3. Run — The Nextflow runner consumes jobs from tmp/*/.ready, resolves Nextflow (order: $NEXTFLOW_BIN → app.json:nextflow_bin → local ./nextflow → PATH), and starts:

    nextflow run -profile docker <mcquac main.nf> -params-file mcquac.json

    Logs are written to tmp/<hash>/logs/. Status files are updated from .ready → .working → .finish (or .error).

  4. Post-processing & delivery

    • Success (rc == 0)

      • Searches tmp/<hash>/output/** for *.hdf5 (recursively) and selects the best one (newest, then largest).
      • Copies it directly to the final output folder as: <output>/<SRC_STEM>.hdf5.
      • Updates ignore.txt.
      • Empties tmp/<hash>/{input,output,work} and marks the job as delivered (.delivered).
    • Failure (rc != 0)

      • If no .hdf5 was produced, copies tmp/<hash>/.nextflow.log to: <output>/<SRC_STEM>.error.log.
      • Marks the job as delivered (.delivered) once the log copy succeeds.

    If the output share is temporarily unavailable (e.g., SMB disconnect), mount checks/remount attempts are performed and delivery is retried until it succeeds.

Important: main.py currently clears ./tmp on startup. To keep “pending delivery” jobs across restarts or reboots, comment out the nuke_tmp() call in main.py.


Run it

  • Direct:

    python3 main.py
  • With a virtual env (optional):

    python3 -m venv .venv
    source .venv/bin/activate
    python3 main.py
  • Test Nextflow (optional):

    ./nextflow run hello
    ./nextflow run hello -with-docker

Run as a systemd service

You can run autoQuaC Watcher as a long-running background service using systemd.

Important

  • If you do not use SMB mounts ("mounts": [] in config/app.json), you can run it as a user service.
  • If you use SMB mounts, the watcher needs to call mount.cifs and therefore must run as root, as a system service.

In the examples below, replace /path/to/mcquac-watcher with the actual project directory and adapt the Python/venv path as needed.

Option A: User service (no SMB mounts)

Use this if mounts in config/app.json is empty and you do not mount any network shares from the watcher itself.

  1. Create a user unit:

    mkdir -p ~/.config/systemd/user
    nano ~/.config/systemd/user/mcquac-watcher.service
  2. Paste the following unit file:

    [Unit]
    Description=MCQuaC Watcher (user service)
    After=network-online.target docker.service
    Wants=network-online.target
    
    [Service]
    Type=simple
    WorkingDirectory=/path/to/mcquac-watcher
    ExecStart=/path/to/mcquac-watcher/.venv/bin/python -u main.py
    Restart=always
    RestartSec=10
    
    StandardOutput=journal
    StandardError=journal
    
    [Install]
    WantedBy=default.target
  3. Reload and start:

    systemctl --user daemon-reload
    systemctl --user start mcquac-watcher.service
    systemctl --user enable mcquac-watcher.service
  4. View logs:

    journalctl --user-unit mcquac-watcher.service -f

On WSL you must have systemd enabled and the distro restarted so that systemctl --user works.


Option B: System/root service (with SMB mounts)

Use this if you configured any mounts in config/app.json. Mounting CIFS shares requires root privileges.

  1. Create a system unit as root:

    sudo nano /etc/systemd/system/mcquac-watcher.service
  2. Paste the following unit file:

    [Unit]
    Description=MCQuaC Watcher (root + SMB mounts)
    After=network-online.target docker.service
    Wants=network-online.target
    
    [Service]
    Type=simple
    WorkingDirectory=/path/to/mcquac-watcher
    ExecStart=/path/to/mcquac-watcher/.venv/bin/python -u main.py
    Restart=always
    RestartSec=10
    
    # Example if you want to force a specific Nextflow binary:
    # Environment="NEXTFLOW_BIN=/path/to/mcquac-watcher/nextflow"
    
    [Install]
    WantedBy=multi-user.target
  3. Reload and start:

    sudo systemctl daemon-reload
    sudo systemctl start mcquac-watcher.service
    sudo systemctl enable mcquac-watcher.service
  4. View logs:

    sudo journalctl -u mcquac-watcher.service -f

Troubleshooting

  • Docker daemon not reachable — ensure the service is running and your user is in the docker group. Log out/in or run newgrp docker.

  • nextflow not found — set nextflow_bin in config/app.json or export $NEXTFLOW_BIN.

  • main.nf missing — verify mcquac_path and that the McQuaC repo/branch is cloned correctly.

  • SMB mount errors — run with sudo, install cifs-utils, check network/port 445; optionally set continue_on_mount_error: true.

  • No watchers active — ensure io_pairs in config/app.json is a non-empty list.

  • No output delivered / network share flaky — delivery is retried automatically. If you reboot, make sure you do not wipe tmp/ on startup (comment out nuke_tmp() in main.py).

  • Stuck temp state / cleanup — clear the working area:

    python3 -m src.clear
    # or run the helper directly if you prefer:
    # python3 src/clear.py

Project layout

project/
├── main.py
├── src/
│   ├── load_config.py      # read & validate config/app.json (incl. mounts, nextflow_bin)
│   ├── search.py           # watcher thread (stable candidates via repeated scans)
│   ├── size.py             # file listing & size helper with glob + ignore patterns
│   ├── copier.py           # copy candidates → tmp/<hash>; generate mcquac.json/info.json/.ready
│   ├── job_creater.py      # apply %%%INPUT%%% / %%%OUTPUT%%% into mcquac.json from template
│   ├── mcquac_runner.py    # consume .ready, run Nextflow, deliver results, write .delivered
│   ├── mounter.py          # optional SMB mounting from config
│   └── clear.py            # `nuke_tmp()` to clean ./tmp
├── config/
│   ├── app.json            # main configuration
│   ├── mcquac.json         # default McQuaC template (placeholders supported)
│   ├── mcquac_*.json       # optional additional templates (selectable via mcquac_template)
│   ├── fasta/              # *.fasta (top level)
│   └── spike/              # *.csv   (top level)
├── tmp/                    # working directory (hash folders)
└── setup.sh                # optional bootstrap script

License

--------------------------------------------------------------------------------
                                   autoQuaC
--------------------------------------------------------------------------------
Copyright 2021, Ruhr University Bochum, Medizinisches Proteom-Center

This software is released under a three-clause BSD license:

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice, this
  list of conditions and the following disclaimer in the documentation and/or
  other materials provided with the distribution.

* Neither the name of any author or any participating institution may be used to
  endorse or promote products derived from this software without specific prior
  written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Acknowledgments

  • McQuaC (mpc-bioinformatics) for the underlying QC pipeline.
