Skip to content

docxology/MetaInformAnt

Repository files navigation

🐜 METAINFORMANT

Comprehensive bioinformatics toolkit for multi-omic analysis

Python 3.11+ License: Apache 2.0 Code style: black Modules Files


Overview

METAINFORMANT provides production-ready bioinformatics analysis across genomics, transcriptomics, proteomics, epigenomics, and systems biology. Built with Python 3.11+ and uv for fast dependency management.

📊 At a Glance

Metric Value
Modules 25 specialized analysis modules
Python Files 343 implementation files
Plot Types 57+ visualization methods
Documentation 130+ README files

🔬 Core Capabilities

Domain Features
DNA Sequences, alignment, phylogenetics, population genetics, variant analysis
RNA Amalgkit integration, SRA downloads, Kallisto quantification, industrial-scale pipelines
GWAS Association testing, fine-mapping, eQTL analysis, colocalization, visualization
Multi-omics Cross-omic integration, joint PCA, correlation analysis
ML Classification, regression, feature selection, LLM integration
Visualization Manhattan plots, heatmaps, networks, animations, publication-ready output

System Architecture

flowchart TB
    subgraph coreInfra[coreInfrastructure]
        CORE[coreUtilities]
    end

    subgraph molecular[molecularAnalysis]
        DNA[dnaAnalysis]
        RNA[rnaAnalysis]
        PROT[proteinAnalysis]
        EPI[epigenomeAnalysis]
    end

    subgraph statsML[statisticalAndML]
        GWAS[gwasAnalysis]
        MATH[mathAnalysis]
        ML[machineLearning]
        INFO[informationAnalysis]
    end

    subgraph systems[systemsBiology]
        NET[networksAnalysis]
        MULTI[multiomicsAnalysis]
        SC[singlecellAnalysis]
        SIM[simulationAnalysis]
    end

    subgraph annotation[annotationAndMetadata]
        ONT[ontologyAnalysis]
        PHEN[phenotypeAnalysis]
        ECO[ecologyAnalysis]
        LE[lifeEventsAnalysis]
    end

    subgraph utilities[utilities]
        QUAL[qualityControl]
        VIZ[visualization]
    end

    CORE --> DNA
    CORE --> RNA
    CORE --> PROT
    CORE --> EPI
    CORE --> GWAS
    CORE --> MATH
    CORE --> ML
    CORE --> INFO
    CORE --> NET
    CORE --> MULTI
    CORE --> SC
    CORE --> SIM
    CORE --> ONT
    CORE --> PHEN
    CORE --> ECO
    CORE --> LE
    CORE --> QUAL
    CORE --> VIZ
Loading

Data Flow and Integration Architecture

graph TD
    ArawBiologicalData[Raw Biological Data] --> BdataIngestion[Data Ingestion]
    B --> C{Data Type}

    C -->|DNA| D[dnaModule]
    C -->|RNA| E[rnaModule]
    C -->|Protein| F[proteinModule]
    C -->|Epigenome| G[epigenomeModule]
    C -->|Phenotype| H[phenotypeModule]
    C -->|Environmental| I[ecologyModule]

    D --> JqualityControl[Quality Control]
    E --> J
    F --> J
    G --> J
    H --> J
    I --> J

    J --> KcoreProcessing[Core Processing]
    K --> L{Analysis Type}

    L -->|Statistical| M[gwasModule]
    L -->|ML| N[mlModule]
    L -->|Information| O[informationModule]
    L -->|Networks| P[networksModule]
    L -->|Systems| Q[multiomicsModule]
    L -->|Singlecell| R[singlecellModule]
    L -->|Simulation| S[simulationModule]

    M --> TresultsIntegration[Results Integration]
    N --> T
    O --> T
    P --> T
    Q --> T
    R --> T
    S --> T

    T --> U[Visualization]
    U --> VpublicationFigures[Publication Figures]
    V --> WscientificInsights[Scientific Insights]

    subgraph "Primary Data Types"
        X[Genomic] -.-> D
        Y[Transcriptomic] -.-> E
        Z[Proteomic] -.-> F
        AA[Epigenetic] -.-> G
    end

    subgraph "Analysis Workflows"
        BBpopulationGenetics[Population Genetics] -.-> M
        CCfeatureSelection[Feature Selection] -.-> N
        DDmutualInformation[Mutual Information] -.-> O
        EEcommunityDetection[Community Detection] -.-> P
        FFjointPca[Joint PCA] -.-> Q
        GGtrajectoryAnalysis[Trajectory Analysis] -.-> R
    end

    subgraph "Output Formats"
        HHmanhattanPlots[Manhattan Plots] -.-> V
        II[Heatmaps] -.-> V
        JJnetworkGraphs[Network Graphs] -.-> V
        KK[Animations] -.-> V
    end
Loading

Multi-Omic Integration Pipeline

graph TD
    Amulti-omicDatasets[Multi-Omic Datasets] --> BsampleAlignment[Sample Alignment]
    B --> CbatchEffectCorrection[Batch Effect Correction]

    C --> D{Integration Strategy}
    D -->|Early| EconcatenatedMatrix[Concatenated Matrix]
    D -->|Late| FseparateModels[Separate Models]
    D -->|Intermediate| G[Meta-Analysis]

    E --> HjointDimensionalityReduction[Joint Dimensionality Reduction]
    F --> IindividualAnalysis[Individual Analysis]
    G --> JresultIntegration[Result Integration]

    H --> KunifiedClustering[Unified Clustering]
    I --> LindividualClustering[Individual Clustering]
    J --> MconsensusClustering[Consensus Clustering]

    K --> NfunctionalEnrichment[Functional Enrichment]
    L --> N
    M --> N

    N --> OpathwayAnalysis[Pathway Analysis]
    O --> PnetworkConstruction[Network Construction]

    P --> QbiologicalInterpretation[Biological Interpretation]
    Q --> RsystemsBiologyInsights[Systems Biology Insights]

    subgraph "Omic Layers"
        S[Genomics] -.-> A
        T[Transcriptomics] -.-> A
        U[Proteomics] -.-> A
        V[Metabolomics] -.-> A
        W[Epigenomics] -.-> A
    end

    subgraph "Integration Methods"
        X[MOFA] -.-> H
        YjointPca[Joint PCA] -.-> H
        ZsimilarityNetworks[Similarity Networks] -.-> H
    end

    subgraph "Biological Outputs"
        AAgeneModules[Gene Modules] -.-> Q
        BBregulatoryNetworks[Regulatory Networks] -.-> Q
        CCdiseasePathways[Disease Pathways] -.-> Q
        DD[Biomarkers] -.-> Q
    end
Loading

Quality Assurance Framework

graph TD
    AdataProcessingPipeline[Data Processing Pipeline] --> BinputValidation[Input Validation]
    B --> CtypeChecking[Type Checking]
    C --> DschemaValidation[Schema Validation]

    D --> EprocessingLogic[Processing Logic]
    E --> FerrorHandling[Error Handling]
    F --> GrecoveryMechanisms[Recovery Mechanisms]

    G --> HoutputValidation[Output Validation]
    H --> IresultVerification[Result Verification]
    I --> JqualityMetrics[Quality Metrics]

    J --> K{Acceptable Quality?}
    K -->|Yes| LpipelineSuccess[Pipeline Success]
    K -->|No| MqualityIssues[Quality Issues]

    M --> NdiagnosticAnalysis[Diagnostic Analysis]
    N --> OerrorClassification[Error Classification]

    O --> P{Recoverable?}
    P -->|Yes| QdataCorrection[Data Correction]
    P -->|No| RpipelineFailure[Pipeline Failure]

    Q --> E
    L --> SvalidatedResults[Validated Results]
    R --> TerrorReporting[Error Reporting]

    subgraph "Validation Layers"
        UdataIntegrity[Data Integrity] -.-> B
        VbusinessLogic[Business Logic] -.-> E
        WstatisticalValidity[Statistical Validity] -.-> H
    end

    subgraph "Quality Controls"
        XunitTests[Unit Tests] -.-> F
        YintegrationTests[Integration Tests] -.-> I
        ZperformanceBenchmarks[Performance Benchmarks] -.-> J
    end

    subgraph "Error Types"
        AAdataErrors[Data Errors] -.-> O
        BBlogicErrors[Logic Errors] -.-> O
        CCsystemErrors[System Errors] -.-> O
        DDexternalErrors[External Errors] -.-> O
    end
Loading

Key Features

  • Multi-Omic Analysis: DNA, RNA, protein, and epigenome data integration
  • Statistical & ML Methods: GWAS, population genetics, machine learning pipelines
  • Single-Cell Genomics: Complete scRNA-seq analysis workflows
  • Network Analysis: Biological networks, pathways, community detection algorithms
  • Visualization Suite: 14 specialized plotting modules with 57+ plot types and publication-quality output
  • Modular Architecture: Individual modules or complete end-to-end workflows
  • Comprehensive Documentation: 70+ README files with technical specifications
  • Implementation Testing: Real methods in tests, no mocks or stubs
  • Quality Assurance: Rigorous validation and error handling throughout
  • Performance Optimization: Efficient algorithms for large-scale biological data

Quick Start

Prerequisites

  • Python 3.11+
  • uv - Fast Python package manager (REQUIRED)
    • Install: curl -LsSf https://astral.sh/uv/install.sh | sh
    • Verify: uv --version

Installation

METAINFORMANT uses uv for all package management. Never use pip directly.

# Clone repository
git clone https://github.com/docxology/metainformant.git
cd metainformant

# Automated setup with uv (recommended - handles FAT filesystems automatically)
bash scripts/package/setup.sh

# Or manual setup with uv
curl -LsSf https://astral.sh/uv/install.sh | sh  # Install uv if needed
uv venv
source .venv/bin/activate  # or /tmp/metainformant_venv/bin/activate on FAT filesystems
uv pip install -e .

Package Management: All Python dependencies are managed via uv:

  • Create venv: uv venv
  • Install packages: uv pip install -e .
  • Run commands: uv run pytest, uv run metainformant --help
  • Sync dependencies: uv sync --extra dev --extra scientific
  • Add dependencies: uv add <package>
  • Remove dependencies: uv remove <package>

Note: Setup scripts automatically detect FAT filesystems (exFAT, FAT32) and configure UV cache and virtual environment locations accordingly. See UV Setup Guide for details.

Quick Example

from metainformant.dna import sequences, composition
from metainformant.visualization import lineplot

# Load DNA sequences
seqs = sequences.read_fasta("data/sequences.fasta")

# Analyze GC content
gc_values = [sequences.gc_content(seq) for seq in seqs.values()]

# Visualize
ax = lineplot(None, gc_values)
ax.set_ylabel("GC Content")
ax.set_title("GC Content Across Sequences")
ax.figure.savefig("output/gc_content.png", dpi=300)

Complete Workflow Demonstration

# Run workflow demo
python3 scripts/core/run_demo.py

# Demonstrates:
# - Configuration management and I/O operations
# - DNA sequence analysis and visualization
# - Quality control and metrics calculation
# - Real data processing with informative output names
# - Complete output organization in output/demo/ directory

See scripts/core/run_demo.py for the workflow demonstration. Outputs are saved to output/demo/ directory with:

  • Workflow configuration files
  • Processed biological data (FASTA sequences, analysis results)
  • Publication-quality visualizations with informative naming
  • Summary reports and metadata

Module Status Overview

Production-Ready Modules

Category Module Status Key Features
Core core/ Complete I/O, config, logging, parallel, cache, validation, workflow orchestration
DNA dna/ Complete Sequences, alignment, phylogeny, population genetics, variant analysis
RNA rna/ Complete AMALGKIT integration, workflow orchestration, expression quantification
Protein protein/ Complete Sequences, structures, AlphaFold, UniProt, functional analysis
GWAS gwas/ Complete Association testing, QC, population structure, visualization
Math math/ Complete Population genetics, coalescent, selection, epidemiology
Visualization visualization/ Complete 57+ plot types, animations, publication-quality output
Ontology ontology/ Complete GO analysis, semantic similarity, functional annotation
Quality quality/ Complete FASTQ analysis, validation, contamination detection

🟡 Functional Modules (Partial Implementation)

Category Module Status Key Features Coverage
ML ml/ 🟡 Partial Classification, regression, feature selection 75%
Networks networks/ 🟡 Partial Graph algorithms, community detection 78%
Multi-Omics multiomics/ 🟡 Partial Integration, joint PCA, correlation 72%
Single-Cell singlecell/ 🟡 Partial Preprocessing, clustering, DE analysis 74%
Epigenome epigenome/ 🟡 Partial Methylation, ChIP-seq, ATAC-seq 76%
Phenotype phenotype/ 🟡 Partial AntWiki integration, trait analysis 79%
Ecology ecology/ 🟡 Partial Community diversity, environmental 77%
Life Events life_events/ 🟡 Partial Event sequences, embeddings 73%
Simulation simulation/ 🟡 Partial Sequence simulation, ecosystems 71%
Information information/ 🟡 Partial Entropy, mutual information 80%

Module Overview

Complete Module Reference

All modules live in src/metainformant/ with documentation in each module's README.md.

Module Files Description Key Components Docs
Core Infrastructure
core/ 26 Shared utilities, I/O, logging, config, parallel processing, caching io.py, config.py, parallel.py README
Molecular Analysis
dna/ 27 DNA sequences, alignment, phylogenetics, population genetics, variants sequences.py, alignment/, population/ README
rna/ 29 RNA-seq workflows, amalgkit integration, expression quantification amalgkit/, engine/, workflow.py README
protein/ 17 Protein sequences, structure analysis, AlphaFold, UniProt integration sequences.py, structure/ README
epigenome/ 8 Methylation analysis, ChIP-seq, ATAC-seq, chromatin accessibility methylation.py, chipseq.py README
Statistical & ML
gwas/ 39 GWAS, fine-mapping, eQTL analysis, colocalization, visualization finemapping/, visualization/, analysis/ README
math/ 20 Population genetics theory, coalescent, selection, epidemiology popgen/, coalescent/ README
ml/ 12 Machine learning pipelines, classification, regression, features classification.py, features/, llm/ README
information/ 14 Information theory, Shannon entropy, mutual information, semantic similarity entropy.py, integration/ README
Systems Biology
networks/ 9 Biological networks, graph algorithms, community detection, pathways analysis/, interaction/ README
multiomics/ 6 Multi-omic integration, joint PCA, cross-omic correlation analysis/, methods/ README
singlecell/ 9 scRNA-seq preprocessing, clustering, differential expression preprocessing.py, clustering.py README
simulation/ 7 Synthetic data, agent-based models, sequence simulation, ecosystems sequences.py, ecosystems.py README
Annotation & Metadata
ontology/ 7 Gene Ontology, functional annotation, semantic similarity go.py, query.py README
phenotype/ 15 Phenotypic data curation, AntWiki integration, trait analysis life_course.py, antwiki/ README
ecology/ 7 Community diversity, environmental correlations, species matrices community.py, environmental.py README
life_events/ 9 Life course analysis, event sequences, temporal embeddings models/, workflow/ README
Utilities
quality/ 4 FASTQ quality assessment, validation, contamination detection fastq.py, metrics.py README
visualization/ 22 57+ plot types, heatmaps, networks, animations, publication-ready heatmap.py, networks.py, animate.py README
Specialized Domains
longread/ 19 Long-read sequencing (PacBio, ONT), assembly, error correction assembly/, quality/ README
metagenomics/ 11 Metagenomic analysis, taxonomic profiling, functional annotation taxonomy/, functional/ README
pharmacogenomics/ 12 Drug-gene interactions, pharmacokinetics, variant interpretation interactions/ README
spatial/ 11 Spatial transcriptomics, tissue mapping, spatial statistics analysis/ README
structural_variants/ 9 SV detection, CNV analysis, breakpoint resolution detection/ README
menu/ 4 Interactive CLI menu system, workflow navigation ui/ README

Total: 25 modules, 343 Python files

Documentation

Quick Links

Module Documentation

Each module has documentation in src/metainformant/<module>/README.md and docs/<module>/.

Scripts & Workflows

The scripts/ directory contains production-ready workflow orchestrators:

  • Package Management: Setup, testing, quality control
  • RNA-seq: Multi-species workflows, amalgkit integration
  • GWAS: Genome-scale association studies
  • Module Orchestrators: ✅ Complete workflow scripts for all domains (core, DNA, RNA, protein, networks, multiomics, single-cell, quality, simulation, visualization, epigenome, ecology, ontology, phenotype, ML, math, gwas, information, life_events)

See scripts/README.md for documentation.

CLI Interface

All modules are accessible via the unified CLI:

# Setup and environment
uv run metainformant setup --with-amalgkit

# Domain workflows
uv run metainformant dna fetch --assembly GCF_000001405.40
uv run metainformant dna align --input data/sequences.fasta --output output/dna/alignment
uv run metainformant dna variants --input data/variants.vcf --format vcf --output output/dna/variants
uv run metainformant rna run --work-dir output/rna --threads 8 --species Apis_mellifera
uv run metainformant rna run-config --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml
uv run metainformant protein taxon-ids --file data/taxon_ids.txt
uv run metainformant protein rmsd-ca --pdb-a data/structure1.pdb --pdb-b data/structure2.pdb
uv run metainformant gwas run --config config/gwas/gwas_template.yaml

# Epigenome and annotation
uv run metainformant epigenome run --methylation data/methylation.tsv --output output/epigenome
uv run metainformant ontology run --go data/go.obo --output output/ontology
uv run metainformant phenotype run --input data/traits.csv --output output/phenotype
uv run metainformant ecology run --input data/species.csv --output output/ecology --diversity

# Analysis and modeling
uv run metainformant math popgen --input data/sequences.fasta --output output/math/popgen
uv run metainformant math coalescent --n-samples 10 --output output/math/coalescent
uv run metainformant information entropy --input data/seqs.fasta --output output/information
uv run metainformant simulation run --model sequences --output output/simulation

# Systems biology
uv run metainformant networks run --input data/interactions.tsv --output output/networks
uv run metainformant multiomics run --genomics data/genomics.tsv --output output/multiomics
uv run metainformant singlecell run --input data/counts.h5ad --output output/singlecell --qc
uv run metainformant quality run --fastq data/reads.fq --output output/quality --analyze-fastq
uv run metainformant ml run --features data/features.csv --output output/ml --classify
uv run metainformant visualization run --input data/matrix.csv --plot-type heatmap --output output/visualization
uv run metainformant life-events embed --input data/events.json --output output/life_events/embeddings

# See all available commands
uv run metainformant --help

See docs/cli.md for CLI documentation.

Usage Examples

DNA Analysis

from metainformant.dna import alignment, population

# Pairwise alignment
align_result = alignment.global_align("ACGTACGT", "ACGTAGGT")
print(f"Score: {align_result.score}")

# Population genetics
sequences = ["ATCGATCG", "ATCGTTCG", "ATCGATCG"]
diversity = population.nucleotide_diversity(sequences)
print(f"π = {diversity:.4f}")

RNA-seq Workflow

from metainformant.rna import AmalgkitWorkflowConfig, plan_workflow, execute_workflow, check_cli_available

# Check if amalgkit is available
available, help_text = check_cli_available()
if not available:
    print(f"Amalgkit not available: {help_text}")

# Configure workflow
config = AmalgkitWorkflowConfig(
    work_dir="output/amalgkit/work",
    threads=8,
    species_list=["Apis_mellifera"]
)

# Plan workflow steps
steps = plan_workflow(config)
print(f"Planned {len(steps)} workflow steps")

# Execute workflow
results = execute_workflow(config)
for step, result in results.items():
    print(f"{step}: exit code {result.returncode}")
# End-to-end workflow for a single species (recommended)
python3 scripts/rna/run_workflow.py --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml

# Check status
python3 scripts/rna/run_workflow.py --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml --status

# Alternative: Bash-based orchestrator
bash scripts/rna/amalgkit/run_amalgkit.sh --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml

GWAS Analysis

from metainformant.gwas import run_gwas, manhattan_plot, load_gwas_config

# Load configuration and run workflow
config = load_gwas_config("config/gwas/gwas_template.yaml")
results = run_gwas(
    vcf_path="data/variants/cohort.vcf.gz",
    phenotype_path="data/phenotypes/traits.tsv",
    config={"association": {"model": "linear"}},
    output_dir="output/gwas"
)

# Visualize results
manhattan_plot(results["association_results"], output_path="output/gwas/manhattan.png")

Visualization

from metainformant.visualization import heatmap, animate_time_series

# Heatmap
heatmap(correlation_matrix, cmap="viridis", annot=True)

# Animation
fig, anim = animate_time_series(time_series_data)
anim.save("output/animation.gif")

Network Analysis

from metainformant.networks import create_network, detect_communities, centrality_measures

# Create network from interactions
network = create_network(edges, directed=False)

# Detect communities
communities = detect_communities(network)

# Calculate centrality
centrality = centrality_measures(network)

Multi-Omics Integration

from metainformant.multiomics import integrate_omics_data, joint_pca

# Integrate multiple omics datasets
multiomics = integrate_omics_data(
    genomics=genomics_data,
    transcriptomics=rna_data,
    proteomics=protein_data
)

# Joint dimensionality reduction
pca_result = joint_pca(multiomics)

Information Theory

from metainformant.information import shannon_entropy, mutual_information, information_content

# Calculate Shannon entropy
probs = [0.5, 0.3, 0.2]
entropy = shannon_entropy(probs)

# Mutual information between sequences
mi = mutual_information(sequence_x, sequence_y)

# Information content for hierarchical terms
ic = information_content(term_frequencies, "GO:0008150")

Life Events Analysis

from metainformant.life_events import EventSequence, Event, analyze_life_course
from datetime import datetime

# Create event sequences
events = [
    Event("degree", datetime(2010, 6, 1), "education"),
    Event("job_change", datetime(2015, 3, 1), "occupation"),
]
sequence = EventSequence(person_id="person_001", events=events)

# Analyze life course
results = analyze_life_course([sequence], outcomes=None)

Protein Analysis

from metainformant.protein import sequences, alignment, structure

# Read protein sequences
proteins = sequences.read_fasta("data/proteins.fasta")

# Pairwise alignment
align_result = alignment.global_align(proteins["seq1"], proteins["seq2"])

# Structure analysis
structure_data = structure.load_pdb("data/structure.pdb")
contacts = structure.analyze_contacts(structure_data)

Epigenome Analysis

from metainformant.epigenome import methylation, chipseq

# Methylation analysis
meth_data = methylation.load_bedgraph("data/methylation.bedgraph")
regions = methylation.find_dmr(meth_data, threshold=0.3)

# ChIP-seq peak calling
peaks = chipseq.call_peaks("data/chipseq.bam", "data/control.bam")

Ontology Analysis

from metainformant.ontology import go, query

# Load Gene Ontology
go_graph = go.load_obo("data/go.obo")

# Query ontology
terms = query.get_ancestors(go_graph, "GO:0008150")
similarity = query.semantic_similarity(go_graph, "GO:0008150", "GO:0008151")

Phenotype Analysis

from metainformant.phenotype import life_course, antwiki

# Life course analysis
traits = life_course.load_traits("data/traits.csv")
curated = life_course.curate_traits(traits)

# AntWiki integration
species_data = antwiki.fetch_species("Pogonomyrmex_barbatus")

Ecology Analysis

from metainformant.ecology import community, environmental

# Community analysis
species_matrix = community.load_matrix("data/species.csv")
diversity = community.calculate_diversity(species_matrix)

# Environmental data
env_data = environmental.load_data("data/environment.csv")
correlations = environmental.analyze_correlations(species_matrix, env_data)

Mathematical Biology

from metainformant.math import popgen, coalescent

# Population genetics
sequences = ["ATCGATCG", "ATCGTTCG", "ATCGATCG"]
fst = popgen.fst(sequences, populations=[0, 0, 1])

# Coalescent simulation
tree = coalescent.simulate_coalescent(n_samples=10, Ne=1000)

Single-Cell Analysis

from metainformant.singlecell import preprocessing, clustering

# Load single-cell data
adata = preprocessing.load_h5ad("data/counts.h5ad")

# Preprocessing
adata = preprocessing.filter_cells(adata, min_genes=200)
adata = preprocessing.normalize(adata)

# Clustering
clusters = clustering.leiden(adata, resolution=0.5)

Quality Control

from metainformant.quality import fastq, metrics

# FASTQ quality assessment
qc_report = fastq.assess_quality("data/reads.fastq")
print(f"Mean quality: {qc_report['mean_quality']}")

# General metrics
quality_score = metrics.calculate_quality(data_matrix)

Machine Learning

from metainformant.ml import classification, features

# Feature extraction
features = features.extract_features(data, method="pca", n_components=50)

# Classification
model = classification.train_classifier(
    X_train, y_train, method="random_forest"
)
predictions = model.predict(X_test)

Simulation

from metainformant.simulation import sequences, ecosystems

# Sequence simulation
sim_seqs = sequences.simulate_sequences(
    n_sequences=100, length=1000, mutation_rate=0.01
)

# Ecosystem simulation
ecosystem = ecosystems.simulate_community(
    n_species=50, interactions="random"
)

Core Utilities

from metainformant.core import io, paths, logging

# I/O operations
data = io.load_json("config/example.yaml")
io.dump_json(results, "output/results.json")

# Path handling
resolved = paths.expand_and_resolve("~/data/input.txt")
is_safe = paths.is_within(resolved, base_path="/safe/directory")

# Logging
logger = logging.get_logger(__name__)
logger.info("Processing data")

Development

Running Tests

# All tests
bash scripts/package/test.sh

# Fast tests only
bash scripts/package/test.sh --mode fast

# Specific module
pytest tests/dna/ -v

Code Quality

# Check code quality
bash scripts/package/uv_quality.sh

# Run linting
ruff check src/

# Type checking
mypy src/metainformant

Project Structure

MetaInformAnt/
├── src/metainformant/       # Main package
│   ├── core/               # Core utilities
│   ├── dna/                # DNA analysis
│   ├── rna/                # RNA analysis
│   ├── protein/            # Protein analysis
│   ├── gwas/               # GWAS analysis
│   └── ...                 # Additional modules
├── scripts/                # Workflow scripts
│   ├── package/            # Package management
│   ├── rna/                # RNA workflows
│   ├── gwas/               # GWAS workflows
│   └── ...                 # Module scripts
├── docs/                   # Documentation
├── tests/                  # Test suite
├── config/                 # Configuration files
├── output/                 # Analysis outputs
└── data/                   # Input data

AI-Assisted Development

This project was developed with AI assistance (grok-code-fast-1 via Cursor) to enhance:

  • Code generation and algorithm implementation
  • Comprehensive documentation
  • Test case generation
  • Architecture design

All AI-generated content undergoes human review. See AGENTS.md for details.

Known Limitations

Module Completeness

Some modules have partial implementations or optional dependencies:

  • Machine Learning: Framework exists; some methods may need completion (see ML Documentation)
  • Multi-omics: Integration methods implemented; additional dependencies may be required
  • Single-cell: Requires scipy, scanpy, anndata (see Single-Cell Documentation)
  • Network Analysis: Algorithms implemented; regulatory network features may need enhancement

GWAS Module

  • Variant Download: Database download (dbSNP, 1000 Genomes) is a placeholder; use SRA-based workflow or provide VCF files
  • Functional Annotation: Requires external tools (ANNOVAR, VEP, SnpEff) for variant annotation
  • Mixed Models: Relatedness adjustment implemented; MLM methods may require GCTA/EMMAX integration

Test Coverage

Some modules have lower test success rates due to optional dependencies:

  • Single-cell: Requires scientific dependencies (scanpy, anndata)
  • Multi-omics: Framework exists, tests may skip without dependencies
  • Network Analysis: Tests pass; features may need additional setup

See Testing Guide for detailed testing documentation and coverage information.

Best Practices

File Naming

  • ✅ Use informative names: sample_pca_biplot_colored_by_treatment.png
  • ❌ Avoid generic names: plot1.png, output.png

Output Organization

  • All outputs in output/ directory
  • Configuration saved with results
  • Visualizations in subdirectories with metadata

No Mocking Policy

  • All tests use implementations
  • No fake/mocked/stubbed methods
  • Real API calls or graceful skips
  • Ensures actual functionality

Requirements

  • Python 3.11+
  • Optional: SRA Toolkit, kallisto (for RNA workflows)
  • Optional: samtools, bcftools, bwa (for GWAS)

Contributing

Contributions are welcome! Please:

  1. Follow the existing code style
  2. Add tests for new features
  3. Update documentation
  4. Use informative commit messages

Recent Improvements

Performance Enhancements

  • Intelligent Caching: Automatic caching for expensive computations (Tajima's constants, entropy calculations)
  • NumPy Vectorization: Optimized mathematical operations for 10-100x performance improvements
  • Progress Tracking: Real-time progress bars for long-running analyses
  • Memory Optimization: Efficient algorithms for large datasets

Enhanced Documentation

  • Comprehensive Tutorials: End-to-end guides for DNA, RNA, GWAS, and information theory workflows
  • Method Comparison Guides: Decision-making guides for choosing analysis algorithms
  • Extended FAQ: Troubleshooting and usage guidance for common scenarios
  • Standardized Docstrings: Consistent formatting with examples and DOI citations

Testing & Reliability

  • Expanded Test Coverage: 37+ new comprehensive tests with real implementations
  • Validation Enhancements: Improved parameter validation and error handling
  • Cross-Platform Compatibility: Python 3.14 support and external drive optimization
  • Integration Testing: Verified cross-module functionality

New Features

  • Enhanced GWAS Visualization: Complete visualization suite for population structure, effects, and comparisons
  • Information Theory Workflows: Batch processing with progress tracking
  • Protein Proteome Analysis: Taxonomy ID processing and proteome utilities
  • Advanced Error Handling: Structured error reporting with actionable guidance

Citation

If you use METAINFORMANT in your research, please cite this repository:

@software{metainformant2025,
  author = {MetaInformAnt Development Team},
  title = {MetaInformAnt: Comprehensive Bioinformatics Toolkit},
  year = {2025},
  url = {https://github.com/docxology/MetaInformAnt},
  version = {0.2.0}
}

License

This project is licensed under the Apache License, Version 2.0 - see LICENSE for details.

Contact

Acknowledgments

  • Developed with AI assistance from Cursor's Code Assistant (grok-code-fast-1)
  • Built on established bioinformatics tools and libraries
  • Community contributions and feedback

Status: Active Development | Version: 0.2.0 | Python: 3.11+ | License: Apache 2.0

About

Integration for Ants and Beyond

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 5

Languages