Batch ProtParam

Batch calculation of protein physicochemical properties from FASTA files using Biopython.

This script reproduces many of the metrics provided by the ExPASy ProtParam web tool but allows high-throughput analysis of hundreds to hundreds of thousands of sequences locally without using a web browser.

Inspired by the ExPASy ProtParam tool:
https://web.expasy.org/protparam/

The output is written as CSV files that open directly in Excel, R, or Python.


Features

For each protein sequence the script calculates:

  • Amino acid counts
  • Amino acid percentages
  • Molecular weight
  • Aromaticity
  • Theoretical isoelectric point (pI)
  • Secondary structure fraction
    • helix
    • turn
    • sheet
  • GRAVY (hydrophobicity score)
  • Instability index
  • Flexibility statistics
    • mean
    • minimum
    • maximum
    • standard deviation

Additional metadata columns include:

  • source FASTA file
  • warnings (e.g., dropped residues)
  • error handling status
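
The sequence-level metrics listed above map onto methods of Biopython's `ProteinAnalysis` class. The following is a minimal sketch of how one output row could be assembled, not the script's actual code. The helper name `protparam_row`, the 0-100 percentage scale, and the use of the sample standard deviation are assumptions. It expects a non-empty sequence of standard residues.

```python
import statistics


def protparam_row(seq_id, sequence):
    """Return one CSV-style row of ProtParam metrics (hypothetical helper)."""
    # Lazy import: Biopython is only needed when the function is called.
    from Bio.SeqUtils.ProtParam import ProteinAnalysis

    analysis = ProteinAnalysis(sequence)
    length = len(sequence)
    helix, turn, sheet = analysis.secondary_structure_fraction()
    flex = analysis.flexibility()  # sliding-window scores; empty for very short sequences

    row = {"seq_id": seq_id, "length_aa": length}
    for aa, n in sorted(analysis.count_amino_acids().items()):
        row[f"count_{aa}"] = n
        row[f"pct_{aa}"] = 100.0 * n / length  # assumption: percent on a 0-100 scale
    row.update(
        molecular_weight=analysis.molecular_weight(),
        aromaticity=analysis.aromaticity(),
        theoretical_pi=analysis.isoelectric_point(),
        ss_helix=helix,
        ss_turn=turn,
        ss_sheet=sheet,
        gravy=analysis.gravy(),
        instability_index=analysis.instability_index(),
        flex_mean=statistics.mean(flex) if flex else None,
        flex_min=min(flex) if flex else None,
        flex_max=max(flex) if flex else None,
        # assumption: sample standard deviation, which needs at least two values
        flex_stdev=statistics.stdev(flex) if len(flex) > 1 else None,
    )
    return row
```

Calling `protparam_row("demo", "ACDEFGHIKLMNPQRSTVWY")` returns a dictionary whose keys follow the column names used in the CSV output.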

Installation

Requires Python 3.8+

Install dependency:

pip install biopython

Or install from requirements:

pip install -r requirements.txt

Example requirements.txt:

biopython

Example Folder Structure

project_folder/
│
├── batchProtParam.py
├── fastas/
│   ├── proteins1.fasta
│   └── proteins2.fasta

Basic Usage

Run from the project directory:

python batchProtParam.py --in_dir ./fastas --out_dir ./results

This produces:

results/
    proteins1.protparam.csv
    proteins2.protparam.csv
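
The naming pattern shown above (input file stem plus `.protparam.csv`) can be sketched with `pathlib`. The function name `csv_path_for` is hypothetical and only illustrates the mapping; the script may build its paths differently.

```python
from pathlib import Path


def csv_path_for(fasta_path, out_dir):
    """Map e.g. fastas/proteins1.fasta to <out_dir>/proteins1.protparam.csv."""
    # Path.stem drops the final extension, so "proteins1.fasta" becomes "proteins1".
    return Path(out_dir) / f"{Path(fasta_path).stem}.protparam.csv"
```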

Output Modes

One CSV per FASTA (default)

python batchProtParam.py \
  --in_dir ./fastas \
  --out_dir ./results \
  --output_mode per_fasta

One combined CSV for all FASTAs

python batchProtParam.py \
  --in_dir ./fastas \
  --out_dir ./results \
  --output_mode all_fastas

Optional custom filename:

python batchProtParam.py \
  --in_dir ./fastas \
  --out_dir ./results \
  --output_mode all_fastas \
  --all_fastas_name my_results.csv
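
For readers wiring the script into their own tooling, the output-related flags above could be declared with `argparse` roughly as follows. This is an illustrative sketch under stated assumptions, not the script's actual parser: whether `--in_dir` and `--out_dir` are required, and what `--all_fastas_name` defaults to, are guesses.

```python
import argparse


def build_parser():
    """Build a parser for the output-related flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Batch ProtParam CLI sketch")
    # assumption: both directories are mandatory
    parser.add_argument("--in_dir", required=True, help="folder containing FASTA files")
    parser.add_argument("--out_dir", required=True, help="folder that receives the CSV files")
    parser.add_argument(
        "--output_mode",
        choices=["per_fasta", "all_fastas"],
        default="per_fasta",  # the README states per_fasta is the default
        help="one CSV per FASTA, or one combined CSV",
    )
    # assumption: no name is set unless the user supplies one
    parser.add_argument("--all_fastas_name", default=None, help="filename of the combined CSV")
    return parser
```

With this sketch, `build_parser().parse_args(["--in_dir", "./fastas", "--out_dir", "./results"])` yields `output_mode == "per_fasta"`.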

Handling Ambiguous Amino Acids

Sequences sometimes contain non-standard residues.

Residue   Meaning
X         unknown
B         D or N
Z         E or Q
U         selenocysteine
O         pyrrolysine

You can control how these are handled.

Default (recommended)

--ambiguous drop

Removes non-standard residues before calculation.

Strict mode

--ambiguous fail

Skips any sequence that contains a non-standard residue.

Advanced mode

--ambiguous keep

Keeps residues unchanged (may cause calculation errors).
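
The three modes can be pictured with a small pure-Python sketch. The function name `handle_ambiguous`, the warning wording, and the choice to treat every character outside the 20 standard residues as non-standard are assumptions for illustration, not the script's actual behaviour.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def handle_ambiguous(sequence, mode="drop"):
    """Return (sequence or None, warning text) for the chosen mode."""
    if mode not in ("drop", "fail", "keep"):
        raise ValueError(f"unknown mode: {mode!r}")
    seq = sequence.upper()
    n_bad = sum(1 for aa in seq if aa not in STANDARD_RESIDUES)
    if n_bad == 0 or mode == "keep":
        return seq, ""  # nothing to do, or the caller asked to keep everything
    if mode == "fail":
        return None, f"skipped: {n_bad} non-standard residue(s)"
    # mode == "drop": keep only the 20 standard residues
    cleaned = "".join(aa for aa in seq if aa in STANDARD_RESIDUES)
    return cleaned, f"dropped {n_bad} non-standard residue(s)"
```

In this sketch, `handle_ambiguous("MKXBT", "drop")` yields the cleaned sequence `"MKT"` together with a warning, while `"fail"` returns `None` so the caller can skip the record.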


Example Output Columns

seq_id
length_aa
count_A
pct_A
molecular_weight
aromaticity
theoretical_pi
ss_helix
ss_turn
ss_sheet
gravy
instability_index
flex_mean
flex_min
flex_max
flex_stdev
source_fasta
warnings
status
error_type
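
Because the output is plain CSV, it can be post-processed with the standard library alone. The sketch below reads a result file and lists the sequences that carry a warning. The sample rows, including the `ok` status value and all numbers, are invented placeholders to make the example runnable and are not real output.

```python
import csv
import io

# Placeholder data using a subset of the columns listed above (values are invented).
SAMPLE_CSV = """seq_id,length_aa,molecular_weight,gravy,warnings,status
protA,120,13250.4,-0.35,,ok
protB,98,10877.1,0.12,dropped 2 non-standard residue(s),ok
"""

NUMERIC_COLUMNS = ("length_aa", "molecular_weight", "gravy")


def load_rows(handle):
    """Read result rows and convert the numeric columns to float."""
    rows = []
    for row in csv.DictReader(handle):
        for col in NUMERIC_COLUMNS:
            if row.get(col):  # leave empty cells as empty strings
                row[col] = float(row[col])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
flagged = [r["seq_id"] for r in rows if r["warnings"]]
```

For a real file, replace the `io.StringIO` handle with `open("results/proteins1.protparam.csv", newline="")`.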

Why Use This Script?

The official ProtParam web server is useful for analyzing individual proteins but becomes impractical for large datasets.

This script enables:

  • High-throughput proteome analysis
  • Automated pipelines
  • Reproducible workflows
  • Integration with Python, R, or spreadsheet analysis

Citation

This tool relies on Biopython:

Cock, P. J. A., et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423.


License

MIT License