tseemann/mlst

mlst

Scan contig files against traditional PubMLST typing schemes

Quick Start

% mlst contigs.fa
contigs.fa  neisseria  11149  abcZ(672) adk(3) aroE(4) fumC(3) gdh(8) pdhC(4) pgm(6)

% mlst genome.gbk.gz
genome.gbk.gz  sepidermidis  184  arcC(16) aroE(1) gtr(2) mutS(1) pyrR(2) tpiA(1) yqiL(1)

% mlst --full new.fa
FILE    SCHEME       ST  STATUS  SCORE  ALLELES
new.fa  mgenitalium  -   NOVEL   90      MLST_adk(7);MLST_atpA(1);MLST_gmk(1);MLST_gyrB(1);MLST_pgm(3);MLST_ppa(1)

% mlst --label Anthrax GCF_001941925.1_ASM194192v1_genomic.fna.bz2
Anthrax  bcereus  -  glp(24) gmk(1) ilv(~83) pta(1) pur(~71) pyc(37) tpi(41)

% mlst --nopath /opt/data/refseq/S_pyogenes/*.fna
NC_018936.fna  spyogenes  28   gki(4)   gtr(3)   murI(4)   mutS(4)  recP(4)    xpt(2)   yqiL(4)
NC_017596.fna  spyogenes  11   gki(2)   gtr(6)   murI(1)   mutS(2)  recP(2)    xpt(2)   yqiL(2)
NC_008022.fna  spyogenes  55   gki(11)  gtr(9)   murI(1)   mutS(9)  recP(2)    xpt(3)   yqiL(4)
NC_006086.fna  spyogenes  382  gki(5)   gtr(52)  murI(5)   mutS(5)  recP(5)    xpt(4)   yqiL(3)
NC_008024.fna  spyogenes  -    gki(5)   gtr(11)  murI(8)   mutS(5)  recP(15?)  xpt(2)   yqiL(1)
NC_017040.fna  spyogenes  172  gki(56)  gtr(24)  murI(39)  mutS(7)  recP(30)   xpt(2)   yqiL(33)

% mlst --full --fofn files.txt --csv --outfile mlst.csv
# data saved in 'mlst.csv'

Installation

Conda

If you are using Conda

% conda install -c conda-forge -c bioconda  mlst

Source

% cd $HOME
% git clone https://github.com/tseemann/mlst.git
% $HOME/mlst/bin/mlst --help

Usage

Give it a genome file in FASTA/GenBank/EMBL format, optionally compressed with gzip, zip or bzip2.

% mlst contigs.fa
contigs.fa  neisseria  11149  abcZ(672) adk(3) aroE(4) fumC(3) gdh(8) pdhC(4) pgm(6)

It returns a tab-separated line containing

  • the filename
  • the matching PubMLST scheme name
  • the ST (sequence type)
  • the allele IDs
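As a rough illustration of how those fields can be picked apart downstream, the sketch below feeds one line of default output (copied from the example above, rebuilt with real tabs) through awk. The function name `mlst_summary` is made up for this example and is not part of mlst.

```shell
# Hypothetical helper (not part of mlst): summarise default mlst output.
# Reads tab-separated lines on stdin and prints: filename, scheme, ST, locus count.
mlst_summary() {
  awk -F'\t' '{ print $1 "\t" $2 "\t" $3 "\t" (NF - 3) }'
}

# Sample line taken from the example above.
printf 'contigs.fa\tneisseria\t11149\tabcZ(672)\tadk(3)\taroE(4)\tfumC(3)\tgdh(8)\tpdhC(4)\tpgm(6)\n' |
  mlst_summary
```

Everything after the third column is treated as an allele call, so the locus count is simply the number of remaining fields.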

You can give it multiple files at once, and they can be in FASTA/GenBank/EMBL format, and even compressed with gzip, bzip2 or zip.

% mlst genomes/*
genomes/6008.fna        saureus         239  arcc(2)   aroe(3)   glpf(1)   gmk_(1)   pta_(4)   tpi_(4)   yqil(3)
genomes/strep.fasta.gz  ssuis             1  aroA(1)   cpn60(1)  dpr(1)    gki(1)    mutS(1)   recA(1)   thrA(1)
genomes/NC_002973.gbk   lmonocytogenes    1  abcZ(3)   bglA(1)   cat(1)    dapE(1)   dat(3)    ldh(1)    lhkA(3)
genomes/L550.gbk.bz2    leptospira      152  glmU(26)  pntA(30)  sucA(28)  tpiA(35)  pfkB(39)  mreA(29)  caiB(29)

Without auto-detection

You can force a particular scheme (useful for reporting systems):

% mlst --scheme neisseria NM*
NM003.fa   neisseria  4821  abcZ(222)  adk(3)  aroE(58)  fumC(275)  gdh(30)  pdhC(5)  pgm(255)
NM005.gbk  neisseria  177   abcZ(7)    adk(8)  aroE(10)  fumC(38)   gdh(10)  pdhC(1)  pgm(20)
NM011.fa   neisseria  11    abcZ(2)    adk(3)  aroE(4)   fumC(3)    gdh(8)   pdhC(4)  pgm(6)
NMC.gbk.gz neisseria  8     abcZ(2)    adk(3)  aroE(7)   fumC(2)    gdh(8)   pdhC(5)  pgm(2)

You can make mlst behave like older versions, from before auto-detection existed, by providing the --legacy parameter together with the --scheme parameter. In that case it prints a fixed tabular output with a heading containing the allele names specific to that scheme:

% mlst --legacy --scheme neisseria *.fa
FILE      SCHEME     ST    abcZ  adk  aroE  fumC  gdh  pdhC  pgm
NM003.fa  neisseria  11    2     3    4     3       8     4    6
NM009.fa  neisseria  11149 672   3    4     3       8     4    6
MN043.fa  neisseria  11    2     3    4     3       8     4    6
NM051.fa  neisseria  11    2     3    4     3       8     4    6
NM099.fa  neisseria  1287  2     3    4    17       8     4    6
NM110.fa  neisseria  11    2     3    4     3       8     4    6

Available schemes

To see which MLST schemes are supported:

% mlst --info | csvtk -t pretty

SCHEME           LOCII   TYPES   ALLELES   DATE         LOCII_NAMES
--------------   -----   -----   -------   ----------   --------------------------------------------------
mbovis           7       193     154       2025-06-25   adh1 gltX gpsA gyrB pta2 tdk tkt
mhominis_3       11      43      190       2023-11-05   eST uvrA gyrB ftsY tuf gap p120' vaa lmp1 lmp3 p60
mhyopneumoniae   3       255     254       2025-12-14   adk rpoB tpiA
mcanis           7       83      153       2019-10-21   ack cpn60 fdh pta purA sar tuf
mhyorhinis       6       265     148       2025-08-20   dnaA rpoB gyrB gltX adk gmk
mgallisepticum   7       119     249       2025-12-05   atpG dppC DUF3196 lgT mraW plsC ugpA
mflocculare      3       8       22        2018-07-03   adk rpoB tpiA
...

This output is TSV by default but will honour the --csv option. The older --list and --longlist options are still available for backward compatibility.

Missing data

mlst does not just look for exact matches to full length alleles. It attempts to tell you as much as possible about what it found using the notation below:

Symbol  Meaning                                Length      Identity
n       exact intact allele                    100%        100%
~n      novel full length allele similar to n  100%        ≥ --minid
n?      partial match to known allele          ≥ --mincov  ≥ --minid
-       allele missing                         < --mincov  < --minid
n,m     multiple alleles
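The notation above is easy to classify mechanically. The sketch below is one way to do it in plain shell; `classify_call` is a hypothetical helper, and it expects the bare call (the part inside the parentheses), not the full `locus(call)` string.

```shell
# Hypothetical helper (not part of mlst): name the kind of allele call,
# following the notation table above. Takes the bare call, e.g. "42", "~42", "42?".
classify_call() {
  case "$1" in
    -)     echo missing ;;
    *,*)   echo multiple ;;   # checked first, since each member may carry ~ or ?
    "~"*)  echo novel ;;
    *"?")  echo partial ;;
    *)     echo exact ;;
  esac
}

for call in 4 '~83' '15?' - '2,5'; do
  printf '%s\t%s\n' "$call" "$(classify_call "$call")"
done
```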

Scoring system

Each MLST prediction gets a score out of 100. The score for a scheme with N alleles is as follows:

Points  For                                            Example
+90/N   exact allele match                             42
+63/N   novel allele match (70% of an exact allele)    ~42
+18/N   partial allele match (20% of an exact allele)  42?
0       missing allele                                 -
+10     a matching ST type for the allele combination  248

It is possible to filter results using the --minscore option which takes a value between 1 and 100. If you only want to report known ST types, then use --minscore 100. To also include novel combinations of existing alleles with no ST type, use --minscore 90. The default is --minscore 50 which is an ad hoc value I have found allows for genuine partial ST matches but eliminates false positives.
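To make the arithmetic concrete, here is a small sketch that applies the points listed above to counts of each kind of allele call. `mlst_score` is a hypothetical helper, not the tool's own code, and it prints one decimal place because how mlst itself rounds the score is not stated here.

```shell
# Hypothetical helper (not part of mlst): apply the points table above.
# Usage: mlst_score N_LOCI EXACT NOVEL PARTIAL HAS_ST   (HAS_ST is 1 or 0)
mlst_score() {
  awk -v n="$1" -v e="$2" -v v="$3" -v p="$4" -v st="$5" 'BEGIN {
    printf "%.1f\n", (90 * e + 63 * v + 18 * p) / n + 10 * st
  }'
}

mlst_score 7 7 0 0 1   # seven exact alleles with a known ST   -> 100.0
mlst_score 6 6 0 0 0   # six exact alleles, no ST assigned     -> 90.0
mlst_score 7 5 2 0 0   # five exact and two novel alleles      -> 82.3
```

The second call matches the score of 90 reported for the NOVEL mgenitalium result in Quick Start, where all six alleles are exact but the combination has no ST.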

Output formats

There are 3 output formats. I recommend using --full mode. By default they are TSV, but CSV can be enabled with --csv.

Default

This format does not have any column headings.

Column  Description    Example
1       Filename       genome.gbk
2       Scheme         mgenitalium
3       Sequence Type  148
4       Allele 1       adk(7)
5       Allele 2       atpA(1)
6+      Allele 3 ...   ...

Full --full (recommended)

This preferred format has 6 columns:

Column   Description             Example
FILE     Input filename          genome.gbk
SCHEME   Auto-detected scheme    mgenitalium
ST       Sequence Type assigned  148
STATUS   Quality of genotype     NOVEL (see Status below)
SCORE    Score of genotype       90
ALLELES  Identified alleles      adk(7);atpA(1);gmk(1);gyrB(1);pgm(3);ppa(1)

Status

These codes are in development. Some of them are stable, but others are subject to change.

STATUS   Meaning                                  Stable?
PERFECT  Exact matches to a known ST              YES
NOVEL    Exact matches, but no ST yet             YES
NONE     No allele matches whatsoever             YES
MIXED    Has at least one mixed allele            YES
MISSING  Has at least one missing allele          no
BAD      If none of the above & score below 70    no
OK       If none of the above                     no
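When screening many genomes it can help to tally the STATUS column. The sketch below does that for --full TSV output read from stdin; `status_counts` is a hypothetical helper, it assumes the first line is the column header shown in Quick Start, and the sample rows are invented and abbreviated for illustration.

```shell
# Hypothetical helper (not part of mlst): count rows per STATUS in --full TSV output.
# Assumes line 1 is the FILE/SCHEME/ST/STATUS/SCORE/ALLELES header.
status_counts() {
  awk -F'\t' 'NR > 1 { n[$4]++ } END { for (s in n) print s "\t" n[s] }' | sort
}

# Invented, abbreviated rows in the --full layout.
{
  printf 'FILE\tSCHEME\tST\tSTATUS\tSCORE\tALLELES\n'
  printf 'a.fa\tsaureus\t239\tPERFECT\t100\tarcC(2);aroE(3)\n'
  printf 'b.fa\tsaureus\t-\tNOVEL\t90\tarcC(2);aroE(1)\n'
  printf 'c.fa\tsaureus\t1\tPERFECT\t100\tarcC(1);aroE(1)\n'
} | status_counts
```

In real use the input would come from something like `mlst --full genomes/* | status_counts`.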

Legacy --legacy

This format has a variable number of columns per line, depending on how many alleles are in the scheme found. This makes it hard to use for mixtures of species, so you should use --full for that case.

Column    Description      Example
FILE      Input filename   genome.gbk
SCHEME    From --scheme    mgenitalium
ST        Sequence Type    148
ALLELE_1  Allele 1 number  7
ALLELE_2  Allele 2 number  1
ALLELE_n  Allele n number  integer

Tweaking the output

The output is TSV (tab-separated values). This makes it easy to parse and manipulate with Unix utilities like cut and sort etc. For example, if you only want the filename and ST you can do the following:

% mlst --scheme abaumannii AB*.fasta | cut -f1,3 > ST.tsv

If you prefer CSV because it loads more smoothly into MS Excel, use the --csv option:

% mlst --csv Peptobismol.fna.gz > mlst.csv

JSON output is available too; it returns an array of dictionaries, one per input file. The id will be the same as filename unless --label is used, but that only works when scanning a single file.

% mlst -q --json out.json test/example.gbk.gz test/novel.fasta.bz2
% cat out.json
[
   {
      "scheme" : "sepidermidis",
      "alleles" : {
         "mutS" : "1",
         "yqiL" : "1",
         "tpiA" : "1",
         "pyrR" : "2",
         "gtr" : "2",
         "aroE" : "1",
         "arcC" : "16"
      },
      "sequence_type" : "184",
      "filename" : "test/example.gbk.gz",
      "id" : "test/example.gbk.gz"
   },
   {
      "sequence_type" : "-",
      "filename" : "test/novel.fasta.bz2",
      "scheme" : "spneumoniae",
      "alleles" : {
         "gki" : "2",
         "aroE" : "7",
         "ddl" : "22",
         "gdh" : "15",
         "xpt" : "1",
         "recP" : "~10",
         "spi" : "6"
      },
      "id" : "test/novel.fasta.bz2"
   }
]
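If jq happens to be installed, the JSON report can be flattened back into a table. This is only a sketch: `json_to_tsv` is a hypothetical helper, jq is a separate tool that mlst does not require, and the sample file is a cut-down copy of the report above.

```shell
# Hypothetical helper (not part of mlst): flatten the JSON report to TSV with jq.
# Assumes jq is installed and the file has the array-of-objects layout shown above.
json_to_tsv() {
  jq -r '.[] | [.id, .scheme, .sequence_type] | @tsv' "$1"
}

# Cut-down copy of the report above, written to a temporary file for the demo.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
[
  {"id": "test/example.gbk.gz", "scheme": "sepidermidis", "sequence_type": "184"},
  {"id": "test/novel.fasta.bz2", "scheme": "spneumoniae", "sequence_type": "-"}
]
EOF
if command -v jq >/dev/null 2>&1; then
  json_to_tsv "$tmp"
fi
rm -f "$tmp"
```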

You can also save the "novel" alleles for submission to PubMLST:

% mlst -q --novel nouveau.fa s_myces.fasta

% cat nouveau.fa

>streptomyces.recA-e562a2cd93e701e3b58ba0670bcbba0c s_myces.fasta
GACGTGGCCCTCGGCGTCGGCGGTCTGCCGCGCGGCCGCGTCGTCGAGATCTACGGACCGGAGTCCTCC...

The format of the sequence IDs is scheme.allele-hash filename where hash is the hexadecimal MD5 digest of the allele DNA sequence.
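The ID format can be split back into its parts with shell parameter expansion. The sketch below uses the header from the example above; `parse_novel_header` is a hypothetical helper, and it assumes the scheme name contains no dot and the filename follows the first space.

```shell
# Hypothetical helper (not part of mlst): split a --novel FASTA header into
# scheme, allele (locus), hash and filename, per the ID format described above.
parse_novel_header() {
  header=${1#>}        # drop the leading ">"
  id=${header%% *}     # scheme.allele-hash
  file=${header#* }    # everything after the first space
  scheme=${id%%.*}     # text before the first dot
  rest=${id#*.}        # allele-hash
  hash=${rest##*-}     # text after the last hyphen
  allele=${rest%-*}    # text before the last hyphen
  printf '%s\t%s\t%s\t%s\n' "$scheme" "$allele" "$hash" "$file"
}

parse_novel_header '>streptomyces.recA-e562a2cd93e701e3b58ba0670bcbba0c s_myces.fasta'
```

Splitting on the last hyphen rather than the first keeps locus names that themselves contain a hyphen intact.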

Mapping to genus/species

Included is a file called db/scheme_species_map.tab which has 3 tab-separated columns as follows:

#SCHEME GENUS   SPECIES
abaumannii      Acinetobacter   baumannii
abaumannii_2    Acinetobacter   baumannii
achromobacter   Achromobacter
aeromonas       Aeromonas
afumigatus      Aspergillus     afumigatus
arcobacter      Arcobacter
bburgdorferi    Borrelia        burgdorferi
bhampsonii      Brachyspira     hampsonii
bhenselae       Bartonella      henselae
borrelia        Borrelia
bpilosicoli     Brachyspira     pilosicoli
<snip>

Note that some schemes are species-specific while others are genus-specific; for the latter the SPECIES column is empty. The same species or genus can also apply to multiple schemes, see abaumannii above.
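A lookup against that file is a one-liner in awk. In the sketch below `scheme_taxon` is a hypothetical helper, and the temporary file is a small stand-in for db/scheme_species_map.tab built from rows in the listing above.

```shell
# Hypothetical helper (not part of mlst): look up genus and species for a scheme.
# Usage: scheme_taxon MAPFILE SCHEME
scheme_taxon() {
  awk -F'\t' -v s="$2" '!/^#/ && $1 == s { print $2 ($3 != "" ? " " $3 : "") }' "$1"
}

# Small stand-in for db/scheme_species_map.tab.
map=$(mktemp)
printf '#SCHEME\tGENUS\tSPECIES\n' > "$map"
printf 'abaumannii\tAcinetobacter\tbaumannii\n' >> "$map"
printf 'borrelia\tBorrelia\n' >> "$map"

scheme_taxon "$map" abaumannii   # species-specific scheme: genus and species
scheme_taxon "$map" borrelia     # genus-only scheme: genus alone
rm -f "$map"
```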

Updating the bundled database

The mlst software no longer provides a script to update the database. This is because PubMLST now requires a user account and a private key to access data through the PubMLST API. You can use the mlstdb tool to help you do this.

If you do download a new database, make sure it's in /path/to/mlst/db/pubmlst and run scripts/mlst-make_blast_db before attempting to run mlst.

Adding a scheme

If you want to add a custom private scheme to mlst, you can do so by following the layout described below.

The directory structure

Each MLST scheme exists in a folder within the mlst/db/pubmlst folder. The name of the folder is the scheme name, say saureus for Staphylococcus aureus. It contains files like this:

% cd mlst/db/pubmlst/saureus
% ls -1
saureus.txt
arcC.tfa
aroE.tfa
glpF.tfa
gmk.tfa
pta.tfa
tpi.tfa
yqiL.tfa

The folder name (ie. saureus) must be the same name as the scheme file (ie. saureus.txt) or it will not work.

The scheme file

The saureus.txt is a tab-separated file containing one ST definition per row. The header line must be present. Extra columns with names mlst_clade,clonal_complex,species,CC,Lineage are ignored.

% head -n 5 saureus.txt
ST      arcC    aroE    glpF    gmk     pta     tpi     yqiL    clonal_complex
1       1       1       1       1       1       1       1
2       2       2       2       2       2       2       26
3       1       1       1       9       1       1       12
4       10      10      8       6       10      3       2

The allele sequence files

Each .tfa file is a nucleotide FASTA file containing the allele sequences for one locus. There must be a .tfa file for each and every locus in the TSV scheme .txt file. Here is what the arcC.tfa file looks like:

% head -n 20 arcC.tfa
>arcC_1
TTATTAATCCAACAAGCTAAATCGAACAGTGACACAACGCCGGCAATGCCATTGGATACT
TGTGGTGCAATGTCACAGGGTATGATAGGCTATTGGTTGGAAACTGAAATCAATCGCATT
TTAACTGAAATGAATAGTGATAGAACTGTAGGCACAATCGTTACACGTGTGGAAGTAGAT
AAAGATGATCCACGATTCAATAACCCAACCAAACCAATTGGTCCTTTTTATACGAAAGAA
GAAGTTGAAGAATTACAAAAAGAACAGCCAGACTCAGTCTTTAAAGAAGATGCAGGACGT
GGTTATAGAAAAGTAGTTGCGTCACCACTACCTCAATCTATACTAGAACACCAGTTAATT
CGAACTTTAGCAGACGGTAAAAATATTGTCATTGCATGCGGTGGTGGCGGTATTCCAGTT
ATAAAAAAAGAAAATACCTATGAAGGTGTTGAAGCG
>arcC_2
TTATTAATCCAACAAGCTAAATCGAACAGTGACACAACGCCGGCAATGCCATTGGATACT
TGTGGTGCAATGTCACAAGGTATGATAGGCTATTGGTTGGAAACTGAAATCAATCGCATT
TTAACTGAAATGAATAGTGATAGAACTGTAGGCACAATCGTAACACGTGTGGAAGTAGAT
AAAGATGATCCACGATTTGATAACCCAACTAAACCAATTGGTCCTTTTTATACGAAAGAA
GAAGTTGAAGAATTACAAAAAGAACAGCCAGGCTCAGTCTTTAAAGAAGATGCAGGACGT
GGTTATAGAAAAGTAGTTGCGTCACCACTACCTCAATCTATACTAGAACACCAGTTAATT
CGAACTTTAGCAGACGGTAAAAATATTGTCATTGCATGCGGTGGTGGCGGTATTCCAGTT
ATAAAAAAAGAAAATACCTATGAAGGTGTTGAAGCG

The FASTA sequence IDs must be named as >allele_number or >allele-number. Ideally the sequences will not contain any ambiguous IUPAC symbols. i.e. just A,T,C,G.
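The layout rules above (folder name equals scheme file name, one .tfa per locus column) can be checked before rebuilding the BLAST indices. The sketch below is a hypothetical pre-flight check, not something shipped with mlst; it treats every header column other than ST and the ignored extras listed earlier as a locus.

```shell
# Hypothetical pre-flight check (not part of mlst) for a custom scheme folder.
# Usage: check_scheme_dir path/to/db/pubmlst/SCHEME
check_scheme_dir() {
  dir=${1%/}
  scheme=${dir##*/}
  txt="$dir/$scheme.txt"
  [ -f "$txt" ] || { echo "missing $txt"; return 1; }
  status=0
  # Every header column except ST and the ignored extras should have a .tfa file.
  for locus in $(head -n 1 "$txt" | tr '\t' '\n' |
      grep -v -x -E 'ST|mlst_clade|clonal_complex|species|CC|Lineage'); do
    [ -f "$dir/$locus.tfa" ] || { echo "missing $dir/$locus.tfa"; status=1; }
  done
  return $status
}

# Demo on a throwaway folder that deliberately lacks aroE.tfa.
demo=$(mktemp -d)
mkdir "$demo/saureus"
printf 'ST\tarcC\taroE\tclonal_complex\n1\t1\t1\n' > "$demo/saureus/saureus.txt"
printf '>arcC_1\nACGT\n' > "$demo/saureus/arcC.tfa"
check_scheme_dir "$demo/saureus" || echo "scheme folder is incomplete"
rm -rf "$demo"
```

It does not inspect the FASTA IDs or sequence content, so a clean result only means the expected files are present.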

Adding a new scheme

  1. Make a new folder in mlst/db/pubmlst/SCHEME
  2. Put your SCHEME.txt file in there
  3. Put your ALLELE.tfa files in there
  4. Run mlst/scripts/mlst-make_blast_db to update the BLAST indices
  5. Run mlst --info | grep SCHEME to see if it exists
  6. Run mlst --scheme SCHEME file.fasta to see if it works

If it doesn't - go back and check you really did do Step 4 above.

Citations

The mlst software incorporates components of the PubMLST database which must be cited in any publications that use mlst:

"This publication made use of the PubMLST website (https://pubmlst.org/) developed by Keith Jolley (Wellcome Open Res. 2018 Sep 24;3:124) and sited at the University of Oxford. The development of that website was funded by the Wellcome Trust".

You should also cite this software (currently unpublished) as:

Feedback

Please submit via the GitHub Issues page

Licence

GPL v2

Author
