mlst

Scan contig files against traditional PubMLST typing schemes

Quick Start

% mlst contigs.fa
contigs.fa  neisseria  11149  abcZ(672) adk(3) aroE(4) fumC(3) gdh(8) pdhC(4) pgm(6)

% mlst genome.gbk.gz
genome.gbk.gz  sepidermidis  184  arcC(16) aroE(1) gtr(2) mutS(1) pyrR(2) tpiA(1) yqiL(1)

% mlst --full new.fa
FILE    SCHEME       ST  STATUS  SCORE  ALLELES
new.fa  mgenitalium  -   NOVEL   90      MLST_adk(7);MLST_atpA(1);MLST_gmk(1);MLST_gyrB(1);MLST_pgm(3);MLST_ppa(1)

% mlst --label Anthrax GCF_001941925.1_ASM194192v1_genomic.fna.bz2
Anthrax  bcereus  -  glp(24) gmk(1) ilv(~83) pta(1) pur(~71) pyc(37) tpi(41)

% mlst --nopath /opt/data/refseq/S_pyogenes/*.fna
NC_018936.fna  spyogenes  28   gki(4)   gtr(3)   murI(4)   mutS(4)  recP(4)    xpt(2)   yqiL(4)
NC_017596.fna  spyogenes  11   gki(2)   gtr(6)   murI(1)   mutS(2)  recP(2)    xpt(2)   yqiL(2)
NC_008022.fna  spyogenes  55   gki(11)  gtr(9)   murI(1)   mutS(9)  recP(2)    xpt(3)   yqiL(4)
NC_006086.fna  spyogenes  382  gki(5)   gtr(52)  murI(5)   mutS(5)  recP(5)    xpt(4)   yqiL(3)
NC_008024.fna  spyogenes  -    gki(5)   gtr(11)  murI(8)   mutS(5)  recP(15?)  xpt(2)   yqiL(1)
NC_017040.fna  spyogenes  172  gki(56)  gtr(24)  murI(39)  mutS(7)  recP(30)   xpt(2)   yqiL(33)

% mlst --full --fofn files.txt --csv --outfile mlst.csv
# data saved in 'mlst.csv'

Installation

Conda

If you are using Conda:

% conda install -c conda-forge -c bioconda  mlst

Source

% cd $HOME
% git clone https://github.com/tseemann/mlst.git
% $HOME/mlst/bin/mlst --help

Usage

Simply give it a genome file in FASTA/GenBank/EMBL format, optionally compressed with gzip, zip or bzip2.

% mlst contigs.fa
contigs.fa  neisseria  11149  abcZ(672) adk(3) aroE(4) fumC(3) gdh(8) pdhC(4) pgm(6)

It returns a tab-separated line containing

the filename
the matching PubMLST scheme name
the ST (sequence type)
the allele IDs
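
As a minimal sketch of how such a line could be consumed downstream, the Python below splits one default-format line into the four parts listed above. The function name `parse_mlst_line` and the regular expression are illustrative assumptions, not part of mlst itself.

```python
import re

# Matches one allele field such as "abcZ(672)" or "recP(15?)".
ALLELE_RE = re.compile(r"^(?P<locus>[^()]+)\((?P<call>[^()]*)\)$")


def parse_mlst_line(line):
    """Split one default-format mlst line into filename, scheme, ST and alleles."""
    fields = line.rstrip("\n").split("\t")
    filename, scheme, st = fields[:3]
    alleles = {}
    for field in fields[3:]:
        match = ALLELE_RE.match(field.strip())
        if match:
            alleles[match.group("locus")] = match.group("call")
    return {"filename": filename, "scheme": scheme, "st": st, "alleles": alleles}


example = "contigs.fa\tneisseria\t11149\tabcZ(672)\tadk(3)\taroE(4)\tfumC(3)\tgdh(8)\tpdhC(4)\tpgm(6)"
record = parse_mlst_line(example)
print(record["scheme"], record["st"], record["alleles"]["abcZ"])
```

The allele calls are kept as strings because they may carry notation such as `~` or `?` rather than being plain integers.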

You can give it multiple files at once, in any mix of FASTA/GenBank/EMBL formats, each optionally compressed with gzip, bzip2 or zip.

% mlst genomes/*
genomes/6008.fna        saureus         239  arcc(2)   aroe(3)   glpf(1)   gmk_(1)   pta_(4)   tpi_(4)   yqil(3)
genomes/strep.fasta.gz  ssuis             1  aroA(1)   cpn60(1)  dpr(1)    gki(1)    mutS(1)   recA(1)   thrA(1)
genomes/NC_002973.gbk   lmonocytogenes    1  abcZ(3)   bglA(1)   cat(1)    dapE(1)   dat(3)    ldh(1)    lhkA(3)
genomes/L550.gbk.bz2    leptospira      152  glmU(26)  pntA(30)  sucA(28)  tpiA(35)  pfkB(39)  mreA(29)  caiB(29)

Without auto-detection

You can force a particular scheme (useful for reporting systems):

% mlst --scheme neisseria NM*
NM003.fa   neisseria  4821  abcZ(222)  adk(3)  aroE(58)  fumC(275)  gdh(30)  pdhC(5)  pgm(255)
NM005.gbk  neisseria  177   abcZ(7)    adk(8)  aroE(10)  fumC(38)   gdh(10)  pdhC(1)  pgm(20)
NM011.fa   neisseria  11    abcZ(2)    adk(3)  aroE(4)   fumC(3)    gdh(8)   pdhC(4)  pgm(6)
NMC.gbk.gz neisseria  8     abcZ(2)    adk(3)  aroE(7)   fumC(2)    gdh(8)   pdhC(5)  pgm(2)

You can make mlst behave like older versions, from before auto-detection existed, by providing the --legacy parameter together with the --scheme parameter. In that case it will print a fixed tabular output with a heading containing allele names specific to that scheme:

% mlst --legacy --scheme neisseria *.fa
FILE      SCHEME     ST    abcZ  adk  aroE  fumC  gdh  pdhC  pgm
NM003.fa  neisseria  11    2     3    4     3       8     4    6
NM009.fa  neisseria  11149 672   3    4     3       8     4    6
MN043.fa  neisseria  11    2     3    4     3       8     4    6
NM051.fa  neisseria  11    2     3    4     3       8     4    6
NM099.fa  neisseria  1287  2     3    4    17       8     4    6
NM110.fa  neisseria  11    2     3    4     3       8     4    6

Available schemes

To see which MLST schemes are supported:

% mlst --info | csvtk -t pretty

SCHEME           LOCII   TYPES   ALLELES   DATE         LOCII_NAMES
--------------   -----   -----   -------   ----------   --------------------------------------------------
mbovis           7       193     154       2025-06-25   adh1 gltX gpsA gyrB pta2 tdk tkt
mhominis_3       11      43      190       2023-11-05   eST uvrA gyrB ftsY tuf gap p120' vaa lmp1 lmp3 p60
mhyopneumoniae   3       255     254       2025-12-14   adk rpoB tpiA
mcanis           7       83      153       2019-10-21   ack cpn60 fdh pta purA sar tuf
mhyorhinis       6       265     148       2025-08-20   dnaA rpoB gyrB gltX adk gmk
mgallisepticum   7       119     249       2025-12-05   atpG dppC DUF3196 lgT mraW plsC ugpA
mflocculare      3       8       22        2018-07-03   adk rpoB tpiA
...

This output is TSV by default but will honour the --csv option. The older --list and --longlist options are still available for backward compatibility.

Missing data

mlst does not just look for exact matches to full length alleles. It attempts to tell you as much as possible about what it found using the notation below:

Symbol	Meaning	Length	Identity
`n`	exact intact allele	100%	100%
`~n`	novel full length allele similar to n	100%	≥ `--minid`
`n?`	partial match to known allele	≥ `--mincov`	≥ `--minid`
`-`	allele missing	< `--mincov`	< `--minid`
`n,m`	multiple alleles
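
The notation can be decoded mechanically. The sketch below is one possible classifier for a single allele call (the text inside the parentheses); the function name and the returned labels are assumptions chosen for illustration.

```python
def classify_allele(call):
    """Map one allele call string to a label from the notation table."""
    if call == "-":
        return "missing"
    if "," in call:
        return "multiple"
    if call.startswith("~"):
        return "novel"
    if call.endswith("?"):
        return "partial"
    if call.isdigit():
        return "exact"
    return "unknown"


for call in ["42", "~83", "15?", "-", "3,7"]:
    print(call, classify_allele(call))
```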

Scoring system

Each MLST prediction gets a score out of 100. The score for a scheme with N alleles is as follows:

Points	For	Example
+90/N	exact allele match	`42`
+63/N	novel allele match (70% of an exact allele)	`~42`
+18/N	partial allele match (20% of an exact allele)	`42?`
0	missing allele	`-`
+10	a matching ST type for the allele combination	`248`

It is possible to filter results using the --minscore option which takes a value between 1 and 100. If you only want to report known ST types, then use --minscore 100. To also include novel combinations of existing alleles with no ST type, use --minscore 90. The default is --minscore 50 which is an ad hoc value I have found allows for genuine partial ST matches but eliminates false positives.
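
To make the arithmetic concrete, here is a small sketch that applies the points table to a list of allele calls. It is an illustration of the table only, not mlst's own code: the function names are invented, any rounding mlst applies is not reproduced, and multi-allele calls (`n,m`) are rejected because the table does not say how they are scored.

```python
POINTS = {"exact": 90, "novel": 63, "partial": 18, "missing": 0}


def call_kind(call):
    """Classify one allele call using the notation from 'Missing data'."""
    if call == "-":
        return "missing"
    if "," in call:
        raise ValueError("scoring of multiple alleles is not covered by the table")
    if call.startswith("~"):
        return "novel"
    if call.endswith("?"):
        return "partial"
    return "exact"


def mlst_score(calls, st):
    """Score = sum of per-allele points divided by N, plus 10 if an ST was assigned."""
    n = len(calls)
    score = sum(POINTS[call_kind(c)] for c in calls) / n
    if st != "-":
        score += 10
    return score


# Six exact alleles but no ST: a novel combination of known alleles.
print(mlst_score(["7", "1", "1", "1", "3", "1"], "-"))
# Seven exact alleles with a known ST.
print(mlst_score(["672", "3", "4", "3", "8", "4", "6"], "11149"))
```

The first call corresponds to the `--minscore 90` case described above, and the second to the `--minscore 100` case.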

Output formats

There are 3 output formats. I recommend using --full mode. By default they are TSV, but CSV can be enabled with --csv.

Default

This format does not have any column headings.

Column	Description	Example
1	Filename	`genome.gbk`
2	Scheme	`mgenitalium`
3	Sequence Type	`148`
4	Allele 1	`adk(7)`
5	Allele 2	`atpA(1)`
6 +	Allele 3 ...	...

Full `--full` (recommended)

This preferred format has 6 columns:

Column	Description	Example
FILE	Input filename	`genome.gbk`
SCHEME	Auto-detected scheme	`mgenitalium`
ST	Sequence Type assigned	`148`
STATUS	Quality of genotype	`NOVEL` (see Status below)
SCORE	Score of genotype	`90`
ALLELES	Identified alleles	`adk(7);atpA(1);gmk(1);gyrB(1);pgm(3);ppa(1)`
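
Because this format has a header row and a fixed number of columns, it can be read with a standard CSV reader. The following is a sketch using Python's `csv` module on an in-memory sample; the helper name `read_full` is an assumption, and the score is parsed as a float in case it is not always a whole number.

```python
import csv
import io

# A two-line sample in the --full layout (tab-separated).
sample = (
    "FILE\tSCHEME\tST\tSTATUS\tSCORE\tALLELES\n"
    "new.fa\tmgenitalium\t-\tNOVEL\t90\tadk(7);atpA(1);gmk(1);gyrB(1);pgm(3);ppa(1)\n"
)


def read_full(handle, delimiter="\t"):
    """Read --full output; pass delimiter=',' for output written with --csv."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        row["SCORE"] = float(row["SCORE"])
        row["ALLELES"] = row["ALLELES"].split(";") if row["ALLELES"] else []
        rows.append(row)
    return rows


rows = read_full(io.StringIO(sample))
print(rows[0]["FILE"], rows[0]["STATUS"], rows[0]["SCORE"], len(rows[0]["ALLELES"]))
```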

Status

These codes are in development. Some of them are stable, but others are subject to change.

STATUS	Meaning	Stable?
PERFECT	Exact matches to a known ST	YES
NOVEL	Exact matches, but no ST yet	YES
NONE	No allele matches whatsoever	YES
MIXED	Has at least one mixed allele	YES
MISSING	Has at least one missing allele	no
BAD	If none of the above & score below 70	no
OK	If none of the above	no

Legacy `--legacy`

This format has a variable number of columns per line, depending on how many alleles are in the scheme found. This makes it hard to use for mixtures of species, so you should use --full in that case.

Column	Description	Example
FILE	Input filename	`genome.gbk`
SCHEME	From `--scheme`	`mgenitalium`
ST	Sequence Type	`148`
ALLELE_1	Allele 1 number	`7`
ALLELE_2	Allele 2 number	`1`
ALLELE_n	Allele number	integer

Tweaking the output

The output is TSV (tab-separated values). This makes it easy to parse and manipulate with Unix utilities like cut and sort etc. For example, if you only want the filename and ST you can do the following:

% mlst --scheme abaumannii AB*.fasta | cut -f1,3 > ST.tsv

If you prefer CSV because it loads more smoothly into MS Excel, use the --csv option:

% mlst --csv Peptobismol.fna.gz > mlst.csv

JSON output is available too; it returns an array of dictionaries, one per input file. The id is the same as the filename unless --label is used, and --label only works when scanning a single file.

% mlst -q --json out.json test/example.gbk.gz test/novel.fasta.bz2
% cat out.json
[
   {
      "scheme" : "sepidermidis",
      "alleles" : {
         "mutS" : "1",
         "yqiL" : "1",
         "tpiA" : "1",
         "pyrR" : "2",
         "gtr" : "2",
         "aroE" : "1",
         "arcC" : "16"
      },
      "sequence_type" : "184",
      "filename" : "test/example.gbk.gz",
      "id" : "test/example.gbk.gz"
   },
   {
      "sequence_type" : "-",
      "filename" : "test/novel.fasta.bz2",
      "scheme" : "spneumoniae",
      "alleles" : {
         "gki" : "2",
         "aroE" : "7",
         "ddl" : "22",
         "gdh" : "15",
         "xpt" : "1",
         "recP" : "~10",
         "spi" : "6"
      },
      "id" : "test/novel.fasta.bz2"
   }
]
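
As a sketch of how this JSON could be post-processed, the Python below loads an abbreviated copy of the output above and lists the records without an ST together with their non-exact loci. The helper name `untyped_records` is hypothetical; only the keys shown in the example output are used.

```python
import json

# Abbreviated copy of the JSON shown above (two loci per record).
text = """
[
  {"id": "test/example.gbk.gz", "filename": "test/example.gbk.gz",
   "scheme": "sepidermidis", "sequence_type": "184",
   "alleles": {"arcC": "16", "aroE": "1"}},
  {"id": "test/novel.fasta.bz2", "filename": "test/novel.fasta.bz2",
   "scheme": "spneumoniae", "sequence_type": "-",
   "alleles": {"recP": "~10", "spi": "6"}}
]
"""


def untyped_records(records):
    """Return (id, scheme, non-exact loci) for every record with no ST."""
    result = []
    for rec in records:
        if rec["sequence_type"] == "-":
            flagged = sorted(
                locus for locus, call in rec["alleles"].items() if not call.isdigit()
            )
            result.append((rec["id"], rec["scheme"], flagged))
    return result


records = json.loads(text)
print(untyped_records(records))
```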

You can also save the "novel" alleles for submission to PubMLST:

% mlst -q --novel nouveau.fa s_myces.fasta

% cat nouveau.fa

>streptomyces.recA-e562a2cd93e701e3b58ba0670bcbba0c s_myces.fasta
GACGTGGCCCTCGGCGTCGGCGGTCTGCCGCGCGGCCGCGTCGTCGAGATCTACGGACCGGAGTCCTCC...

The format of the sequence IDs is scheme.allele-hash filename where hash is the hexadecimal MD5 digest of the allele DNA sequence.
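
For illustration, an ID in this shape can be built with Python's `hashlib`. This is a sketch under assumptions: the function name is invented, and whether mlst normalises the sequence (for example its case or surrounding whitespace) before hashing is not stated here, so the digest is computed on the string exactly as given.

```python
import hashlib


def novel_allele_id(scheme, locus, sequence):
    """Build 'scheme.allele-hash' using the hexadecimal MD5 of the DNA sequence."""
    digest = hashlib.md5(sequence.encode("ascii")).hexdigest()
    return f"{scheme}.{locus}-{digest}"


# Hypothetical short sequence; the real example above is truncated with "...".
seq_id = novel_allele_id("streptomyces", "recA", "GACGTGGCCCTCGGCGTCGGC")
print(seq_id)
```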

Mapping to genus/species

Included is a file called db/scheme_species_map.tab which has 3 tab-separated columns as follows:

#SCHEME GENUS   SPECIES
abaumannii      Acinetobacter   baumannii
abaumannii_2    Acinetobacter   baumannii
achromobacter   Achromobacter
aeromonas       Aeromonas
afumigatus      Aspergillus     afumigatus
arcobacter      Arcobacter
bburgdorferi    Borrelia        burgdorferi
bhampsonii      Brachyspira     hampsonii
bhenselae       Bartonella      henselae
borrelia        Borrelia
bpilosicoli     Brachyspira     pilosicoli
<snip>

Note that some schemes are species specific, while others are genus specific, in which case the SPECIES column is empty. Note also that the same species/genus can apply to multiple schemes; see abaumannii above.
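
A minimal sketch of loading this mapping in Python follows. It assumes only what is described above: tab-separated columns, a header line starting with `#`, and an optional third column. The function name is hypothetical.

```python
def parse_species_map(lines):
    """Return {scheme: (genus, species)}; species is '' for genus-level schemes."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        genus = fields[1] if len(fields) > 1 else ""
        species = fields[2] if len(fields) > 2 else ""
        mapping[fields[0]] = (genus, species)
    return mapping


sample = [
    "#SCHEME\tGENUS\tSPECIES",
    "abaumannii\tAcinetobacter\tbaumannii",
    "abaumannii_2\tAcinetobacter\tbaumannii",
    "achromobacter\tAchromobacter",
]
species_map = parse_species_map(sample)
print(species_map["abaumannii_2"], species_map["achromobacter"])
```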

Updating the bundled database

The mlst software no longer provides a script to update the database. This is because PubMLST now requires a user account and a private key to access data through the PubMLST API. You can use the mlstdb tool to help you do this.

If you do download a new database, make sure it's in /path/to/mlst/db/pubmlst and run scripts/mlst-make_blast_db before attempting to run mlst.

Adding a scheme

If you want to use a custom private scheme with mlst, you can add it to the database as described below.

The directory structure

Each MLST scheme exists in a folder within the mlst/db/pubmlst folder. The name of the folder is the scheme name, say saureus for Staphylococcus aureus. It contains files like this:

% cd mlst/db/pubmlst/saureus
% ls -1
saureus.txt
arcC.tfa
aroE.tfa
glpF.tfa
gmk.tfa
pta.tfa
tpi.tfa
yqiL.tfa

The folder name (i.e. saureus) must be the same as the name of the scheme file (i.e. saureus.txt) or it will not work.

The scheme file

The saureus.txt is a tab-separated file containing one ST definition per row. The header line must be present. Extra columns with names mlst_clade,clonal_complex,species,CC,Lineage are ignored.

% head -n 5 saureus.txt
ST      arcC    aroE    glpF    gmk     pta     tpi     yqiL    clonal_complex
1       1       1       1       1       1       1       1
2       2       2       2       2       2       2       26
3       1       1       1       9       1       1       12
4       10      10      8       6       10      3       2
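
The layout above can be read into a lookup table from allele profile to ST. The sketch below assumes the ST is the first column, drops the ignored column names listed above, and tolerates rows where trailing extra columns are absent; the function name is hypothetical and this is not how mlst itself loads the file.

```python
IGNORED_COLUMNS = {"mlst_clade", "clonal_complex", "species", "CC", "Lineage"}


def parse_scheme(lines):
    """Return (loci, profiles) where profiles maps a tuple of allele numbers to an ST."""
    header = lines[0].rstrip("\n").split("\t")
    loci = [name for name in header[1:] if name not in IGNORED_COLUMNS]
    columns = [header.index(name) for name in loci]
    profiles = {}
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(columns):
            continue  # skip blank or incomplete rows
        profiles[tuple(fields[i] for i in columns)] = fields[0]
    return loci, profiles


sample = [
    "ST\tarcC\taroE\tglpF\tgmk\tpta\ttpi\tyqiL\tclonal_complex",
    "1\t1\t1\t1\t1\t1\t1\t1",
    "2\t2\t2\t2\t2\t2\t2\t26",
]
loci, profiles = parse_scheme(sample)
print(loci)
print(profiles[("2", "2", "2", "2", "2", "2", "26")])
```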

The allele sequence files

Each .tfa file is a nucleotide FASTA file with the allele sequences for one locus. There must be a .tfa file for each and every allele locus in the TSV scheme .txt file. Here is what the arcC.tfa file looks like:

% head -n 20 arcC.tfa
>arcC_1
TTATTAATCCAACAAGCTAAATCGAACAGTGACACAACGCCGGCAATGCCATTGGATACT
TGTGGTGCAATGTCACAGGGTATGATAGGCTATTGGTTGGAAACTGAAATCAATCGCATT
TTAACTGAAATGAATAGTGATAGAACTGTAGGCACAATCGTTACACGTGTGGAAGTAGAT
AAAGATGATCCACGATTCAATAACCCAACCAAACCAATTGGTCCTTTTTATACGAAAGAA
GAAGTTGAAGAATTACAAAAAGAACAGCCAGACTCAGTCTTTAAAGAAGATGCAGGACGT
GGTTATAGAAAAGTAGTTGCGTCACCACTACCTCAATCTATACTAGAACACCAGTTAATT
CGAACTTTAGCAGACGGTAAAAATATTGTCATTGCATGCGGTGGTGGCGGTATTCCAGTT
ATAAAAAAAGAAAATACCTATGAAGGTGTTGAAGCG
>arcC_2
TTATTAATCCAACAAGCTAAATCGAACAGTGACACAACGCCGGCAATGCCATTGGATACT
TGTGGTGCAATGTCACAAGGTATGATAGGCTATTGGTTGGAAACTGAAATCAATCGCATT
TTAACTGAAATGAATAGTGATAGAACTGTAGGCACAATCGTAACACGTGTGGAAGTAGAT
AAAGATGATCCACGATTTGATAACCCAACTAAACCAATTGGTCCTTTTTATACGAAAGAA
GAAGTTGAAGAATTACAAAAAGAACAGCCAGGCTCAGTCTTTAAAGAAGATGCAGGACGT
GGTTATAGAAAAGTAGTTGCGTCACCACTACCTCAATCTATACTAGAACACCAGTTAATT
CGAACTTTAGCAGACGGTAAAAATATTGTCATTGCATGCGGTGGTGGCGGTATTCCAGTT
ATAAAAAAAGAAAATACCTATGAAGGTGTTGAAGCG

The FASTA sequence IDs must be named as >allele_number or >allele-number. Ideally the sequences will not contain any ambiguous IUPAC symbols, i.e. just A, T, C and G.
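
These two rules are easy to check before adding a file to the database. The following sketch reports header lines that do not end in `_number` or `-number` and sequence lines containing anything other than A, C, G or T. The function name and the assumption that headers carry no description after the ID are mine, not mlst's.

```python
import re

# ">locus_12" or ">locus-12"; assumes nothing follows the allele number.
ID_RE = re.compile(r"^>.+[_-]\d+$")


def check_tfa(lines):
    """Return a list of human-readable problems found in a .tfa file."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if not ID_RE.match(line):
                problems.append(f"line {lineno}: bad sequence ID {line!r}")
        elif set(line.upper()) - set("ACGT"):
            problems.append(f"line {lineno}: non-ACGT symbol in sequence")
    return problems


print(check_tfa([">arcC_1", "TTATTAATCC", ">arcC-2", "TTATTAATCC"]))
print(check_tfa([">arcC", "TTATNAATCC"]))
```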

Adding a new scheme

1. Make a new folder in mlst/db/pubmlst/SCHEME
2. Put your SCHEME.txt file in there
3. Put your ALLELE.tfa files in there
4. Run mlst/scripts/mlst-make_blast_db to update the BLAST indices
5. Run mlst --info | grep SCHEME to see if it exists
6. Run mlst --scheme SCHEME file.fasta to see if it works

If it doesn't - go back and check you really did do Step 4 above.
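
Before rebuilding the BLAST indices, the folder layout can be sanity-checked. The sketch below verifies that the scheme file is named after its folder and that every locus in its header has a matching .tfa file. It is an illustrative helper with a hypothetical name, demonstrated on a throwaway temporary folder with made-up loci, and it does not replace running mlst-make_blast_db.

```python
import tempfile
from pathlib import Path

IGNORED_COLUMNS = {"mlst_clade", "clonal_complex", "species", "CC", "Lineage"}


def check_scheme_dir(scheme_dir):
    """Return a list of missing files for one scheme folder."""
    scheme_dir = Path(scheme_dir)
    scheme_file = scheme_dir / f"{scheme_dir.name}.txt"
    if not scheme_file.is_file():
        return [f"missing {scheme_file.name}"]
    with scheme_file.open() as handle:
        header = handle.readline().rstrip("\n").split("\t")
    loci = [name for name in header[1:] if name not in IGNORED_COLUMNS]
    return [
        f"missing {locus}.tfa"
        for locus in loci
        if not (scheme_dir / f"{locus}.tfa").is_file()
    ]


# Demo on a temporary folder: geneB.tfa is deliberately left out.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "myscheme"
    demo.mkdir()
    (demo / "myscheme.txt").write_text("ST\tgeneA\tgeneB\n1\t1\t1\n")
    (demo / "geneA.tfa").write_text(">geneA_1\nACGT\n")
    problems = check_scheme_dir(demo)
print(problems)
```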

Citations

The mlst software incorporates components of the PubMLST database which must be cited in any publications that use mlst:

"This publication made use of the PubMLST website (https://pubmlst.org/) developed by Keith Jolley (Wellcome Open Res. 2018 Sep 24:3:124) and sited at the University of Oxford. The development of that website was funded by the Wellcome Trust".

You should also cite this software (currently unpublished) as:

Seemann T, mlst Github https://github.com/tseemann/mlst

Feedback

Please submit via the Github Issues page

Licence

GPL v2

Author

Torsten Seemann
