Scan contig files against traditional PubMLST typing schemes
% mlst contigs.fa
contigs.fa neisseria 11149 abcZ(672) adk(3) aroE(4) fumC(3) gdh(8) pdhC(4) pgm(6)
% mlst genome.gbk.gz
genome.gbk.gz sepidermidis 184 arcC(16) aroE(1) gtr(2) mutS(1) pyrR(2) tpiA(1) yqiL(1)
% mlst --full new.fa
FILE SCHEME ST STATUS SCORE ALLELES
new.fa mgenitalium - NOVEL 90 MLST_adk(7);MLST_atpA(1);MLST_gmk(1);MLST_gyrB(1);MLST_pgm(3);MLST_ppa(1)
% mlst --label Anthrax GCF_001941925.1_ASM194192v1_genomic.fna.bz2
Anthrax bcereus - glp(24) gmk(1) ilv(~83) pta(1) pur(~71) pyc(37) tpi(41)
% mlst --nopath /opt/data/refseq/S_pyogenes/*.fna
NC_018936.fna spyogenes 28 gki(4) gtr(3) murI(4) mutS(4) recP(4) xpt(2) yqiL(4)
NC_017596.fna spyogenes 11 gki(2) gtr(6) murI(1) mutS(2) recP(2) xpt(2) yqiL(2)
NC_008022.fna spyogenes 55 gki(11) gtr(9) murI(1) mutS(9) recP(2) xpt(3) yqiL(4)
NC_006086.fna spyogenes 382 gki(5) gtr(52) murI(5) mutS(5) recP(5) xpt(4) yqiL(3)
NC_008024.fna spyogenes - gki(5) gtr(11) murI(8) mutS(5) recP(15?) xpt(2) yqiL(1)
NC_017040.fna spyogenes 172 gki(56) gtr(24) murI(39) mutS(7) recP(30) xpt(2) yqiL(33)
% mlst --full --fofn files.txt --csv --outfile mlst.csv
# data saved in 'mlst.csv'
If you are using Conda
% conda install -c conda-forge -c bioconda mlst
% cd $HOME
% git clone https://github.com/tseemann/mlst.git
% $HOME/mlst/bin/mlst --help
Simply give it a genome file in FASTA/GenBank/EMBL format, optionally compressed with gzip, zip or bzip2.
% mlst contigs.fa
contigs.fa neisseria 11149 abcZ(672) adk(3) aroE(4) fumC(3) gdh(8) pdhC(4) pgm(6)
It returns a tab-separated line containing
- the filename
- the matching PubMLST scheme name
- the ST (sequence type)
- the allele IDs
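The line layout above can be split back into its parts with a few lines of Python. This is a minimal sketch, assuming the columns are tab-separated as described and that every allele column has the `locus(call)` shape seen in the examples; `parse_mlst_line` is a hypothetical helper name and not part of mlst.

```python
import re

# Matches one allele column such as "abcZ(672)", "ilv(~83)" or "recP(15?)".
ALLELE_RE = re.compile(r"^(?P<locus>[^()]+)\((?P<call>[^()]*)\)$")


def parse_mlst_line(line):
    """Split one default-format mlst line into its named parts."""
    fields = line.rstrip("\n").split("\t")
    filename, scheme, st = fields[:3]
    alleles = {}
    for column in fields[3:]:
        match = ALLELE_RE.match(column)
        if match is None:
            raise ValueError(f"unexpected allele column: {column!r}")
        # Keep the call as text: it may carry "~", "?" or "," annotations.
        alleles[match["locus"]] = match["call"]
    return {"filename": filename, "scheme": scheme, "st": st, "alleles": alleles}


example = (
    "contigs.fa\tneisseria\t11149\tabcZ(672)\tadk(3)\taroE(4)"
    "\tfumC(3)\tgdh(8)\tpdhC(4)\tpgm(6)"
)
record = parse_mlst_line(example)
print(record["st"], record["alleles"]["abcZ"])  # prints: 11149 672
```

The ST and allele calls are kept as strings on purpose, because `-` and annotated calls are not integers.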
You can give it multiple files at once, and they can be in FASTA/GenBank/EMBL format, and even compressed with gzip, bzip2 or zip.
% mlst genomes/*
genomes/6008.fna saureus 239 arcc(2) aroe(3) glpf(1) gmk_(1) pta_(4) tpi_(4) yqil(3)
genomes/strep.fasta.gz ssuis 1 aroA(1) cpn60(1) dpr(1) gki(1) mutS(1) recA(1) thrA(1)
genomes/NC_002973.gbk lmonocytogenes 1 abcZ(3) bglA(1) cat(1) dapE(1) dat(3) ldh(1) lhkA(3)
genomes/L550.gbk.bz2 leptospira 152 glmU(26) pntA(30) sucA(28) tpiA(35) pfkB(39) mreA(29) caiB(29)
You can force a particular scheme (useful for reporting systems):
% mlst --scheme neisseria NM*
NM003.fa neisseria 4821 abcZ(222) adk(3) aroE(58) fumC(275) gdh(30) pdhC(5) pgm(255)
NM005.gbk neisseria 177 abcZ(7) adk(8) aroE(10) fumC(38) gdh(10) pdhC(1) pgm(20)
NM011.fa neisseria 11 abcZ(2) adk(3) aroE(4) fumC(3) gdh(8) pdhC(4) pgm(6)
NMC.gbk.gz neisseria 8 abcZ(2) adk(3) aroE(7) fumC(2) gdh(8) pdhC(5) pgm(2)
You can make mlst behave like older versions, from before auto-detection existed,
by providing the --legacy parameter together with the --scheme parameter. In that case
it will print a fixed tabular output with a heading containing allele names specific to that scheme:
% mlst --legacy --scheme neisseria *.fa
FILE SCHEME ST abcZ adk aroE fumC gdh pdhC pgm
NM003.fa neisseria 11 2 3 4 3 8 4 6
NM009.fa neisseria 11149 672 3 4 3 8 4 6
MN043.fa neisseria 11 2 3 4 3 8 4 6
NM051.fa neisseria 11 2 3 4 3 8 4 6
NM099.fa neisseria 1287 2 3 4 17 8 4 6
NM110.fa neisseria 11 2 3 4 3 8 4 6
To see which MLST schemes are supported:
% mlst --info | csvtk -t pretty
SCHEME LOCII TYPES ALLELES DATE LOCII_NAMES
-------------- ----- ----- ------- ---------- --------------------------------------------------
mbovis 7 193 154 2025-06-25 adh1 gltX gpsA gyrB pta2 tdk tkt
mhominis_3 11 43 190 2023-11-05 eST uvrA gyrB ftsY tuf gap p120' vaa lmp1 lmp3 p60
mhyopneumoniae 3 255 254 2025-12-14 adk rpoB tpiA
mcanis 7 83 153 2019-10-21 ack cpn60 fdh pta purA sar tuf
mhyorhinis 6 265 148 2025-08-20 dnaA rpoB gyrB gltX adk gmk
mgallisepticum 7 119 249 2025-12-05 atpG dppC DUF3196 lgT mraW plsC ugpA
mflocculare 3 8 22 2018-07-03 adk rpoB tpiA
...
This output is TSV by default but will
honour the --csv option.
The older --list and --longlist options are still
available for backward compatibility.
mlst does not just look for exact matches to full length alleles.
It attempts to tell you as much as possible about what it found using the
notation below:
| Symbol | Meaning | Length | Identity |
|---|---|---|---|
| n | exact intact allele | 100% | 100% |
| ~n | novel full length allele similar to n | 100% | ≥ --minid |
| n? | partial match to known allele | ≥ --mincov | ≥ --minid |
| - | allele missing | < --mincov | < --minid |
| n,m | multiple alleles | | |
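When post-processing results it can help to turn each allele call back into one of the categories in the notation table. The sketch below only inspects the text of the call (the part inside the parentheses); `classify_allele` is a hypothetical helper, and the category names are illustrative labels rather than anything mlst prints.

```python
def classify_allele(call):
    """Map one allele call string to a category from the notation table."""
    if call == "-":
        return "missing"
    if "," in call:
        return "multiple"  # n,m
    if call.startswith("~"):
        return "novel"  # ~n
    if call.endswith("?"):
        return "partial"  # n?
    if call.isdigit():
        return "exact"  # n
    raise ValueError(f"unrecognised allele call: {call!r}")


for call in ["672", "~83", "15?", "-", "2,5"]:
    print(call, classify_allele(call))
```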
Each MLST prediction gets a score out of 100. The score for a scheme with N alleles is as follows:
| Points | For | Example |
|---|---|---|
| +90/N | exact allele match | 42 |
| +63/N | novel allele match (70% of an exact allele) | ~42 |
| +18/N | partial allele match (20% of an exact allele) | 42? |
| 0 | missing allele | - |
| +10 | a matching ST type for the allele combination | 248 |
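As a worked example of the points column, the sketch below adds up a score from counts of each match type. It is an illustration only: `mlst_score` is a hypothetical function, and how mlst itself rounds the result or scores a locus with multiple alleles is not stated here, so neither is modelled.

```python
def mlst_score(n_loci, exact=0, novel=0, partial=0, has_st=False):
    """Score out of 100 for a scheme with n_loci loci, per the table above."""
    if n_loci <= 0:
        raise ValueError("n_loci must be positive")
    if exact + novel + partial > n_loci:
        raise ValueError("more matches than loci")
    # 90/N per exact, 63/N per novel, 18/N per partial; missing loci add 0.
    points = (90 * exact + 63 * novel + 18 * partial) / n_loci
    if has_st:
        points += 10  # bonus for a known ST matching the allele combination
    return points


print(mlst_score(7, exact=7, has_st=True))  # 100.0, a known ST
print(mlst_score(6, exact=6))  # 90.0, a novel combination of known alleles
print(round(mlst_score(7, exact=5, novel=2), 1))  # 82.3
```

The second call reproduces the score of 90 shown for the mgenitalium NOVEL example, where all six alleles are exact but no ST exists for the combination.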
It is possible to filter results using the --minscore option which takes a
value between 1 and 100. If you only want to report known ST types, then use
--minscore 100. To also include novel combinations of existing alleles with
no ST type, use --minscore 90. The default is --minscore 50 which is an
ad hoc value I have found allows for genuine partial ST matches
but eliminates false positives.
There are 3 output formats: default, --full and --legacy.
I recommend using --full mode.
By default they are TSV,
but CSV can be enabled with --csv.
The default format does not have any column headings.
| Column | Description | Example |
|---|---|---|
| 1 | Filename | genome.gbk |
| 2 | Scheme | mgenitalium |
| 3 | Sequence Type | 148 |
| 4 | Allele 1 | adk(7) |
| 5 | Allele 2 | atpA(1) |
| 6 + | Allele 3 ... | ... |
The preferred --full format has 6 columns:
| Column | Description | Example |
|---|---|---|
| FILE | Input filename | genome.gbk |
| SCHEME | Auto-detected scheme | mgenitalium |
| ST | Sequence Type assigned | 148 |
| STATUS | Quality of genotype | NOVEL |
| SCORE | Score of genotype | 90 |
| ALLELES | Identified alleles | adk(7);atpA(1);gmk(1);gyrB(1);pgm(3);ppa(1) |
The STATUS codes are in development. Some of them are stable, but others are subject to change.
| STATUS | Meaning | Stable? |
|---|---|---|
| PERFECT | Exact matches to a known ST | YES |
| NOVEL | Exact matches, but no ST assigned yet | YES |
| NONE | No allele matches whatsoever | YES |
| MIXED | Has at least one mixed allele | YES |
| MISSING | Has at least one missing allele | no |
| BAD | If none of the above & score below 70 | no |
| OK | If none of the above | no |
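Because --full output carries a header row, it can be read with a standard TSV reader and filtered on SCORE or STATUS afterwards. The following is a minimal sketch, assuming tab-separated output with the six column names shown above; `read_full_report` and `keep_min_score` are hypothetical helper names, and the sample row is the mgenitalium example from the synopsis.

```python
import csv
import io


def read_full_report(text):
    """Parse `mlst --full` TSV text into a list of dicts, one per input file."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["SCORE"] = float(row["SCORE"])
        # ALLELES is a single column with ";" between the allele calls.
        row["ALLELES"] = row["ALLELES"].split(";") if row["ALLELES"] else []
        rows.append(row)
    return rows


def keep_min_score(rows, min_score):
    """Keep only rows whose SCORE is at least min_score."""
    return [row for row in rows if row["SCORE"] >= min_score]


report = (
    "FILE\tSCHEME\tST\tSTATUS\tSCORE\tALLELES\n"
    "new.fa\tmgenitalium\t-\tNOVEL\t90\t"
    "MLST_adk(7);MLST_atpA(1);MLST_gmk(1);MLST_gyrB(1);MLST_pgm(3);MLST_ppa(1)\n"
)
rows = read_full_report(report)
for row in keep_min_score(rows, 90):
    print(row["FILE"], row["STATUS"], len(row["ALLELES"]))  # new.fa NOVEL 6
```

Since several STATUS codes are marked as not yet stable, filtering on SCORE is likely the safer choice for scripts that need to keep working across versions.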
The --legacy format has a variable number of columns
per line, depending on how many alleles are
in the scheme found. This makes it hard to
use for mixtures of species, so you should use
--full in that case.
| Column | Description | Example |
|---|---|---|
| FILE | Input filename | genome.gbk |
| SCHEME | From --scheme | mgenitalium |
| ST | Sequence Type | 148 |
| ALLELE_1 | Allele 1 number | 7 |
| ALLELE_2 | Allele 2 number | 1 |
| ALLELE_n | Allele number | integer |
The output is TSV (tab-separated values). This makes it easy to parse and manipulate with Unix utilities like cut and sort etc. For example, if you only want the filename and ST you can do the following:
% mlst --scheme abaumannii AB*.fasta | cut -f1,3 > ST.tsv
If you prefer CSV because it loads more smoothly into MS Excel, use the --csv option:
% mlst --csv Peptobismol.fna.gz > mlst.csv
JSON output is available too; it returns an array of dictionaries, one per
input file. The id will be the same as filename unless --label is
used, but that only works when scanning a single file.
% mlst -q --json out.json test/example.gbk.gz test/novel.fasta.bz2
% cat out.json
[
{
"scheme" : "sepidermidis",
"alleles" : {
"mutS" : "1",
"yqiL" : "1",
"tpiA" : "1",
"pyrR" : "2",
"gtr" : "2",
"aroE" : "1",
"arcC" : "16"
},
"sequence_type" : "184",
"filename" : "test/example.gbk.gz",
"id" : "test/example.gbk.gz"
},
{
"sequence_type" : "-",
"filename" : "test/novel.fasta.bz2",
"scheme" : "spneumoniae",
"alleles" : {
"gki" : "2",
"aroE" : "7",
"ddl" : "22",
"gdh" : "15",
"xpt" : "1",
"recP" : "~10",
"spi" : "6"
},
"id" : "test/novel.fasta.bz2"
}
]
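The JSON structure above is straightforward to consume from a script. This sketch assumes only the keys visible in the example (`id`, `scheme`, `sequence_type`, `alleles`); `summarise` is a hypothetical helper, and the embedded text is the second record from the example.

```python
import json


def summarise(json_text):
    """Return one (id, scheme, sequence_type, allele list) tuple per record."""
    rows = []
    for record in json.loads(json_text):
        alleles = record["alleles"]
        # Sort loci by name so the output does not depend on key order.
        calls = [f"{locus}({alleles[locus]})" for locus in sorted(alleles)]
        rows.append((record["id"], record["scheme"], record["sequence_type"], calls))
    return rows


example = """
[
  {
    "sequence_type" : "-",
    "filename" : "test/novel.fasta.bz2",
    "scheme" : "spneumoniae",
    "alleles" : {
      "gki" : "2", "aroE" : "7", "ddl" : "22", "gdh" : "15",
      "xpt" : "1", "recP" : "~10", "spi" : "6"
    },
    "id" : "test/novel.fasta.bz2"
  }
]
"""
for sample_id, scheme, st, calls in summarise(example):
    print(sample_id, scheme, st, " ".join(calls))
```

Note that the ST and allele numbers are JSON strings, not numbers, so `-` and annotated calls such as `~10` need no special handling.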
You can also save the "novel" alleles for submission to PubMLST:
% mlst -q --novel nouveau.fa s_myces.fasta
% cat nouveau.fa
>streptomyces.recA-e562a2cd93e701e3b58ba0670bcbba0c s_myces.fasta
GACGTGGCCCTCGGCGTCGGCGGTCTGCCGCGCGGCCGCGTCGTCGAGATCTACGGACCGGAGTCCTCC...
The format of the sequence IDs is scheme.allele-hash filename where
hash is the hexadecimal MD5 digest of the allele DNA sequence.
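The ID format can be taken apart, or rebuilt, with the standard library. This is a sketch under stated assumptions: the hash is computed over the sequence string exactly as given (whether mlst normalises case or whitespace first is not stated here), and both function names are hypothetical.

```python
import hashlib


def novel_allele_id(scheme, locus, sequence):
    """Build a scheme.allele-hash ID using the hex MD5 of the DNA string."""
    digest = hashlib.md5(sequence.encode("ascii")).hexdigest()
    return f"{scheme}.{locus}-{digest}"


def parse_novel_header(header):
    """Split a '>scheme.allele-hash filename' line into its four parts."""
    seq_id, _, filename = header.lstrip(">").partition(" ")
    # The hash is the text after the last "-", the scheme is before the first ".".
    scheme_locus, _, digest = seq_id.rpartition("-")
    scheme, _, locus = scheme_locus.partition(".")
    return scheme, locus, digest, filename


header = ">streptomyces.recA-e562a2cd93e701e3b58ba0670bcbba0c s_myces.fasta"
print(parse_novel_header(header))
print(novel_allele_id("demo", "recA", "ACGT"))  # "demo" and "ACGT" are made up
```

Keying on the hash makes it easy to spot the same novel allele turning up in several genomes.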
Included is a file called db/scheme_species_map.tab which has 3
tab-separated columns as follows:
#SCHEME GENUS SPECIES
abaumannii Acinetobacter baumannii
abaumannii_2 Acinetobacter baumannii
achromobacter Achromobacter
aeromonas Aeromonas
afumigatus Aspergillus afumigatus
arcobacter Arcobacter
bburgdorferi Borrelia burgdorferi
bhampsonii Brachyspira hampsonii
bhenselae Bartonella henselae
borrelia Borrelia
bpilosicoli Brachyspira pilosicoli
<snip>
Note that some schemes are species specific, while others are genus
specific, in which case the SPECIES column is empty. Note also that the same
species/genus can apply to multiple schemes; see abaumannii above.
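A mapping like this is often wanted in the other direction, from organism to candidate schemes. The sketch below groups scheme names by (genus, species), treating a missing SPECIES field as an empty string; `load_scheme_map` is a hypothetical helper, and the sample lines are taken from the listing above.

```python
from collections import defaultdict


def load_scheme_map(lines):
    """Group scheme names by (genus, species) from tab-separated lines."""
    by_taxon = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the #SCHEME header
        fields = line.split("\t")
        fields += [""] * (3 - len(fields))  # genus-level rows have no species
        scheme, genus, species = fields[:3]
        by_taxon[(genus, species)].append(scheme)
    return dict(by_taxon)


sample = [
    "#SCHEME\tGENUS\tSPECIES",
    "abaumannii\tAcinetobacter\tbaumannii",
    "abaumannii_2\tAcinetobacter\tbaumannii",
    "borrelia\tBorrelia",
    "bburgdorferi\tBorrelia\tburgdorferi",
]
schemes = load_scheme_map(sample)
print(schemes[("Acinetobacter", "baumannii")])  # ['abaumannii', 'abaumannii_2']
print(schemes[("Borrelia", "")])  # ['borrelia']
```

In practice the lines would come from opening db/scheme_species_map.tab rather than from an inline list.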
The mlst software no longer provides a script
to update the database. This is because PubMLST
now requires a user account and a private key
to access data through the PubMLST API.
You can use the mlstdb tool to help you do this.
If you do download a new database, make
sure it's in /path/to/mlst/db/pubmlst
and run scripts/mlst-make_blast_db before
attempting to run mlst.
If you want to add a custom private scheme
to mlst, you can do so by setting up its folder as described below.
Each MLST scheme exists in a folder within the mlst/db/pubmlst folder.
The name of the folder is the scheme name, say saureus for
Staphylococcus aureus. It contains files like this:
% cd mlst/db/pubmlst/saureus
% ls -1
saureus.txt
arcC.tfa
aroE.tfa
glpF.tfa
gmk.tfa
pta.tfa
tpi.tfa
yqiL.tfa
The folder name (i.e. saureus) must be the same as the name
of the scheme file (i.e. saureus.txt) or it will not work.
The saureus.txt is a tab-separated file containing one ST definition
per row. The header line must be present. Extra columns with names
mlst_clade,clonal_complex,species,CC,Lineage are ignored.
% head -n 5 saureus.txt
ST arcC aroE glpF gmk pta tpi yqiL clonal_complex
1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 26
3 1 1 1 9 1 1 12
4 10 10 8 6 10 3 2
Each of the .tfa files is a nucleotide FASTA file with the allele
sequences for one locus. There must be a .tfa file for each and every
locus in the TSV scheme .txt file. Here is what the arcC.tfa
file looks like:
% head -n 20 arcC.tfa
>arcC_1
TTATTAATCCAACAAGCTAAATCGAACAGTGACACAACGCCGGCAATGCCATTGGATACT
TGTGGTGCAATGTCACAGGGTATGATAGGCTATTGGTTGGAAACTGAAATCAATCGCATT
TTAACTGAAATGAATAGTGATAGAACTGTAGGCACAATCGTTACACGTGTGGAAGTAGAT
AAAGATGATCCACGATTCAATAACCCAACCAAACCAATTGGTCCTTTTTATACGAAAGAA
GAAGTTGAAGAATTACAAAAAGAACAGCCAGACTCAGTCTTTAAAGAAGATGCAGGACGT
GGTTATAGAAAAGTAGTTGCGTCACCACTACCTCAATCTATACTAGAACACCAGTTAATT
CGAACTTTAGCAGACGGTAAAAATATTGTCATTGCATGCGGTGGTGGCGGTATTCCAGTT
ATAAAAAAAGAAAATACCTATGAAGGTGTTGAAGCG
>arcC_2
TTATTAATCCAACAAGCTAAATCGAACAGTGACACAACGCCGGCAATGCCATTGGATACT
TGTGGTGCAATGTCACAAGGTATGATAGGCTATTGGTTGGAAACTGAAATCAATCGCATT
TTAACTGAAATGAATAGTGATAGAACTGTAGGCACAATCGTAACACGTGTGGAAGTAGAT
AAAGATGATCCACGATTTGATAACCCAACTAAACCAATTGGTCCTTTTTATACGAAAGAA
GAAGTTGAAGAATTACAAAAAGAACAGCCAGGCTCAGTCTTTAAAGAAGATGCAGGACGT
GGTTATAGAAAAGTAGTTGCGTCACCACTACCTCAATCTATACTAGAACACCAGTTAATT
CGAACTTTAGCAGACGGTAAAAATATTGTCATTGCATGCGGTGGTGGCGGTATTCCAGTT
ATAAAAAAAGAAAATACCTATGAAGGTGTTGAAGCG
The FASTA sequence IDs must be named as >allele_number or
>allele-number. Ideally the sequences will not contain any
ambiguous IUPAC symbols. i.e. just A,T,C,G.
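The layout rules above are easy to get slightly wrong, so a small pre-flight check can save a debugging round. The sketch below is not part of mlst; `check_scheme_dir` is a hypothetical helper that assumes the first column of the profile file is the ST and that every other non-ignored header column names a locus.

```python
import re
from pathlib import Path

# Extra profile columns that the text above says are ignored.
IGNORED_COLUMNS = {"mlst_clade", "clonal_complex", "species", "CC", "Lineage"}


def check_scheme_dir(folder):
    """Return a list of problems found in one scheme folder (empty if none)."""
    folder = Path(folder)
    profile = folder / f"{folder.name}.txt"
    if not profile.is_file():
        return [f"{profile.name} not found; it must match the folder name"]
    lines = profile.read_text().splitlines()
    if not lines:
        return [f"{profile.name} has no header line"]
    loci = [c for c in lines[0].split("\t")[1:] if c not in IGNORED_COLUMNS]
    problems = []
    for locus in loci:
        tfa = folder / f"{locus}.tfa"
        if not tfa.is_file():
            problems.append(f"{tfa.name} is missing")
            continue
        # IDs must look like >locus_number or >locus-number.
        id_pattern = re.compile(rf"{re.escape(locus)}[_-]\d+")
        for line in tfa.read_text().splitlines():
            if line.startswith(">"):
                parts = line[1:].split()
                seq_id = parts[0] if parts else ""
                if not id_pattern.fullmatch(seq_id):
                    problems.append(f"{tfa.name}: bad sequence ID {seq_id!r}")
            elif set(line.strip().upper()) - set("ACGT"):
                # Only discouraged, not forbidden, so treat this as a warning.
                problems.append(f"{tfa.name}: ambiguous symbols in {line.strip()!r}")
    return problems
```

Calling `check_scheme_dir("mlst/db/pubmlst/saureus")` would return an empty list for a well-formed folder; anything else in the list is worth fixing before rebuilding the BLAST indices.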
1. Make a new folder in mlst/db/pubmlst/SCHEME
2. Put your SCHEME.txt file in there
3. Put your ALLELE.tfa files in there
4. Run mlst/scripts/mlst-make_blast_db to update the BLAST indices
5. Run mlst --info | grep SCHEME to see if it exists
6. Run mlst --scheme SCHEME file.fasta to see if it works
If it doesn't - go back and check you really did do Step 4 above.
The mlst software incorporates components of the
PubMLST database
which must be cited in any publications that use mlst:
"This publication made use of the PubMLST website (https://pubmlst.org/) developed by Keith Jolley Wellcome Open Res. 2018 Sep 24:3:124 and sited at the University of Oxford. The development of that website was funded by the Wellcome Trust".
You should also cite this software (currently unpublished) as:
- Seemann T, mlst, GitHub https://github.com/tseemann/mlst
Please submit bug reports via the GitHub Issues page.