Releases: Ensembl/plant-scripts
20250904
Main changes in GET_PANGENES:
06032025: get_pangenes.pl: sort & concat alignment results using tempfile with filenames to sort to avoid "Argument list too long"
24032025: BED matrix produced by _cluster_analysis.pl is 0-based
25032025: match_cluster.pl: added -i to control sequence identity of matches
25032025: match_cluster.pl: added -F to produce a FASTA file with sequence index that can be exported as a gene-based pangenome for mapping, with <global pangenome positions> estimated from the reference genome
25032025: updated Makefiles and documentation
08042025: match_cluster.pl TSV output updated, tested with barley
08042025: add pangenome coords example to documentation
14052025: added POCS to troubleshooting to explain small cores
19052025: check_quality.pl no longer assumes GFF files are available
27052025: _cluster_analysis.pl -t now affects pangene set growth simulation
Plus changes to phylogenomics scripts described in #16
Finally, tag format was changed to 1.3 for conda compatibility
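As a reminder of what the 0-based BED convention noted above (24032025) implies, here is a minimal sketch of converting a 1-based, closed GFF interval into a 0-based, half-open BED interval. The function name and the example gene are hypothetical and are not taken from _cluster_analysis.pl.

```python
def gff_to_bed(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based, closed GFF interval to a 0-based, half-open BED interval.

    Hypothetical helper for illustration only; not part of the plant-scripts code.
    """
    if start < 1 or end < start:
        raise ValueError(f"invalid GFF interval: {start}-{end}")
    # BED start is shifted down by one; BED end is exclusive, so it keeps the GFF value.
    return start - 1, end


# Made-up gene spanning GFF positions 1001..2000 on chr1
bed_start, bed_end = gff_to_bed(1001, 2000)
print(f"chr1\t{bed_start}\t{bed_end}\tgeneA")
```

The feature length is unchanged by the conversion: end - start + 1 in GFF equals bed_end - bed_start in BED.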
20250123
04102024
This release ships with get_pangenes.pl version 04102024.
Main changes are:
25092024: added section 'Example 6: estimation of haplotype diversity'
03102024: get_pangenes.pl expects min 95% sequence identity for WGA-based gene alignments, as in GET_HOMOLOGUES-EST, to help avoid diverged tandem copies
04102024: get_pangenes.pl now sets MAXDISTNEIGHBORS=2, so neighbor genes in a cluster cannot be more than 2 genes away
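One way to read the MAXDISTNEIGHBORS=2 rule is sketched below, assuming genes on the same chromosome are numbered by their order along it. The function and the ordinal numbering are assumptions made for illustration; this is not the actual get_pangenes.pl logic.

```python
MAXDISTNEIGHBORS = 2  # value quoted in the release note


def are_neighbors(pos_a: int, pos_b: int, max_dist: int = MAXDISTNEIGHBORS) -> bool:
    """Return True if two genes are at most max_dist genes apart.

    pos_a and pos_b are assumed to be ordinal gene positions on the same
    chromosome (1st gene, 2nd gene, ...), not base-pair coordinates.
    """
    return abs(pos_a - pos_b) <= max_dist


print(are_neighbors(10, 12))  # two genes apart: within the limit
print(are_neighbors(10, 13))  # three genes apart: beyond the limit
```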
11012024
This release ships with updates to GET_PANGENES: code changes since the publication of the manuscript, involving:
- fixed bug in handling of minus (-) strand coords in sub query2ref_coords
- sub _parseCIGARfeature correctly handles 1bp CS-type SNPs when computing overlap with optional query coord
- tested rename_pangenes.pl with MAGIC16 rice dataset, check AgBioData nomenclature rules at https://github.com/Ensembl/plant-scripts/blob/df9cfdef5e49e6f463a08e7ed8ec8a04556735ff/pangenes/rename_pangenes.pl#L5C48-L5C57 ; code to update a previous cluster set not yet in place
15112023
This release ships with updates to:
- GET_PANGENES: code and documentation changes since the publication of the manuscript, involving improved handling of input GFF files and calculation of overlap coordinates from WGA segments in different strands.
- REST-based recipes.
pangenes_benchmark
Pangene sets of Arabidopsis (ACK), rice, wheat and barley datasets produced while benchmarking get_pangenes as described at https://doi.org/10.1186/s13059-023-03071-z and https://www.biorxiv.org/content/10.1101/2023.01.03.520531v2
The HOWTO* files contain the actual commands required to produce these results with the input FASTA & GFF files (32GB), which should first be downloaded from
test_rice
Toy dataset to test the scripts for pan-gene analysis.
nrTEplants
Release 0.3 (Jun2020) of the nrTEplants library of plant transposable elements, which minimizes overlap with sequences containing protein domains known to be part of NLR genes. This sequence set was computed by combining TREP, SINEbase, REdat, RepetDB, EDTArice, EDTAmaize, SoyBaseTE, TAIR10TE, SunflowerTE, MelonTE, RosaTE and SUNREP and obtaining a non-redundant collection with GET_HOMOLOGUES-EST.
Check the code and documentation at https://github.com/Ensembl/plant_tools/tree/master/bench/repeat_libs
Citation: Contreras-Moreira,B., Filippi,C.V., Naamati,G., Girón,C.G., Allen,J.E. and Flicek,P. (2021) Efficient masking of plant genomes by combining kmer counting and curated repeats Genomics. Plant Genome https://doi.org/10.1002/tpg2.20143
23102020
This release was created to obtain a DOI from Zenodo