-
Notifications
You must be signed in to change notification settings - Fork 4
11. Genome annotation
Make sure to have MinPath installed.
Add Minpath to .bashrc
PATH=$PATH:"/home/usr/MinPath"
Before running type:
export MinPath=/home/rprops/MinPath
Format .EC file from IMG annotation and run MinPath.
awk -F "\t" '{print $1 "\t" $3}' 121832.assembled.EC > for_MinPath.ec
sed -i "s/EC://g" for_MinPath.ec
MinPath1.2.py -any for_MinPath.ec -map /home/rprops/MinPath/data/ec2path -report report.metacyc.minpath
Ouput has following format:
[rprops@nyx7010 IMG_annotation_co-assembly]$ head P121832.assembled.metacyc.minpath
path 1 any n/a naive 1 minpath 1 fam0 5 fam-found 5 name PWY0-1479
path 2 any n/a naive 1 minpath 1 fam0 39 fam-found 23 name PWY-6146
path 3 any n/a naive 1 minpath 1 fam0 4 fam-found 2 name ASPASN-PWY
path 4 any n/a naive 1 minpath 1 fam0 8 fam-found 8 name BRANCHED-CHAIN-AA-SYN-PWY
path 5 any n/a naive 1 minpath 1 fam0 17 fam-found 16 name COMPLETE-ARO-PWY
path 6 any n/a naive 1 minpath 1 fam0 18 fam-found 15 name P4-PWY
path 7 any n/a naive 1 minpath 1 fam0 4 fam-found 3 name PWY-3481
path 8 any n/a naive 1 minpath 1 fam0 2 fam-found 1 name PWY-5905
path 10 any n/a naive 1 minpath 1 fam0 14 fam-found 10 name PWY-724
path 11 any n/a naive 1 minpath 1 fam0 9 fam-found 8 name PWY-821
For visualization of pathways in iPath:
for i in $(awk '{print $14}' P121832.assembled.metacyc.minpath.kegg); do curl -S http://rest.kegg.jp/find/pathway/$i; done
awk '{gsub("^map","",$14);print $14" #ff0000"}' P121832.assembled.metacyc.minpath.kegg > P121832.assembled.metacyc.minpath.kegg.ipath
Upload this file to iPath.
For visualization with Krona:
cd
git clone https://github.com/EnvGen/metagenomics-workshop.git
cd-
Make sure to adjust paths to your own.
genes.to.kronaTable.py -i for_MinPath.ec -m /home/rprops/metagenomics-workshop/reference_db/metacyc/ec.to.pwy -H /home/rprops/metagenomics-workshop/reference_db/metacyc/pwy.hierarchy -n P121832.assembled -l <(grep "minpath 1" P121832.assembled.metacyc.minpath) -o P121832.assembled.krona.metacyc.minpath.tab
ktImportText -o P121832.assembled.krona.metacyc.minpath.html P121832.assembled.krona.metacyc.minpath.tab
genes.to.kronaTable.py -i for_MinPath.ec -m /home/rprops/metagenomics-workshop/reference_db/kegg/ec.to.pwy -H /home/rprops/metagenomics-workshop/reference_db/kegg/pwy.hierarchy -n P121832.assembled -l <(grep "minpath 1" P121832.assembled.metacyc.minpath) -o P121832.assembled.krona.kegg.minpath.tab
ktImportText -o P121832.assembled.krona.kegg.minpath.html P121832.assembled.krona.kegg.minpath.tab
Use the KO annotation file (*.ko.tab.txt):
awk -F "\t" '{print $10}' 2724679690.ko.tab.txt | sed "s/KO://g" | awk '{print $0, "#ff0000"}' > 2724679690.ko.tab.txt_IPATH
Upload this file to iPath.
For STAMP analysis get the profile files from here: https://img.jgi.doe.gov/cgi-bin/mer/main.cgi?section=AbundanceProfiles&page=mergedForm
After installation on Windows make sure to move the data directory to the right place (this is a known unresolved issue.
Then format the output with this RScript for compatibility with STAMP:
#!/usr/bin/env Rscript
userprefs <- commandArgs(trailingOnly = TRUE)
file_name <- userprefs[1]
file_output <- userprefs[2]
return_first <- function(x){
y <- x[1]
return(y)
}
df <- read.delim(file_name)
dim(df)
# temp files
tmp1 <- do.call(rbind,by(df[, 3:ncol(df)], INDICES = factor(df$Func_name),
FUN = colSums))
tmp2 <- aggregate(Func_id~Func_name, data = df, FUN = return_first)
# merge
output <- cbind(tmp2, tmp1)
output <- output[,c(2,1,3:ncol(output))]
# write
write.table(output, file = file_output, sep = "\t", row.names = FALSE, quote=FALSE)Usage:
format_STAMP.R input_name desired_outputname
Install easyGgplot2 and run the following R script on one or multiple gff files you get from IMG. This will generate a figure showing the distribution across all genes per genome. This also generate a tab separated file matching sequence ID and GC% (seqid_GC.tsv) for mapping GC content of a gene to its function (see below)
Rscript gc_plot.R 121950.assembled.gff 121951.assembled.gff
#!/usr/bin/env Rscript
library(easyGgplot2)
userprefs <- commandArgs(trailingOnly = TRUE)
print(userprefs)
for(i in userprefs){
d1 <- read.table(i, as.is = TRUE, sep = "\t")
d2 <- read.table(text = d1[[9]], sep = ";")
d <- cbind(d1[,1], d2[,c(1,3)])
colnames(d) <- c("contig", "contig_geneID", "GC")
# Remove unwanted characters
d$GC <- gsub(d$GC, pattern = "gc_cont=", replacement = "")
d$contig_geneID <- gsub(d$contig_geneID, pattern = "ID=", replacement = "")
d$contig_geneID <- gsub(d$contig_geneID, pattern = ".", replacement = "",
fixed = TRUE)
d$GC <- as.numeric(d$GC); d <- d[!is.na(d$GC),]
d <- data.frame(d, Genome = rep(i, nrow(d)))
if(i!=userprefs[1]) d.tot <- rbind(d.tot, d) else d.tot <- d
}
# Export data
write.table(d.tot, "seqid_GC.tsv", row.names = FALSE,
quote = FALSE)
# Plot
png("GC_inGenes.png", res = 500, height = 5, width = 6, units = "in")
ggplot2.histogram(data = d.tot, xName = 'GC',
groupName = 'Genome', alpha = 0.5,
legendPosition = "top", binwidth = 0.01)
dev.off()Generate gene_oid to seq_id file (2724679690.genes.faa file is part of IMG output):
grep ">" 2724679690.genes.faa | sed "s/>//g" | awk '{print $1, $2}' > gene_oid_2_seq_id.txt
Then run the following R script. This uses the seqid_GC.tsv and gene_oid_2_seq_id.txt files from the previous scripts as well as the function file generated by IMG (e.g. 2724679690.cog.tab.txt or 2724679690.pfam.tab.txt). The GC cutoff is the final positional argument (e.g. 0.75). The output is a file matching GC content of the gene and its inferred function: GC_function.tsv.
Example usage:
RScript gc2function.R seqid_GC.tsv gene_oid_2_seq_id.txt 2724679690.cog.tab.txt 0.75
#!/usr/bin/env Rscript
userprefs <- commandArgs(trailingOnly = TRUE)
print(userprefs)
library("dplyr")
# Import data
seqid_GC <- read.table(userprefs[1],
header = TRUE, stringsAsFactors = FALSE, sep = " ")
gene_oid_2_seq_id <- read.table(userprefs[2],
stringsAsFactors = TRUE, col.names = c("gene_oid", "seq_id"),
colClasses = "character", sep = " ")
function_file <- read.table(userprefs[3],
stringsAsFactors = FALSE, sep = "\t", header = TRUE,
quote=NULL, comment='')
function_file$gene_oid <- as.character(function_file$gene_oid)
gc_thresh <- userprefs[4]
# Merge files
merged_df <- dplyr::inner_join(seqid_GC, gene_oid_2_seq_id, by = c("contig_geneID" = "seq_id"))
merged_df <- dplyr::inner_join(merged_df, function_file, by = "gene_oid")
tot_genes <- nrow(merged_df)
# Filter out based on threshold
merged_df <- merged_df[merged_df$GC>gc_thresh,]
# Rank from large to small
merged_df <- merged_df[order(merged_df$GC), ]
# Print brief summary
cat(date(), ' --- There are', nrow(merged_df), "genes with >", gc_thresh, "%\n" ,sep = " ")
cat(date(), ' --- This is', round(100*nrow(merged_df)/tot_genes,2), "% of all genes",sep = " ")
cat(date(), ' --- The 20 genes with the highest GC% are:\n', sep = " ")
output <- tail(data.frame(function_id = merged_df[, ncol(merged_df)-2], function_name = merged_df[, ncol(merged_df)-1],
GC = 100*merged_df$GC), n = 25)
print(output)
# Write output table
write.table(merged_df, "GC_function.tsv",
row.names = FALSE, quote = FALSE)