Skip to content

11. Genome annotation

rprops edited this page May 3, 2017 · 1 revision

Analysing Metagenome annotation by IMG

Make sure to have MinPath installed.

Add Minpath to .bashrc

PATH=$PATH:"/home/usr/MinPath"

Before running type:

export MinPath=/home/rprops/MinPath

Format .EC file from IMG annotation and run MinPath.

awk -F "\t" '{print $1 "\t" $3}' 121832.assembled.EC > for_MinPath.ec
sed -i "s/EC://g" for_MinPath.ec
MinPath1.2.py -any for_MinPath.ec -map /home/rprops/MinPath/data/ec2path -report report.metacyc.minpath

Ouput has following format:

[rprops@nyx7010 IMG_annotation_co-assembly]$ head P121832.assembled.metacyc.minpath
path 1 any n/a  naive 1  minpath 1  fam0  5  fam-found  5  name  PWY0-1479
path 2 any n/a  naive 1  minpath 1  fam0  39  fam-found  23  name  PWY-6146
path 3 any n/a  naive 1  minpath 1  fam0  4  fam-found  2  name  ASPASN-PWY
path 4 any n/a  naive 1  minpath 1  fam0  8  fam-found  8  name  BRANCHED-CHAIN-AA-SYN-PWY
path 5 any n/a  naive 1  minpath 1  fam0  17  fam-found  16  name  COMPLETE-ARO-PWY
path 6 any n/a  naive 1  minpath 1  fam0  18  fam-found  15  name  P4-PWY
path 7 any n/a  naive 1  minpath 1  fam0  4  fam-found  3  name  PWY-3481
path 8 any n/a  naive 1  minpath 1  fam0  2  fam-found  1  name  PWY-5905
path 10 any n/a  naive 1  minpath 1  fam0  14  fam-found  10  name  PWY-724
path 11 any n/a  naive 1  minpath 1  fam0  9  fam-found  8  name  PWY-821

For visualization of pathways in iPath:

for i in $(awk '{print $14}' P121832.assembled.metacyc.minpath.kegg); do curl -S http://rest.kegg.jp/find/pathway/$i; done
awk '{gsub("^map","",$14);print $14" #ff0000"}' P121832.assembled.metacyc.minpath.kegg > P121832.assembled.metacyc.minpath.kegg.ipath

Upload this file to iPath.

For visualization with Krona:

cd
git clone https://github.com/EnvGen/metagenomics-workshop.git
cd-

Make sure to adjust paths to your own.

genes.to.kronaTable.py -i for_MinPath.ec -m /home/rprops/metagenomics-workshop/reference_db/metacyc/ec.to.pwy -H /home/rprops/metagenomics-workshop/reference_db/metacyc/pwy.hierarchy -n P121832.assembled -l <(grep "minpath 1" P121832.assembled.metacyc.minpath) -o P121832.assembled.krona.metacyc.minpath.tab
ktImportText -o P121832.assembled.krona.metacyc.minpath.html P121832.assembled.krona.metacyc.minpath.tab

genes.to.kronaTable.py -i for_MinPath.ec -m /home/rprops/metagenomics-workshop/reference_db/kegg/ec.to.pwy -H /home/rprops/metagenomics-workshop/reference_db/kegg/pwy.hierarchy -n P121832.assembled -l <(grep "minpath 1" P121832.assembled.metacyc.minpath) -o P121832.assembled.krona.kegg.minpath.tab
ktImportText -o P121832.assembled.krona.kegg.minpath.html P121832.assembled.krona.kegg.minpath.tab

Analysing IMG output for the Genome from metagenome annotation

Use the KO annotation file (*.ko.tab.txt):

awk -F "\t" '{print $10}' 2724679690.ko.tab.txt | sed "s/KO://g" | awk '{print $0, "#ff0000"}' > 2724679690.ko.tab.txt_IPATH

Upload this file to iPath.

For STAMP analysis get the profile files from here: https://img.jgi.doe.gov/cgi-bin/mer/main.cgi?section=AbundanceProfiles&page=mergedForm

After installation on Windows make sure to move the data directory to the right place (this is a known unresolved issue.

Then format the output with this RScript for compatibility with STAMP:

#!/usr/bin/env Rscript
userprefs <- commandArgs(trailingOnly = TRUE)
file_name <- userprefs[1]
file_output <- userprefs[2]

return_first <- function(x){
  y <- x[1]
  return(y)
}

df <- read.delim(file_name)
dim(df)

# temp files
tmp1 <- do.call(rbind,by(df[, 3:ncol(df)], INDICES = factor(df$Func_name), 
                 FUN = colSums))

tmp2 <- aggregate(Func_id~Func_name, data = df, FUN = return_first)

# merge
output <- cbind(tmp2, tmp1)
output <- output[,c(2,1,3:ncol(output))]

# write
write.table(output, file = file_output, sep = "\t", row.names = FALSE, quote=FALSE)

Usage:

format_STAMP.R input_name desired_outputname

2. Visualizing GC content in genes (CDS)

Install easyGgplot2 and run the following R script on one or multiple gff files you get from IMG. This will generate a figure showing the distribution across all genes per genome. This also generate a tab separated file matching sequence ID and GC% (seqid_GC.tsv) for mapping GC content of a gene to its function (see below)

Rscript gc_plot.R 121950.assembled.gff 121951.assembled.gff
#!/usr/bin/env Rscript
library(easyGgplot2)

userprefs <- commandArgs(trailingOnly = TRUE)
print(userprefs)

for(i in userprefs){
  d1 <- read.table(i, as.is = TRUE, sep = "\t")
  d2 <- read.table(text = d1[[9]], sep = ";")
  d <- cbind(d1[,1], d2[,c(1,3)])
  colnames(d) <- c("contig", "contig_geneID", "GC")
  
  # Remove unwanted characters
  d$GC <- gsub(d$GC, pattern = "gc_cont=", replacement = "")
  d$contig_geneID <- gsub(d$contig_geneID, pattern = "ID=", replacement = "")
  d$contig_geneID <- gsub(d$contig_geneID, pattern = ".", replacement = "",
                          fixed = TRUE)
  d$GC <- as.numeric(d$GC); d <- d[!is.na(d$GC),]
  d <- data.frame(d, Genome = rep(i, nrow(d)))
  
  if(i!=userprefs[1]) d.tot <- rbind(d.tot, d) else d.tot <- d
}

# Export data
write.table(d.tot, "seqid_GC.tsv", row.names = FALSE,
quote = FALSE)

# Plot
png("GC_inGenes.png", res = 500, height = 5, width = 6, units = "in")
ggplot2.histogram(data = d.tot, xName = 'GC',
                  groupName = 'Genome', alpha = 0.5,
                  legendPosition = "top", binwidth = 0.01)
dev.off()

3. Link GC content to annotation

Generate gene_oid to seq_id file (2724679690.genes.faa file is part of IMG output):

grep ">" 2724679690.genes.faa | sed "s/>//g" | awk '{print $1, $2}' > gene_oid_2_seq_id.txt

Then run the following R script. This uses the seqid_GC.tsv and gene_oid_2_seq_id.txt files from the previous scripts as well as the function file generated by IMG (e.g. 2724679690.cog.tab.txt or 2724679690.pfam.tab.txt). The GC cutoff is the final positional argument (e.g. 0.75). The output is a file matching GC content of the gene and its inferred function: GC_function.tsv.

Example usage:

RScript gc2function.R  seqid_GC.tsv gene_oid_2_seq_id.txt 2724679690.cog.tab.txt 0.75
#!/usr/bin/env Rscript
userprefs <- commandArgs(trailingOnly = TRUE)
print(userprefs)

library("dplyr")

# Import data

seqid_GC <- read.table(userprefs[1],
                       header = TRUE, stringsAsFactors = FALSE, sep = " ")
gene_oid_2_seq_id <- read.table(userprefs[2],
                                stringsAsFactors = TRUE, col.names = c("gene_oid", "seq_id"),
                                colClasses = "character", sep = " ")
function_file <- read.table(userprefs[3],
                            stringsAsFactors = FALSE,  sep = "\t", header = TRUE,
                            quote=NULL, comment='')
function_file$gene_oid <- as.character(function_file$gene_oid)
gc_thresh <- userprefs[4]

# Merge files
merged_df <- dplyr::inner_join(seqid_GC, gene_oid_2_seq_id, by = c("contig_geneID" = "seq_id"))
merged_df <- dplyr::inner_join(merged_df, function_file, by = "gene_oid")
tot_genes <- nrow(merged_df)
  
# Filter out based on threshold
merged_df <- merged_df[merged_df$GC>gc_thresh,]
  
# Rank from large to small
merged_df <- merged_df[order(merged_df$GC), ]

# Print brief summary
cat(date(), ' --- There are', nrow(merged_df), "genes with >", gc_thresh, "%\n" ,sep = " ")
cat(date(), ' --- This is', round(100*nrow(merged_df)/tot_genes,2), "% of all genes",sep = " ")
cat(date(), ' --- The 20 genes with the highest GC% are:\n', sep = " ")
output <- tail(data.frame(function_id = merged_df[, ncol(merged_df)-2], function_name = merged_df[, ncol(merged_df)-1], 
                          GC = 100*merged_df$GC), n = 25)
print(output)

# Write output table
write.table(merged_df, "GC_function.tsv",
            row.names = FALSE, quote = FALSE)

Clone this wiki locally