Question

Help needed interpreting/integrating WebMGA KOG results onto transcriptome (FASTA)

1

Entering edit mode

8.1 years ago

moldach ▴ 130

Dear colleagues,

I've done an extensive search on Google as well as the forums on Seqanswers and Biostars and don't believe this question has been covered (in this depth) before.

I've used WebMGA (http://weizhong-lab.ucsd.edu/metagenomic-analysis/server/kog/) for functional annotation (KOG) of my transcriptome (FASTA) file. However, I need some help interpreting/integrating these results.

Briefly, I did a de novo assembly and annotated it using dammit. I used this as input for WebMGA and it outputs a number of files. With the help of a previous Biostars post (which links to this stackoverflow post) I was able to use one of the output files (output.2.class: counts by class) to make a histogram of KOG classes from my assembly.

However, I would also like to append the KOG results (i think from output.2: long table of rpsblast hits?) onto my transcriptome. So in addition to gene names that were extracted from UniRef90 (and appended) to my transcriptome from dammit I want KOG information as well.

I understand I will likely need to write a custom script to this; however, I have very little scripting experience. For example, I was fairly easily able to make changes to the stackoverflow R code because I've done most of my programming using R. There must be a simple way to do this on the linux command line? If someone has experience attaching functional annotations (KOG or even GO) to assemblies I would really appreciate your support and tutelage. Similarly, if a code exists for a similar (but not identical task) I would be willing to try and cannibalizing/re-purpose the script.

Regards,

A frustrated but motivated student

P.S. To show you that I have been thinking about how to go about this let me describe what I want (albeit in pseudo-code)

Script should:

1) Look at FASTA file and KOG output (in the first column $1) and when sequences match

2) extract subset of data from KOG output (description column $11) with awk and append it to the end of that header in the FASTA file

RNA-Seq Assembly Annotation KOG • 2.6k views

ADD COMMENT • link updated 8.1 years ago by Philipp Bayer 8.8k • written 8.1 years ago by moldach ▴ 130

score 1 · Answer 1 · 2017-06-12

1

Entering edit mode

8.1 years ago

Philipp Bayer 8.8k

Hmm I'd do this using Biopython because it's easier.

The following assumes that the kog file fits into memory, is tab-delimited, and has the gene name in column 1. It's also got bad argumentparsing, argparse would be the way to go. It's untested.

from Bio import SeqIO
import sys

kogs_dict = {} # key: gene name, value: kog result
try:
  kog_fh = open(sys.argv[1])
except IndexError:
  sys.stderr.write('Usage: name_of_script kog_file fasta_file\n')
  sys.exit(0)

fasta_f = SeqIO.parse(sys.argv[2], 'fasta') # won't crash if file doesn't exist, not nice

for line in kog_fh:
  linelist = line.rstrip().split('\t')
  gene_name = linelist[0] # ASSUMPTION: gene is in first column
  kog_result = linelist[10] # counts from zero so column 11 has index 10
  assert gene_name not in kogs_dict # sanity check: we should have only one kog result per gene, I guess?
  kogs_dict[gene_name] = kog_result

for s in fasta_f:
  # we want to add the KOG information to the sequence's description, which is everything after the space in the > header
  # this assumes that the gene's ID (everything before the space) is in the kogs output, not the entire name
  if s.id in kogs_dict:
    kog = kogs_dict[s.id]
  else:
    kog = 'NA' 
  s.description = '%s KOG=%s'%(s.description, kog)
  print(s.format('fasta'), end='') # python 3
  # if python2: print(s.format('fasta')), with a comma at the end, else you'll get an extra linebreak

This code just prints to standard out, feel free to reroute somewhere

ADD COMMENT • link 8.1 years ago by Philipp Bayer 8.8k

0

Entering edit mode

Thank you very much for the quick response Philipp!

Unfortunately, to date my experience has been in the shell and R programming. So I'm currently unfamiliar with Python. Although shell is very easy and quick, once the pattern gets complex it appears you should use Python rather than bashing your head against the wall to come up with a solution using pure shell.

Biopython looks like an amazing resource for bioinformatics, so thank you for the suggestion - I will be taking this opportunity to learn it.

I tried to get my feet-wet by opening the python interpreter on AWS EC2 Ubuntu 14 and running this script (both interactively and as a .py script). There were no errors (I'm using Python2) but the output didn't have the KOG terms attached to the FASTA.

>Transcript_122 len=312
GGGGATATAATTATTTATTGATCTATATGAGCATTTTCTCTTGTGATCTAATTGTCTGAATTTATGAGTGGCAAGATAGCAAATCACTCGTCTGCAAGGTGGCAGTTGTGCAAATGCAGCACTGTTTTATATACTATAGCACTTTGTTGTTCAATTTGTTGTTTTGTGTTTATACATTGATCATTCTCATTGTTTTTTTGTTGTTGTTGATATTCTTTTCTTTCCTTCTTTCTTTCTTTTGTTCTATAATCTTCATAGGAGACACCTCAAAGTAAAATATTGTGTGGGATATTTTCATAAATTTATTAAAG

I'm not sure where the problem lies. However, one potential future issue I see is assert gene_name not in kogs_dict # sanity check: we should have only one kog result per gene, I guess?

From a cursory view of my KOG results I noticed that there may be some genes with multiple KOG results (see Transcript_506 in KOG file)

I've attached test-files for anyone willing to tackle this: 1) Fasta: http://www.filedropper.com/fasta 2) KOG: http://www.filedropper.com/kog

ADD REPLY • link 8.1 years ago by moldach ▴ 130

1

Entering edit mode

I see! I slightly changed the code - I now use a defaultdict, which by default stores an empty list for any possible key, so I can just append to that.

I run it like this:

python3 compare.py KOG.2 FASTA.fasta

Code is:

from Bio import SeqIO
import sys
from collections import defaultdict

try:
  kog_fh = open(sys.argv[1])
except IndexError:
  sys.stderr.write('Usage: name_of_script kog_file fasta_file\n')
  sys.exit(0)

fasta_f = SeqIO.parse(sys.argv[2], 'fasta') # won't crash if file doesn't exist, not nice

kogs_dict = defaultdict(list)# key: gene name, value: kog result

for line in kog_fh:
  linelist = line.rstrip().split('\t')
  gene_name = linelist[0] # ASSUMPTION: gene is in first column
  kog_result = linelist[10] # counts from zero so column 11 has index 10
  kogs_dict[gene_name].append(kog_result) # we can have several kogs per gene

for s in fasta_f:
  # we want to add the KOG information to the sequence's description, which is everything after the space in the > header
  # this assumes that the gene's ID (everything before the space) is in the kogs output, not the entire name
  kog = kogs_dict[s.id] # with a defaultdict, the default entry is an empty list
  if not kog: # empty lists are 'falsy'
      kog = 'NA'
  else:
      kog = ';'.join(kog)
  s.description = '%s KOG=%s'%(s.description, kog)
  print(s.format('fasta'), end='') # python 3
  # if python2: print(s.format('fasta')), with a comma at the end, else you'll get an extra linebreak

Prints stuff like:

>Transcript_506 len=230 KOG=Nuclear pore complex, p54 component (sc Nup57);Uncharacterized conserved protein ATTCGTTCATCTTCCAGTGCCCATTGTGGGCCTCAATGCACTTGGAAGGCAAATGATGCG ATCAGTGAAGAAGTTCTGGGAACCGACCAGTTGGAACAAGAAGAAGCTGAACAGGCTATC TTGGGGCATCAGCTTTCTAGTCTTGGCGAAATGGCTCTGGGACGATGAGCAGCAAAACCC GAAATGGTTCCAGGAGGTGCCAACCATTCTGGGAGTGCCCCAGGCGCGCG

If there are several KOG terms I separated them using a semicolon, so we have two terms here.

ADD REPLY • link 8.1 years ago by Philipp Bayer 8.8k