Converting Gene Predictions In Glimmer To Protein Sequences.
0
0
Entering edit mode
9.3 years ago
genomelover ▴ 50

Hi to all, I have used GLIMMER 3.2 to predict genes in the bacterial genomes and I get the ORF ID and not the protein sequences. These sequences have to be used in the another tool and it should be in FASTA FORMAT.So I have used a biopython script to convert gene predictions in GFF3 format to protein sequences.I am using .predict file of GLIMMER which I am not sure as it not exactly in GFF3 format, tried checking it using GFF3 validator and only the first few lines are in GFF3 format, GFF3 format indeed is divided into many kinds which I read many days before I am still not able to figure out which format is the GLIMMER output.The biopython script I am running gives the error. Plz help as I new to the world of bioinformatics,LINUX.The biopython script is here

#!/usr/bin/env python
"""Convert GlimmerHMM GFF3 gene predictions into protein sequences.

This works with the GlimmerHMM GFF3 output format:

##gff-version 3
##sequence-region Contig5.15 1 47390
Contig5.15      GlimmerHMM      mRNA    323     325     .       +       .       ID=Contig5.15.path1.gene1;Name=Contig5.15.path1.gene1
Contig5.15      GlimmerHMM      CDS     323     325     .       +       0       ID=Contig5.15.cds1.1;Parent=Contig5.15.path1.gene1;Name=Contig5.15.path1.gene1;Note=final-exon

http://www.cbcb.umd.edu/software/GlimmerHMM/

Usage:
glimmergff_to_proteins.py <glimmer gff3> <ref fasta>
"""
from __future__ import with_statement
import sys
import os
import operator

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

from BCBio import GFF

def main(glimmer_file, ref_file):
with open(ref_file) as in_handle:
ref_recs = SeqIO.to_dict(SeqIO.parse(in_handle, "fasta"))

base, ext = os.path.splitext(glimmer_file)
out_file = "%s-proteins.fa" % base
with open(out_file, "w") as out_handle:
SeqIO.write(protein_recs(glimmer_file, ref_recs), out_handle, "fasta")

def protein_recs(glimmer_file, ref_recs):
"""Generate protein records from GlimmerHMM gene predictions.
"""
with open(glimmer_file) as in_handle:
for rec in glimmer_predictions(in_handle, ref_recs):
for feature in rec.features:
seq_exons = []
for cds in feature.sub_features:
seq_exons.append(rec.seq[
cds.location.nofuzzy_start:
cds.location.nofuzzy_end])
if feature.strand == -1:
gene_seq = gene_seq.reverse_complement()
protein_seq = gene_seq.translate()
yield SeqRecord(protein_seq, feature.qualifiers["ID"][0], "", "")

def glimmer_predictions(in_handle, ref_recs):
"""Parse Glimmer output, generating SeqRecord and SeqFeatures for predictions
"""
for rec in GFF.parse(in_handle, target_lines=1000, base_dict=ref_recs):
yield rec

if __name__ == "__main__":
if len(sys.argv) != 3:
print __doc__
sys.exit()
main(*sys.argv[1:])


The error which I get is I have tried running this script many times and I get this error most of the times

Traceback (most recent call last):
File "glimmergff_to_proteins.py", line 62, in <module>
main(*sys.argv[1:])
File "glimmergff_to_proteins.py", line 27, in main
with open(ref_file) as in_handle:

Value Error:More than one record found in handle


Every time I run the script I get different errors,The previous error which I think was due to the multiple FASTA file format for the reference file and when the split file was run it shows the same, further splitting also shows the same error.The most recent error which I got was

Traceback (most recent call last):
File "glimmergff_to_proteins.py", line 62, in <module>
main(*sys.argv[1:])
File "glimmergff_to_proteins.py", line 33, in main
SeqIO.write(protein_recs(glimmer_file, ref_recs), out_handle, "fasta")
File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 426, in write
count = writer_class(fp).write_file(sequences)
File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/Interfaces.py", line 254, in write_file
count = self.write_records(records)
File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/Interfaces.py", line 238, in write_records
for record in records:
File "glimmergff_to_proteins.py", line 39, in protein_recs
for rec in glimmer_predictions(in_handle, ref_recs):
File "glimmergff_to_proteins.py", line 55, in glimmer_predictions
for rec in GFF.parse(in_handle, target_lines=1000, base_dict=ref_recs):
File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 712, in parse
target_lines):
File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 299, in parse_in_parts
for results in self.parse_simple(gff_files, limit_info, target_lines):
File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 320, in parse_simple
for results in self._gff_process(gff_files, limit_info, target_lines):
File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 606, in _gff_process
for out in self._lines_to_out_info(line_gen, limit_info, target_lines):
File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 637, in _lines_to_out_info
results = self._map_fn(line, params)
File "/usr/local/lib/python2.7/dist-packages/bcbio_gff-0.3a-py2.7.egg/BCBio/GFF/GFFParser.py", line 157, in _gff_line_map
8w assert len(parts) >= 8, line


Please reply and help in making the script work, have been working on this for months but without any results. Please feel free to mail me (nope: all responses stay on site (md)). My mail id is [deleted]

Here is the link to the script so you can check the line number

script biopython • 6.0k views
5
Entering edit mode

Dear genomelover, it's quite bad netiquette to be shouting for help in uppercase letters, and to demand a reply ASAP...

If you also take the time to tell us what you have already tried to get rid of those errors, you will drastically improve your chances of getting a quick reply : http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002202.

2
Entering edit mode

Also, including line numbers in the script above would be helpful, as without them, it's difficult to trace the errors based on the current information? I don't know what line 27 and line 62 are.

0
Entering edit mode

0
Entering edit mode

I see it is long time you posted this question. Since I was looking for something similar, I was trying the code you suggested. I did not get any error. The problem is the following and it is often not considered by people who thinks to be good at coding:

gene predictions can be partial. Therefore the frame of the first exon can be different from the 0th (that is the first codon can be partial). It follows that a good gff to sequence script has to take this into account e.g. by trimming the sequence of the first exon before translating it. It seems to me that the script you were using does not take this fact into account. If this is so, it will output completely wrong protein sequences in those cases.