Question

Problem With A Code For An Automatic Annotation Of A Genome Using Biopython

2

Entering edit mode

10.7 years ago

jcastrofigueroa ▴ 140

Hello: I have been writing a code for automatic annotation of a genome. Basically, what I've done is to use the DNA sequence of the contig, then predict the ORFs with prodigal, then use this info for BLAST analysis and thus obtain the putative function of each predicted protein. The final goal is to have *.gbk file readable in Artemis.

The code and problems: The variable record is supposed to contain both the DNA sequence and all predicted function of the genes, obtained from BLAST analysis (the loop). As I said all info is stored in the same variable record, like this record = (SeqIO.parse(dna_file,"fasta"),SeqFeature(FeatureLocation(a2, b, strand = c), type = "CDS", qualifiers=quali)) (see the code below).

When I try to save this info as a GenBank file (final part of the code) an error is shown, also, I think that the variable record is being overwritten inside the loop which is also a problem. I don't know what is my mistake at this point, I'm stuck. Please, help!!! - any advice will help

Thanks a lot

###Import GFF reader
from BCBio.GFF import GFFExaminer
from BCBio import GFF

###Import SeqIO for parsing fasta results
from Bio import SeqIO

###Import Seq feature
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

###Import BLAST module and XML parser
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML



###Reading *.fasta file containing the parent sequence
dna_file = "/Users/k/prodigal/640.fasta"
record = list(SeqIO.parse(dna_file,"fasta"))


###Reading *.gff file with information of gene prediction from PRODIGAL for one contig
gff_file = "/Users/k/prodigal/my640.gff"
gff_handle = open(gff_file)


###Reading .fasta files cointaining protein prediction sequences using PRODIGAL
prot_file = "/Users/k/prodigal/protseq640.fasta"
###Recording protein sequences in a list to BLAST them one by one
prot_list = list(SeqIO.parse(prot_file,"fasta"))


    ###Parsing resuts from *.gff file
for rec in GFF.parse(gff_handle):
    ###This will tell you the number of genes in the file
    entries = len(rec.features)

    ###This will extract the information of the gene you chose:
    for x in range(0,entries):
        feature = rec.features[x]

        ###Start location
        a = feature.location.start
        a2 = a+1

        ###End location
        b = feature.location.end

        ###Strand
        c = feature.strand

        ###Will retrieve the protein sequence needed
        prot_in_contig = prot_list[x].seq

        ###BLAST on line
        blast_handle = NCBIWWW.qblast("blastp", "nr", prot_in_contig)
        save_file = open("/Users/k/Desktop/my_blast.xml", "w")
        save_file.write(blast_handle.read())
        save_file.close()
        blast_handle.close()
        for record in NCBIXML.parse(open("/Users/k/Desktop/my_blast.xml")):
            if record.alignments :
                ##print record.alignments[0].hsps[0].score
                blast_match = "%s with e-value of %s" % (record.alignments[0].hit_def,
                                                         record.alignments[0].hsps[0].expect)


        ###All additional information recorded in the section "qualifiers"
        quali = {"ID":x+1,"translation":prot_in_contig,"source":"Prediction with PRODIGAL",
                 "blast_match":blast_match, "blast_match_ID":record.alignments[0].hit_id}


        ###Variable that records all features: FeatureLocation; 
        record = (SeqIO.parse(dna_file,"fasta"),SeqFeature(FeatureLocation(a2, b, strand = c), type = "CDS",
                                                           qualifiers=quali))


f_handle = open("/Users/k/Desktop/contig", "w")
SeqIO.write(record, f_handle, "genbank")

genbank annotation biopython • 5.3k views

ADD COMMENT • link updated 10.7 years ago by Peter 6.0k • written 10.7 years ago by jcastrofigueroa ▴ 140

0

Entering edit mode

If you get an error message, it is almost always useful to tell us what the error message is.n

ADD REPLY • link 10.7 years ago by Peter 6.0k

score 3 · Answer 1 · 2013-08-20

Evidently my answer on creating embl file using biopython - genome annotation was too terse/short.

In addition to the points raised by Phillip, there are several possible confusions or problems,

record = list(SeqIO.parse(...))

That gives you a list of SeqRecord objects, yet for the variable name (record is singular) you appear to expect just one SeqRecord, perhaps this:

record = SeqIO.read(...)

instead? The SeqIO.parse(...) function loops over all the records, while the SeqIO.read(...) function returns the first and only record (or gives an error if there is not just one record in the file).

Next, you never actually add the SeqFeature to the SeqRecord's feature list, so it will never get written out in GenBank/EMBl format via SeqIO.write(...). This just makes a tuple,

record = (SeqIO.parse(dna_file,"fasta"),SeqFeature(...))

That tuple contains a generator function (which would iterator over the records) and a SeqFeature, which is not sensible here. This is what causes the error you saw, AttributeError: 'generator' object has no attribute 'name'.

Assuming record is a single SeqRecord object (see point above), try:

record.features.append((SeqFeature(...))

score 1 · Answer 2 · 2013-08-20

1

Entering edit mode

10.7 years ago

Philipp Bayer 8.3k

You do something really weird when you loop over your BLAST-results -

For each record, if there are alignment-hits, store the best alignment in the variable "blast_match". Then you create the variable "quali", but not with all "blast_match"es - instead, since you define "quali" outside the loop, you only get the very last blast_match. I don't think this is intended, since you want all blast_matches.

Then you overwrite your "record"-variable with a new "record"-variable that is based on parsing your "dna_file" - keep in mind that calling "SeqIO.parse" here will lead to BioPython parsing the entire file again, which takes forever if you have many records in your GFF3-file.

What is the exact error-message when you call "SeqIO.write" in the last line?

ADD COMMENT • link 10.7 years ago by Philipp Bayer 8.3k

0

Entering edit mode

Thanks. The variable "blast_match" should store the best hit from the blast analysis for each query. But, I think that I´m saving the first best hit in the loop.

The error that I get is this:

Traceback (most recent call last):
  File "/Users/k/Desktop/gff3_parse.py", line 104, in <module>
    SeqIO.write(record, f_handle, "genbank")
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 426, in write
    count = writer_class(fp).write_file(sequences)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 254, in write_file
    count = self.write_records(records)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 239, in write_records
    self.write_record(record)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/InsdcIO.py", line 695, in write_record
    self._write_the_first_line(record)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/InsdcIO.py", line 513, in _write_the_first_line
    locus = record.name
AttributeError: 'generator' object has no attribute 'name'

ADD REPLY • link 10.7 years ago by jcastrofigueroa ▴ 140

1

Entering edit mode

That's because instead of calling SeqIO.write(...) with a SeqRecord object, you've given it an interator/generator function. See my answer for details.

ADD REPLY • link 10.7 years ago by Peter 6.0k