Question: Gff Parsing In Biopython Is Not So Perfect
1
gravatar for Gahoo
7.5 years ago by
Gahoo260
United States
Gahoo260 wrote:

I use BCBio.GFF to parse chr01.gff3 and all.gff3. But things didn't work out as I expect. Here's the code:

from BCBio import GFF
limits = dict(gff_type = ["gene","mRNA","CDS"])
gff_handle = open('chr01.gff3')
for rec in GFF.parse(gff_handle,target_lines=1000,limit_info=limits):
    #Chromosome seq level
    for gene_feature in rec.features:
        #gene level
        for mRNA_feature in gene_feature.sub_features:
            #mRNA level
            print mRNA_feature.type
            print mRNA_feature.qualifiers['Alias']

And I got:

Traceback (most recent call last):
  File "R:\Untitled 1.py", line 14, in <module>
    print mRNA_feature.qualifiers['Alias']
KeyError: 'Alias'

And the 'type' is "CDS" which is not correct. When parsing without

target_lines=1000

everything is ok. But parsing all.gff3 came to the same problem. Maybe all.gff3 is too huge to parse.

The problem might be due to the parser did not recognise the entry boudary correctly.

Any suggestions or alternative packages? I still want to do this in python.

python gff biopython • 6.5k views
ADD COMMENTlink modified 7.5 years ago by Brad Chapman9.4k • written 7.5 years ago by Gahoo260

Did you ask about this on the biopython mailing list?

ADD REPLYlink written 7.5 years ago by Scott Cain750

Nope. I didn't join the mailing list.

ADD REPLYlink written 7.5 years ago by Gahoo260
5
gravatar for Brad Chapman
7.5 years ago by
Brad Chapman9.4k
Boston, MA
Brad Chapman9.4k wrote:

Unfortunately this GFF file misuses the ### directive so the target_lines parameter will not be a reliable way to break the file keeping entire features together. Since GFF is a line oriented format with references between lines handled by ID/Parent relationships, it tries to intelligently break lines when entire features are together.

The ### directive in the file tells a parser that it has completed all nested features, so it would be safe to pass along finished objects. The GFF3 specification has more details. This GFF3 file issues those directives at the end of each transcript, which is a problem for multi-transcript genes. For instance:

Chr1    MSU_osa1r6      gene    15292   19323   .       +       . ID=13101.t00004;Name=expressed%20protein;Alias=LOC_Os01g01040
Chr1    MSU_osa1r6      mRNA    15321   19323   .       +       . ID=13101.m00008;Parent=13101.t00004;Alias=LOC_Os01g01040.2
###
Chr1    MSU_osa1r6      mRNA    15321   19323   .       +       . ID=13101.m00007;Parent=13101.t00004;Alias=LOC_Os01g01040.3
###

The second mRNA references the first, but the ### is telling the parser it is okay to make a break here.

This why removing target_lines fixes the issue. You are seeing errors with all.gff3 because some of the mRNA features do not have Alias keywords. For instance:

ChrSy   MSU_osa1r6      mRNA    488     1166    .       -       . ID=13114.m00090;Parent=13114.t00002

If parsing the entire 'all.gff3` file uses too much memory a safe way to break is by each chromosome. This is slower because it needs to parse the file multiple times, but will guarantee correctly reconstructed gene/mRNA/CDS features:

import sys
from BCBio import GFF

in_file = sys.argv[1]

examiner = GFF.GFFExaminer()
with open(in_file) as gff_handle:
    possible_limits = examiner.available_limits(gff_handle)
chromosomes = sorted(possible_limits["gff_id"].keys())
print chromosomes
limits = dict(gff_type = ["gene","mRNA","CDS"])
for chrom in chromosomes:
    with open(in_file) as gff_handle:
        limits["gff_id"] = chrom
        for rec in GFF.parse(gff_handle, limit_info=limits):
            #Chromosome seq level
            for gene_feature in rec.features:
                #gene level
                for mRNA_feature in gene_feature.sub_features:
                    #mRNA level
                    print mRNA_feature.type
                    try:
                        print mRNA_feature.qualifiers['Alias']
                    except:
                        print gene_feature
                        raise
ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by Brad Chapman9.4k
2

Or to put it another way, GIGO. Did you try validating the GFF3?

ADD REPLYlink written 7.5 years ago by Peter5.8k

Thanks for your answer. ### is the problem. But the reason paring all.gff goes wrong is not because of the mRNA features without 'Alias'. The Exception raise before the parser gets to ChrSy. On my computer, 'LOC_Os09g40085.1' is the last correct one. The problem is similar with the one when pasring chr01.gff. But this time it might cause by the memory limits that make it parse the file incompletely.

ADD REPLYlink written 7.5 years ago by Gahoo260

Can you do a try/except on the qualifiers['Alias'] call and print out the problem mRNA_feature, like in the code example? When I do that the ChrSy line is the problem. The parser will not change behavior based on memory usage; instead you'd slow down and start swapping or get a python memory error.

ADD REPLYlink written 7.5 years ago by Brad Chapman9.4k
1
gravatar for Giovanni M Dall'Olio
7.5 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

Can you please report this error to Biopython's bug tracker on redmine?

ADD COMMENTlink written 7.5 years ago by Giovanni M Dall'Olio26k

Error reported. https://redmine.open-bio.org/issues/3311

ADD REPLYlink written 7.5 years ago by Gahoo260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1057 users visited in the last hour