Question: Light-Weight Script-Based Random Access To Large Gff3 Files Using Biopython
gravatar for Christian
7.6 years ago by
Cambridge, US
Christian2.9k wrote:

I am wondering about the best practice in Biopython to work with large GFF3 files (>100 MB). As minimum requirement, I want to be able to quickly and randomly retrieve features by ID, name, and coordinates, and returned feature objects (SeqRecords?) should have the hierarchical feature structure (e.g. gene -> mRNA -> CDS) correctly mapped onto them. Write access would be nice but is optional. External dependencies should be kept minimal to increase portability to other systems.

I am currently thinking along the line of importing the GFF3 file into a BioSQL database (How? Can I use a SQLite database?) and then access this database with Biopython's BioSQL interface. GFFutils looks useful, but I would prefer to work with Biopython objects (e.g. Bio.SeqRecord).

I know how to do all this in BioPerl and Bio::DB::SeqFeature::Store), but I am new to Biopython.

Edit: I changed the title a bit to better reflect my requirements.

gff3 python gff biopython • 3.2k views
ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Christian2.9k

From its table schema, GFFUtils does not seem to use R-tree or binning. It probably won't achieve the speed you need. BioSQL supports the SQLite backend and does the schema part right. I guess by bioperl::store being slow you mean the parsing speed is slow (like always), but not the retrieval speed is slow? UCSC uses similar techniques to host terabytes of data. If I were to do this task, I would write the whole thing by myself. It might be a good one to learn something new if you do not know yet how BioSQL/UCSC achieve fast regional queries. As to the interface to BioPython, you can generate GFF lines from the database and let BioPython parse them. You naturally get Bio.SeqRecord.

ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by lh332k

gffutils author here. OP's link is actually to a fork of an old version -- see instead.

@lh3, adding UCSC binning should be trivial, though allowing backwards compatibility with existing gffutils databases will be a little awkward. But I should look into BioSQL's schema to help improve that of gffutils.

@Christian, a Bio.SeqRecord adapter should also be straightforward to add and would be generally useful -- I'll add it to the todo list.

ADD REPLYlink written 7.6 years ago by Ryan Dale4.9k

For a large file with millions of lines, a spatial index such as R-tree or UCSC binning is a must-have. As to the backward compatibility, you can append a "bin" column to the old version of the table when it is absent. Sqlite also supports the R-tree index, but it is not compiled by default. Probably it would be better not to rely on this feature. As to BioSQL, I said it is right only because it uses binning. I more like to stuff the entire GFF in one table as GFFUtils.

ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by lh332k

Yes, by BioPerl's SeqFeature::Store being slow I meant the parsing part. But as I said, this only becomes an issue if you programmatically retrieve thousands of database features, which is why SeqFeature::Store's performance is in general sufficient for genome browsers where you usually look at small genomic regions at a time containing only few features.

Generating Bio.SeqRecords from BioSQL via a GFF intermediate could work, but is likely not very efficient because of the parsing overhead that comes with it. I am wondering if serialization/deserialization would be faster, i.e. one could store a serialized version of the Bio.SeqRecord object in the database and just deserialize it upon retrieval.

If I go with the BioSQL solution, I would have to import the GFF into BioSQL first. What would be the best/fastest way to load a GFF file into a BioSQL-schema database? So far I am only aware of the script from Brad Chapman available from GitHub.

ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by Christian2.9k
gravatar for vipin.ts
7.6 years ago by
New York City
vipin.ts60 wrote:

Here are some codes from Brad Chapman, which will help you to parse the GFF3 files into seqrecord objects based on the parent child relationship.

The documentation can be found at:

For example:

    from BCBio import GFF
    in_file = "your_file.gff"
    limit_info = dict(
                           gff_id = ["chr1"],
                           gff_source = ["Coding_transcript"])
    in_handle = open(in_file)
    for rec in GFF.parse(in_handle, limit_info=limit_info): 
           print rec.features[0]


ADD COMMENTlink written 7.6 years ago by vipin.ts60

I already found Brad Chapman's GFF parser, but as far as I can tell it won't give me fast random access to features in the GFF file. The parser would still iterate over the whole file just to find the few features I am interested in. Of course I could store every feature in a dictionary and then query the dictionary for fast access, but this would take a lot of RAM for large files and I would have to parse the GFF input file every time I restart my script.

ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by Christian2.9k

That's right, there isn't indexing built into the parser. There is a discussion on the gffutils GitHub page about integrating these features so we could get the best of both approaches:

ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by Brad Chapman9.5k

Thanks for sharing this very relevant discussion. It appears to me that the development of gffutils is going into the right direction, and eventually might provide exactly the solution that I am looking for.

ADD REPLYlink written 7.6 years ago by Christian2.9k
gravatar for SES
7.6 years ago by
Vancouver, BC
SES8.4k wrote:

A SeqFeature database (in BioPerl) is not the fastest for gff3, but has its advantages for other applications (like GBrowse). The current fastest would be Bio::gff3, but that is not the point of your question.

I think the Fastq reader in Biopython already uses the implementation you mention via SQLite, so make sure you don't try to implement something that is in place already. I'm sorry I can't offer any python code, but I will say that BerkeleyDB is much faster than SQLite for simple transactions. If there are some complex DB commands you need to invoke, then use SQLite.

EDIT: I should qualify my statement about BerkeleyDB for future reference. I think the correct answer is that it may be faster than SQLite and the performance is probably also influenced by the API. There was an excellent discussion on this topic a while back that is informative (see the white papers at the end of the top answer for further reading). I agree with lh3 that going down this path would be a great learning experience and you would probably come out with something really useful, but for an easier approach I would probably pursue Istvan's suggestion. Hope that helps.

ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by SES8.4k

I worked quite a bit with Bio::DB::SeqFeature::Store in BioPerl and I can confirm that speed becomes an issue if you have to process thousands of features. So I was wondering if something similar or better exists in Biopython.

So when I read your answer correctly, there is currently no equivalent to the Fastq reader with database backend available for GFF files?

ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by Christian2.9k

I'm not sure what method Biopython uses for GFF files. I was just trying to say that the code is already implemented for reading large Fastq files, so even if it is not in place for reading GFF files, you don't need to start from scratch if this is what you are trying to achieve. Sorry if that was unclear.

ADD REPLYlink written 7.6 years ago by SES8.4k
gravatar for Istvan Albert
7.6 years ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

I think the best approach would be something based on tabix and python interfaces around it. I think pysam has tabix readers:



ADD COMMENTlink written 7.6 years ago by Istvan Albert ♦♦ 85k

tabix would allow me to retrieve features by genomic position, but not by ID or name, correct?

ADD REPLYlink written 7.6 years ago by Christian2.9k

That is why I have not recommended tabix.

ADD REPLYlink written 7.6 years ago by lh332k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1210 users visited in the last hour