Question

Light-Weight Script-Based Random Access To Large Gff3 Files Using Biopython

3

Entering edit mode

12.4 years ago

Christian ★ 3.1k

I am wondering about the best practice in Biopython to work with large GFF3 files (>100 MB). As minimum requirement, I want to be able to quickly and randomly retrieve features by ID, name, and coordinates, and returned feature objects (SeqRecords?) should have the hierarchical feature structure (e.g. gene -> mRNA -> CDS) correctly mapped onto them. Write access would be nice but is optional. External dependencies should be kept minimal to increase portability to other systems.

I am currently thinking along the line of importing the GFF3 file into a BioSQL database (How? Can I use a SQLite database?) and then access this database with Biopython's BioSQL interface. GFFutils looks useful, but I would prefer to work with Biopython objects (e.g. Bio.SeqRecord).

I know how to do all this in BioPerl bp_seqfeature_load.pl and Bio::DB::SeqFeature::Store), but I am new to Biopython.

Edit: I changed the title a bit to better reflect my requirements.

gff biopython python gff3 • 5.5k views

ADD COMMENT • link 12.4 years ago by Christian ★ 3.1k

1

Entering edit mode

From its table schema, GFFUtils does not seem to use R-tree or binning. It probably won't achieve the speed you need. BioSQL supports the SQLite backend and does the schema part right. I guess by bioperl::store being slow you mean the parsing speed is slow (like always), but not the retrieval speed is slow? UCSC uses similar techniques to host terabytes of data. If I were to do this task, I would write the whole thing by myself. It might be a good one to learn something new if you do not know yet how BioSQL/UCSC achieve fast regional queries. As to the interface to BioPython, you can generate GFF lines from the database and let BioPython parse them. You naturally get Bio.SeqRecord.

ADD REPLY • link 12.4 years ago by lh3 33k

0

Entering edit mode

gffutils author here. OP's link is actually to a fork of an old version -- see https://github.com/daler/gffutils instead.

@lh3, adding UCSC binning should be trivial, though allowing backwards compatibility with existing gffutils databases will be a little awkward. But I should look into BioSQL's schema to help improve that of gffutils.

@Christian, a Bio.SeqRecord adapter should also be straightforward to add and would be generally useful -- I'll add it to the todo list.

ADD REPLY • link 12.4 years ago by Ryan Dale 5.0k

0

Entering edit mode

For a large file with millions of lines, a spatial index such as R-tree or UCSC binning is a must-have. As to the backward compatibility, you can append a "bin" column to the old version of the table when it is absent. Sqlite also supports the R-tree index, but it is not compiled by default. Probably it would be better not to rely on this feature. As to BioSQL, I said it is right only because it uses binning. I more like to stuff the entire GFF in one table as GFFUtils.

ADD REPLY • link 12.4 years ago by lh3 33k

0

Entering edit mode

Yes, by BioPerl's SeqFeature::Store being slow I meant the parsing part. But as I said, this only becomes an issue if you programmatically retrieve thousands of database features, which is why SeqFeature::Store's performance is in general sufficient for genome browsers where you usually look at small genomic regions at a time containing only few features.

Generating Bio.SeqRecords from BioSQL via a GFF intermediate could work, but is likely not very efficient because of the parsing overhead that comes with it. I am wondering if serialization/deserialization would be faster, i.e. one could store a serialized version of the Bio.SeqRecord object in the database and just deserialize it upon retrieval.

If I go with the BioSQL solution, I would have to import the GFF into BioSQL first. What would be the best/fastest way to load a GFF file into a BioSQL-schema database? So far I am only aware of the gff_to_biosql.py script from Brad Chapman available from GitHub.

ADD REPLY • link 12.4 years ago by Christian ★ 3.1k

score 1 · Answer 1 · 2013-02-20

1

Entering edit mode

12.4 years ago

vipin.ts ▴ 60

Here are some codes from Brad Chapman, which will help you to parse the GFF3 files into seqrecord objects based on the parent child relationship. https://github.com/chapmanb/bcbb/tree/master/gff/

The documentation can be found at: http://biopython.org/wiki/GFF_Parsing/

For example:

    from BCBio import GFF
    in_file = "your_file.gff"
    limit_info = dict(
                           gff_id = ["chr1"],
                           gff_source = ["Coding_transcript"])
    in_handle = open(in_file)
    for rec in GFF.parse(in_handle, limit_info=limit_info): 
           print rec.features[0]
    in_handle.close()

--Vipin

ADD COMMENT • link 12.4 years ago by vipin.ts ▴ 60

1

Entering edit mode

I already found Brad Chapman's GFF parser, but as far as I can tell it won't give me fast random access to features in the GFF file. The parser would still iterate over the whole file just to find the few features I am interested in. Of course I could store every feature in a dictionary and then query the dictionary for fast access, but this would take a lot of RAM for large files and I would have to parse the GFF input file every time I restart my script.

ADD REPLY • link 12.4 years ago by Christian ★ 3.1k

1

Entering edit mode

That's right, there isn't indexing built into the parser. There is a discussion on the gffutils GitHub page about integrating these features so we could get the best of both approaches: https://github.com/daler/gffutils/issues/2

ADD REPLY • link 12.4 years ago by Brad Chapman 9.7k

0

Entering edit mode

Thanks for sharing this very relevant discussion. It appears to me that the development of gffutils is going into the right direction, and eventually might provide exactly the solution that I am looking for.

ADD REPLY • link 12.4 years ago by Christian ★ 3.1k

score 0 · Answer 2 · 2013-02-20

A SeqFeature database (in BioPerl) is not the fastest for gff3, but has its advantages for other applications (like GBrowse). The current fastest would be Bio::gff3, but that is not the point of your question.

I think the Fastq reader in Biopython already uses the implementation you mention via SQLite, so make sure you don't try to implement something that is in place already. I'm sorry I can't offer any python code, but I will say that BerkeleyDB is much faster than SQLite for simple transactions. If there are some complex DB commands you need to invoke, then use SQLite.

EDIT: I should qualify my statement about BerkeleyDB for future reference. I think the correct answer is that it may be faster than SQLite and the performance is probably also influenced by the API. There was an excellent discussion on this topic a while back that is informative (see the white papers at the end of the top answer for further reading). I agree with lh3 that going down this path would be a great learning experience and you would probably come out with something really useful, but for an easier approach I would probably pursue Istvan's suggestion. Hope that helps.

score 0 · Answer 3 · 2013-02-21

0

Entering edit mode

12.4 years ago

Istvan Albert 102k

I think the best approach would be something based on tabix and python interfaces around it. I think pysam has tabix readers:

tabix: http://samtools.sourceforge.net/tabix.shtml

pysam: http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html

ADD COMMENT • link 12.4 years ago by Istvan Albert 102k

0

Entering edit mode

tabix would allow me to retrieve features by genomic position, but not by ID or name, correct?