2.4 years ago
Davi.marcon • 0
I have a large dataset of *.gb files, and I'm using Bio.SeqIO.parse() to extract CDS features and write the sequences in multi-FASTA format, but it takes too much time. Is there a faster way to get that information using Python? I need the CDS locus tag, sequence and annotation. It took almost an hour to get this information from 20 files.
Can you share your code?
How big are your files?
In my experience, some gb files store the translated coding sequence directly in the CDS feature. If that is your case, then you can just read it from the feature, which should be blazing fast once the SeqRecord is parsed. If not, then the translation must be done for each CDS.
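As a minimal sketch of reading the stored translation directly from the feature: the helper names cds_records and write_protein_fasta are made up for illustration, and the sketch assumes the files carry locus_tag, product and translation qualifiers on their CDS features. Features without a stored translation (pseudogenes, for example) are simply skipped here.

```python
def cds_records(features):
    """Yield (locus_tag, product, translation) for each CDS feature
    that already carries a translation qualifier."""
    for feature in features:
        if feature.type != "CDS":
            continue
        qualifiers = feature.qualifiers  # dict of qualifier name -> list of values
        translation = qualifiers.get("translation", [None])[0]
        if translation is None:
            continue  # nothing stored in the file for this CDS
        locus_tag = qualifiers.get("locus_tag", ["unknown"])[0]
        product = qualifiers.get("product", [""])[0]
        yield locus_tag, product, translation


def write_protein_fasta(gb_path, fasta_path):
    """Write one FASTA entry per CDS without calling extract()."""
    # Imported here so cds_records() can be used without Biopython installed.
    from Bio import SeqIO

    with open(fasta_path, "w") as out:
        for record in SeqIO.parse(gb_path, "genbank"):
            for locus_tag, product, protein in cds_records(record.features):
                out.write(f">{locus_tag} {product}\n{protein}\n")
```

This only covers the protein case. If the nucleotide sequence is needed, the coordinates still have to be applied to the genome sequence.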
Either way, we are not clairvoyants here, so please, if you can, share your code and example data (one sequence suffices). Otherwise, this is just guesswork.
Yes, sure. My files are 9-12 MB. Code:
Example data: Download Link
Before this code, I'm using a class to store values, as follows:
OK, I've looked into your code, and the single most expensive line is this one:
feature_sequence = feature.location.extract(seq_record).seq
You don't specify whether you want the DNA or the protein sequence for your CDS, and originally I thought that you wanted the protein sequence. If that is the case, then just remove this line and you should be fine.
From what I know, creating SeqRecord objects is expensive in Biopython (they are, however, powerful). And what you do in the extract call is create a new object for each gene.
If you really need to make this fast, then move from these objects to strings: create a string from the genome sequence and extract from that. You will, however, need to handle the reverse complement yourself, and maybe introns, if you need to worry about them.
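A rough sketch of that string-based approach, using only the standard library. The function names and the (start, end) tuple layout are my own choices for illustration. It assumes 0-based, half-open coordinates, parts listed in ascending genomic order, and plain A/C/G/T bases (IUPAC ambiguity codes other than N would need a larger translation table).

```python
# Translation table for complementing plain DNA; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def extract_cds(genome, parts, strand):
    """Slice a CDS out of a genome string.

    genome: the whole sequence as a str
    parts:  list of (start, end) tuples, 0-based half-open, ascending order;
            more than one tuple means a joined (spliced) location
    strand: 1 for the forward strand, -1 for the reverse strand
    """
    sequence = "".join(genome[start:end] for start, end in parts)
    if strand == -1:
        # Reverse-complementing the joined slices gives the minus-strand CDS.
        return reverse_complement(sequence)
    return sequence
```

With Biopython, the inputs could come from str(seq_record.seq) and from int(part.start), int(part.end) for each part in feature.location.parts. Note that Biopython may list the parts of a minus-strand location in reverse order, so they would need sorting before being passed in.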
I solved this problem by writing my own parser using native Python string manipulation, and my run time is now around 2 minutes. If you want to check it out, you can access it here: https://github.com/Mxrcon/Biopytools/blob/master/pepnucfunction.py Do you think that the initial parse is slow, or just the extract? Thank you for the responses!
Hi, good for you, and good job on creating a gbk parser. Although the Biopython parsers are not very fast, they are not the worst. As I wrote, the extract call and the creation of the SeqRecord objects are what is slow here.