Extracting CDS using python
0
0
Entering edit mode
2.3 years ago

I have a large dataset of *gb files, and im using Bio.SeqIo.parse() for extracting cds and writing sequences on multifasta format, but it takes too much time. Is there a faster way to get that information using python? I need cds locus tag, sequence and annotation. I've took almost an hour for getting this information from 20 files.

genome Python Gbk • 2.4k views
0
Entering edit mode

From my experience, in some gb files the coding sequence is directly associated within the feature, if that is your case, than you can just read the feature, which should be blazing fast, once the SeqRecord is parsed. If not, then on each cds the translation must be done.

Either way, we are no clairvoyants here, so pls, if you can, share code and example data (one sequence suffice), otherwise, this is just gueswork.

0
Entering edit mode

Yes, Sure. My files have 9-12Mb Code:

def open_gbk(seq_file):
sequence_list,id_list = [],[]
for seq_record in SeqIO.parse(seq_file , 'genbank'):
for feature in seq_record.features:
if feature.type == "CDS":
gene = feature
feature_id = feature.qualifiers['locus_tag'][0]
feature_product = feature.qualifiers['product'][0]
feature_translation = feature.qualifiers['translation'][0]
feature_sequence = feature.location.extract(seq_record).seq
try:
if feature.qualifiers['db_xref'][0].split(":")[0] == "COG":
feature_cog = feature.qualifiers['db_xref'][0].split(":")[1]
except:
feature_cog = "-"

sequence_list.append(sequence(feature_id,feature_sequence,feature_product,feature_translation,feature_cog))
return sequence_list


0
Entering edit mode

Before this code i'm using the a class to store values, as follows:

class sequence:
def __init__(self,seq_id,seq,seq_product,seq_translation,cog):
self.id = seq_id
self.seq = seq
self.product =seq_product
self.translation = seq_translation
self.cog = cog

1
Entering edit mode

Ok, I've looked into your code, and single most expensive line is this one

feature_sequence = feature.location.extract(seq_record).seq

You don't specify, if you want to have the DNA or the protein sequence for your CDS, and originally I've thought that you want the protein sequence. If that is the case then just remove this line and you should be fine.

From what I know, the creation of the Seq and SeqRecord objects is expensive in Biopython (they, are however powerful). And what you do in the extract is that you create new object for each gene.

If you really need to make this fast, then move from these objects to strings. (Create a string from genome sequence, and extract from that.) You will however need to handle yourself the reverse complement, and maybe introns, if you need to worry about them.

0
Entering edit mode

I've solved this problem creating my own version of parsing, using python native string manipulation and my run time now is around 2 minutes. If you want to check it out, you can acess it here: https://github.com/Mxrcon/Biopytools/blob/master/pepnucfunction.py Do you think that the first parse is slow? Or just the extract? Thank you for the responses!

0
Entering edit mode

Hi, good for you and GJ for creating gbk parser. Although the Biopython parsers are not very fast, they are not worst. As I've wrote, the extract and the creating of the Seq and SeqRecord objects is what is slow here.