Question: Extracting CDS using python
gravatar for Davi.marcon
10 days ago by
Davi.marcon0 wrote:

I have a large dataset of *gb files, and im using Bio.SeqIo.parse() for extracting cds and writing sequences on multifasta format, but it takes too much time. Is there a faster way to get that information using python? I need cds locus tag, sequence and annotation. I've took almost an hour for getting this information from 20 files.

gbk python genome • 122 views
ADD COMMENTlink modified 10 days ago by h.mon31k • written 10 days ago by Davi.marcon0

Can you share your code?

How big are your files?

From my experience, in some gb files the coding sequence is directly associated within the feature, if that is your case, than you can just read the feature, which should be blazing fast, once the SeqRecord is parsed. If not, then on each cds the translation must be done.

Either way, we are no clairvoyants here, so pls, if you can, share code and example data (one sequence suffice), otherwise, this is just gueswork.

ADD REPLYlink written 9 days ago by massa.kassa.sc3na340

Yes, Sure. My files have 9-12Mb Code:

def open_gbk(seq_file):
    sequence_list,id_list = [],[]
    for seq_record in SeqIO.parse(seq_file , 'genbank'):
        for feature in seq_record.features:
            if feature.type == "CDS":
                gene = feature
                feature_id = feature.qualifiers['locus_tag'][0]
                feature_product = feature.qualifiers['product'][0]
                feature_translation = feature.qualifiers['translation'][0]
                feature_sequence = feature.location.extract(seq_record).seq
                    if feature.qualifiers['db_xref'][0].split(":")[0] == "COG":
                        feature_cog = feature.qualifiers['db_xref'][0].split(":")[1]
                    feature_cog = "-" 

  return sequence_list

example data: Dowload Link

ADD REPLYlink modified 9 days ago • written 9 days ago by Davi.marcon0

Before this code i'm using the a class to store values, as follows:

class sequence:
    def __init__(self,seq_id,seq,seq_product,seq_translation,cog): = seq_id
        self.seq = seq
        self.product =seq_product
        self.translation = seq_translation
        self.cog = cog
ADD REPLYlink written 9 days ago by Davi.marcon0

Ok, I've looked into your code, and single most expensive line is this one

feature_sequence = feature.location.extract(seq_record).seq

You don't specify, if you want to have the DNA or the protein sequence for your CDS, and originally I've thought that you want the protein sequence. If that is the case then just remove this line and you should be fine.

From what I know, the creation of the Seq and SeqRecord objects is expensive in Biopython (they, are however powerful). And what you do in the extract is that you create new object for each gene.

If you really need to make this fast, then move from these objects to strings. (Create a string from genome sequence, and extract from that.) You will however need to handle yourself the reverse complement, and maybe introns, if you need to worry about them.

ADD REPLYlink written 7 days ago by massa.kassa.sc3na340

I've solved this problem creating my own version of parsing, using python native string manipulation and my run time now is around 2 minutes. If you want to check it out, you can acess it here: Do you think that the first parse is slow? Or just the extract? Thank you for the responses!

ADD REPLYlink written 7 days ago by Davi.marcon0

Hi, good for you and GJ for creating gbk parser. Although the Biopython parsers are not very fast, they are not worst. As I've wrote, the extract and the creating of the Seq and SeqRecord objects is what is slow here.

ADD REPLYlink written 6 days ago by massa.kassa.sc3na340
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1399 users visited in the last hour