Question: Extracting Gene nucleotide sequences from a Genbank files using Biopython
0
gravatar for tpaisie
2.2 years ago by
tpaisie70
University of Florida
tpaisie70 wrote:

Hi everyone, I'm trying to extract a CDS based on the product value. Basically want to extract a certain gene from multiple genbank files. This is the code I have so far:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature
from Bio.SeqRecord import SeqRecord

gb_file = open("fimA_seqs.gb", "r")

for gb_record in SeqIO.parse(gb_file, "genbank"):
    # now do something with the record
    for feat in gb_record.features:
        if feat.type == "CDS":
            product = feat.qualifiers['product'][0]
            if product == 'Porphyromonas gingivalis major fimbrial subunit protein (FimA)' :

So I want to take the nucleotide sequence from any CDS (feature) with that product (qualifier) label and put them all in the same fasta file. Hopefully that makes sense! Any help would be greatly appreciated!!

sequence python gene • 3.5k views
ADD COMMENTlink modified 2.2 years ago by jrj.healey12k • written 2.2 years ago by tpaisie70
0
gravatar for jrj.healey
2.2 years ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

See my answer here - specifically the last script to extract CDS features as NA.

A: How do can I use Biopython and SeqIO to parse out multiple genes from several NC

You should be able to modify the code easily with something akin to your line:

 if feature.id == 'Porphyromonas gingivalis major fimbrial subunit protein (FimA)' :

Obviously this will only work if the ID matches exactly. I don't know what your input file is like. Something like:

if 'FimA' in feature.id:

might work if the feature name isn't always the same - have a play around. This link might also be useful: http://biopython.org/DIST/docs/api/Bio.SeqFeature.SeqFeature-class.html

This is a pretty common question on the forum BTW so make sure to search.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by jrj.healey12k

So i can't incorporate feature.id in my script and work. Are you saying instead of saying:

if product == 'Porphyromonas gingivalis major fimbrial subunit protein (FimA)' :

I should put:

if feature.id == 'Porphyromonas gingivalis major fimbrial subunit protein (FimA)' :

or: if 'FimA' in feature.id:

I also would like my script to use SeqIO.write to make the fasta file.

ADD REPLYlink written 2.2 years ago by tpaisie70
1

My mistake, I misunderstood what you were trying to do. Here's what you'd need to try. I haven't implimented a SeqIO write, but that should be fairly straightforward for you.

# $ python extractNAfeature.py genbank.gbk outfile.fasta

from Bio import SeqIO
import sys

product = 'NADH dehydrogenase subunit 5'

with open(sys.argv[2], 'w') as nfh:
        for rec in SeqIO.parse(sys.argv[1], "genbank"):
                if rec.features:
                        for feature in rec.features:
                                if feature.type == "CDS":
                                    if feature.qualifiers['product'][0] == product:
                                        nfh.write(">%s|%s from %s\n%s\n" % (
                                                  feature.qualifiers['gene'][0],
                                                  feature.qualifiers['product'][0],
                                                  rec.name,
                                                  feature.location.extract(rec).seq))

Note that that is kind of old syntax for building a string these days, but it works fine for this.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by jrj.healey12k
1

Thank you. I was able to adapt this for my own work. The code works great. Cheers.

ADD REPLYlink written 22 months ago by 7ejoseph20

So its when I used this code, just changed the product with the string i want to find, it gives me no error, but my output file is completely empty. I'm not sure what is going wrong, the code makes sense to me. With no error message i'm still kind of lost.

ADD REPLYlink written 2.2 years ago by tpaisie70

It needs to match exactly. You'll need to copy in the relevant section of the genbank and your exact search query for me to be able to help you any further. The code works just fine for me with a dummy genbank and random query

ADD REPLYlink written 2.2 years ago by jrj.healey12k

yep my "product =" was incorrect. So if i wanted to make have it search for a gene, so it just looks for:

product = "FimA"

instead of the exact phrase, how would I go about that?

ADD REPLYlink written 2.2 years ago by tpaisie70
1

Really easy. Python has nice tricks when it comes to comparing strings. If you simply change:

if feature.qualifiers['product'][0] == product:

to

if product in feature.qualifiers['product'][0]:

You'll get everything that contains that string. Bear in mind this will still be case sensitive, and will give you every gene that has that string in it's product line. Since product lines can be quite variable you might get some genes back which aren't FimA, but maybe interact with it and mention it in the product line, that kind of thing.

In the test files I'm using for example (a mitochondrial genome GBK) and a random gene, in this case, NADH, I get 6 results in my fasta file:

>ND1|NADH dehydrogenase subunit 1 from NC_010689
>ND2|NADH dehydrogenase subunit 2 from NC_010689
>ND3|NADH dehydrogenase subunit 3 from NC_010689
>ND4L|NADH dehydrogenase subunit 4L from NC_010689
>ND4|NADH dehydrogenase subunit 4 from NC_010689
>ND5|NADH dehydrogenase subunit 5 from NC_010689
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by jrj.healey12k
1

Thank you so much for your help! that worked!!

ADD REPLYlink written 2.2 years ago by tpaisie70

If it's all solved, make sure you accept this as an answer :)

ADD REPLYlink written 2.2 years ago by jrj.healey12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2047 users visited in the last hour