Question: How to pull gene identifier from genbank file in Biopython
1
gravatar for mdelow
3.7 years ago by
mdelow10
Canada
mdelow10 wrote:

Hello!

I am very new to the world of python (or any coding), but have installed biopython which I believe will help me do what I am trying to accomplish. I am working with RNAseq data for my project, and wanting to run my genelists through DAVID to look at enrichment. The output of the RNAseq pipeline thru bowtie/tophat/cuffdiff, gives me log fold increases for each gene, with the chromosome identifier of SMaxxxxx. That would be fine, but DAVID does not seem to be able to handle the fact that my organism has 3 replicons (1 chromosome and 2 megaplasmids), that are SMa, SMb and SMc. When I try to use these identifiers as the gene list, DAVID is only able to find 1/3 (ie. one replicon) and identify them as genes. I was playing around and it seems like it can identify genes from all replicons if I use the gene identifier such as lacI, manB, etc. In the genbank file the gene notation I want is labelled: gene="xxxX"

 I want to do two things:

1) go thru each replicons genbank file and pull out the gene identifier for each gene in the 4 letter format

2) feed in a list of 150 genes in the SMaxxxx format, and have it output the 4 letter gene identifier for each gene

So far I have been able to read in a genbank file, but not pull out the information I want.

#to read in list of genes I want to pull out

with open('synthgenelistfordavid.txt') as f:

    content = f.readlines()

wanted = [line.rstrip('\n') for line in open('synthgenelistfordavid.txt')]

#then I've gotten this far, import file and set up a for loop

from Bio import SeqIO
genbank_file = "sma.gbk" 
for record in SeqIO.parse("sma.gbk", "genbank"):

parsing genbank biopython • 3.6k views
ADD COMMENTlink modified 3.7 years ago by Peter5.8k • written 3.7 years ago by mdelow10

Hi mdelow, 

For the first question : You can use BioPython to parse a genBank file. You have to know that, with BioPython if you want to have an access to a specific part of the file, you have to read the CookBook to find exactly what are you looking for and its name in the documentation. 
An alternative can be, if you have a "FLAG" to localize your information in the file, it is to develop you own Python Script. In you case, what are you interested in is the flag \gene="XXX". So, with the "if my_pattern in current_line" of python, you can try it.
Warning : If you have various genes in the file, you need to take it in account! ;)

For instance, in the CookBook, part 4.2.3, you have an example of the records objects attributes and what they are containing. I think not that exists an attribute "gene_name" with the "XXX".

For the second one, can you please clarify the point?

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by glihm590

Doing some more reading, what I want is the gene="XXX", the identifier I have is the locus_tag. For my organism, the locus_tag is in the format SMaXXXX - but the name of that identifier is locus_tag - sorry I hadn't figured that out earlier.

So what I am trying to do in both cases is use a list of locus tags and loop thru every coding sequence in the genbank file.

The list for #2 is roughly 150 locus tags, the list for #1 is more like 6,000 locus tags (every gene in the genome).

So for #2, if the locus_tag in each CDS matches a locus_tag in my list, I want it to print that locus_tag and the corresponding gene identifier.

Same thing for #1, but the reality is in that case, that it is just all the genes. I was thinking I could use a list of all the genes to keep the same "list" approach to the problem, but another approach would be to ask biopython to print the locus_tag and corresponding gene ID for every CDS in the genbank file.

My genbank file only has one LOCUS, but many coding sequences (CDS).

Here is an example of some CDS side by side in the file. All CDS have a locus_tag, not all CDS have a gene name, because many genes in the genome are not identified genes, they just code for "hypothetical proteins". For the downstram process of using the DAVID tool, I can't use hypothetical proteins anyways, so only the locus_tags with gene names are useful.

CDS             complement(12259..12546)

                     /locus_tag="SMa5000"
                     /codon_start=1
                     /protein_id="ACF07957.1"
                     /translation="MSSNVSIISPWNMEFIRSVLGEAGYNAKLLVDDPRSFETAELLV
                     TKLFLAGEDSRPGLAAKLECQLGKAGGYRRPPGSSLARYAIQGLPSELQFS"
                     /product="hypothetical protein"
                     /transl_table=11
     repeat_region   12658..13716
                     /locus_tag="SMa3000"
                     /mobile_element="insertion:ISRm2011-2/ISRm11"
                     /note=" predicted by homology"
     gene            12754..13161
                     /locus_tag="SMa0018"
     CDS             12754..13161
                     /locus_tag="SMa0018"
                     /codon_start=1
                     /protein_id="AAK64667.1"
                     /translation="MARPFSNDLRERVVDAVTGEGLSCRAAAKRFGIGISTAIDWVRR
                     FRETGSAAPGQMGGHKPRKLSGPHRAWLLCRCRERDFTLHGLVAELSERGLKVDYRAV
                     WTFVHEEGLSYKKRRWSPANGSGPTSPATGHDG"
                     /product="TRm2011-2a transposase"
                     /transl_table=11

gene            9252..11252
                     /locus_tag="SMa0015"
                     /gene="selB"
     CDS             9252..11252
                     /locus_tag="SMa0015"
                     /protein_id="AAK64665.1"
                     /gene="selB"
                     /transl_table=11
                     /codon_start=1
                     /product="Selenocysteine-specific elongation factor"
                     /translation="MIVGTAGHIDHGKTTLVKALTGVDTDRLKEEKARGITIDLGFAY
                     ARFAKDAVTGFVDVPGHERFIHTMLAGAGGIDYAMLVVAADDGIKPQTLEHLAILDLL
                     GVSRGLVAITKADLADPARLENLTDEIGAVLSSTSLRDAEILPVSVAAGQGIELLKER
                     LAAAECATAASAVGGRFRLAVDRSFTLSGAGTVVTGTVLSGSVGVGDQVTVSPAGRSA
                     RVRSIHAQNQRAERGFAGQRCALNLAGEGISKDAITRGDMVVDPHLHAPSDRLDADLS
                     VLESETKPIGEWFSARFHHASAETGIRIVPLEGPLLPGERRRVQLVLDRPIAAAVGDR
                     FILRDVSARRTIGGGRLLDLRAPARKRRSPERLSYLQAASLSHAGEALAALLDVPPFL
                     VDLDVFARDRALSEAELQNAIRSASAEVIEGSAVRHALSKRQRAAFSDEVQRVLSAFH
                     VENPDLQGIGRERLRLQVTPRLPPPAFLVALRAEQTAGRLVLEGAFVRLPGHEVRLSE
                     KEEELYARILPHLEGEERFRPPRVRDFAEALGVDEREIRRILKLCARLGRVDQIRHDH
                     FFTRQTTAEMVAIIRQVAANAERGEFSAGLFRDRVNNGRKVAIEILEFFDRQGVTIRH
                     GDVRRVNPHRLDLYEGPVPEADEGRGSSPVERPDFKFYRAGS"

 

Is this making more sense?

ADD REPLYlink written 3.7 years ago by mdelow10

I solved it by doing this:

 

with open('synthgenelistfordavid.txt') as f:
    content = f.readlines()

wanted = [line.rstrip('\n') for line in open('synthgenelistfordavid.txt')]

file=open("smcgenes.csv", "w")

from Bio import SeqIO
genbank_file = "smc.gbk" 
for record in SeqIO.parse("smc.gbk", "genbank"):
    for f in record.features:
    if f.type == "gene" and "locus_tag" in f.qualifiers:
        locus_tag = f.qualifiers["locus_tag"][0]
        if locus_tag in wanted:
        try:
            out1 = f.qualifiers["locus_tag"][0] + "," + f.qualifiers["gene"][0] + "\n"
            file.write(out1)
        except:
            out2 = f.qualifiers["locus_tag"][0] + ", no gene name" + "\n"
            file.write(out2)

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by mdelow10
0
gravatar for Peter
3.7 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

I think you would find this helpful, specifically the last section http://www.warwick.ac.uk/go/peter_cock/python/genbank/

Try something like this for map from the locus tags to the gene qualifiers:

from Bio import SeqIO
filename = "NC_005816.gb"
locus_to_gene = dict()
for record in SeqIO.parse(filename, "genbank"):
    for f in record.features:
        if f.type == "CDS":
            if "gene" in f.qualifiers:
                if "locus_tag" in f.qualifiers:
                    genes = f.qualifiers["gene"]
                    locus_tags = f.qualifiers["locus_tag"]
                    assert len(genes) == 1, genes
                    assert len(locus_tags) == 1, locus_tags
                    locus_to_gene[locus_tags[0]] = genes[0]
print("Mapped %i locus tags to genes" % len(locus_to_gene))
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Peter5.8k

I had already read this. I see that it's somewhat related, but do not know how to use that to accomplish what I am looking for.

ADD REPLYlink written 3.7 years ago by mdelow10

I've expanded on my answer based on the discussion above.

ADD REPLYlink written 3.7 years ago by Peter5.8k

If I am understanding this correctly, it would be printing a list of all of the locus tags and corresponding gene names in the genbank file?

ADD REPLYlink written 3.7 years ago by mdelow10

This returns: Mapped 0 locus tags to genes

When I run this:

Mapped 0 locus tags to genes

from Bio import SeqIO
filename = "sma.gbk"
locus_to_gene = dict()
for record in SeqIO.parse(filename, "genbank"):
    for f in record.features:
        if f.type == "CDS":
            if "gene" in f.qualifiers:
                if "locus_tag" in f.qualifiers:
                    genes = f.qualifiers["gene"]
                    locus_tags = f.qualifiers["locus_tag"]
                    assert len(genes) == 1, genes
                    assert len(locus_tags) == 1, locus_tags
                    locus_to_gene[locus_tags[0]] = genes[0]

print("Mapped %i locus tags to genes" % len(locus_to_gene))

 

 

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by mdelow10

Strange. Can you email me your GenBank  file? My GoogleMail address in on my BioStars profile.

ADD REPLYlink written 3.6 years ago by Peter5.8k

Thanks for your reply! It helps me a lot. Just a small question: my genbank file has "locus_tag" section but why biopython can not get access to it? A example:

     gene            complement(3015679..3037424)
                     /gene=ENSDARG00000043854
                     /locus_tag="ppil4"
                     /note="peptidylprolyl isomerase (cyclophilin)-like 4
                     [Source:ZFIN;Acc:ZDB-GENE-030131-6251]"
     mRNA            join(complement(3037328..3037424),
                     complement(3036319..3036386),complement(3035701..3035765),
                     complement(3035066..3035183),complement(3031417..3031559),
                     complement(3029838..3029934),complement(3028920..3029036),
                     complement(3028691..3028815),complement(3025931..3025997),
                     complement(3023570..3023681),complement(3020984..3021080),
                     complement(3020081..3020198),complement(3015679..3016986))
                     /gene="ENSDARG00000043854"
                     /note="transcript_id=ENSDART00000064390"
     CDS             join(complement(3037328..3037397),
                     complement(3036319..3036386),complement(3035701..3035765),
                     complement(3035066..3035183),complement(3031417..3031559),
                     complement(3029838..3029934),complement(3028920..3029036),
                     complement(3028691..3028815),complement(3025931..3025997),

When I try to use f.qualifiers["locus_tag"], there is KeyError. How can I get /locus_tag="ppil4" by using biopython? The GenBank file I use: ftp://ftp.ensembl.org/pub/release-79/genbank/danio_rerio/Danio_rerio.Zv9.79.chromosome.20.dat.gz

ADD REPLYlink modified 22 months ago by Peter5.8k • written 22 months ago by AlicePsyche20

The if "locus_tag" in f.qualifiers: condition should avoid the KeyError, see also Biopython: work with loops and qualifiers from multiple records

ADD REPLYlink written 22 months ago by Peter5.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 925 users visited in the last hour