Tool: [share] Script to bulk retrieve DNA sequences by protein IDs from GenBank
8
gravatar for qiyunzhu
4.9 years ago by
qiyunzhu130
United States
qiyunzhu130 wrote:

Hi all,

I recently wrote a very simple Python script, and guess it would be useful for some people. You throw in a list of protein accessions or GIs, and the script will download the corresponding DNA sequences (CDS) from NCBI GenBank.

Here is the download link: https://drive.google.com/uc?export=download&id=0BxcrsszS5z4FMGt4UklSeGo4UWs

Have fun!

Best,

 

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by qiyunzhu130
1

sorry to ruin the party, but there are already the Entrez Direct E-utilities tools that allow you to do so. Nice work anyway!

ADD REPLYlink written 4.9 years ago by Giovanni M Dall'Olio26k

Oh they look quite handy! I knew E-utilities before but I didn't know that they can be run as unix commands. Thanks for sharing.

ADD REPLYlink written 4.9 years ago by qiyunzhu130

Actually I find it extremely helpful - thanks for sharing it!

EDIT: After testing the script with more sequences, it is failing to retrieve some 10% of the CDS. I tried to re-run these which failed, but they failed again. "Supplied id parameter is empty." was retrieved for them instead of their sequence. I checked for these protein IDs their GenBank entries and the CDS information is there. So either there are some hidden problems in the GenBank or your script has some flaws.

ADD REPLYlink modified 10 months ago • written 3.1 years ago by al-ash110

Thanks for your comment! Can you provide some examples (protein accession no.s) that result in "supplied id parameter is empty"? Let me see if I can fix it.

ADD REPLYlink written 3.1 years ago by qiyunzhu130

Sure, e.g.: XP_012247746 XP_012248212 XP_012245590 (I tried to run the python script twice so it should not be just a temporal NCBI hiccup)

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by al-ash110

While using your script again after some time I found that all the sequences for which retrieval fails ('Supplied id parameter is empty.'), are label in GenBank as "partial mRNA", i.e. not full CDS while it did not fail for any of my protein IDs for which there is a corresponding full mRNA in the database.

UPDATE: the incomplete CDS have ">" or "<" characters in their header to indicate their incompletness at the 3' resp 5' end, i.e. XP_012174533.1|XM_012319143.2:<1..942 or XP_012174003.2|XM_012318613.2:20..>1197 ...maybe handling these characterscauses some problems to the script...

ADD REPLYlink modified 19 months ago • written 19 months ago by al-ash110
2
gravatar for lelle
4.9 years ago by
lelle800
Berlin
lelle800 wrote:

I was just working on something similar, so I gave it a try.

Here is some feedback:

It does not adhere to the eutils user guideline It is very likely that it will make more than 3 requests per second.

It does not work for all entries (e.g. this one).

It should give some indication if it can not find a DNA sequence for a protein and not fail silently like it is now.

ADD COMMENTlink written 4.9 years ago by lelle800

Thanks for your comments. All of them are helpful to me. I should have spent a little more time on error handling.

ADD REPLYlink written 4.9 years ago by qiyunzhu130
0
gravatar for 5heikki
4.9 years ago by
5heikki8.5k
Finland
5heikki8.5k wrote:

Here's a script that utilizes Entrez Direct and protein GIs:

#!/bin/bash
IFS=$'\n'

if [ -n "$1" ]
then
split -l 500 $1 input.

for f in input.*
do
query=$(cat $f | tr "\n" ",")
efetch -db protein -id $query -format fasta > $f.faa.tmp
rm $f
done

cat *.faa.tmp | grep . > $1.faa
rm *.faa.tmp

else
echo "Usage: fetchSeqs listOfNcbiGis"
echo "Requires Entrez Direct in \$PATH: http://www.ncbi.nlm.nih.gov/books/NBK179288/"
fi

 

edit. Actually, it doesn't do what OP's script does. Just proteins with protein GIs. If you want nucleotides with protein GIs, replace the efetch line with:

epost -db protein -id $query | elink -target nuccore | efetch -format fasta_cds_na > $f.faa.tmp

edit2. actually, I think this fetches all CDS of the related nt molecule. I linked gi to cds_nt before but can't remember right now how to do it :(

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by 5heikki8.5k

This also fails for my example record above, but I start to think that is because the information is just not there.

ADD REPLYlink written 4.9 years ago by lelle800

NCBI records seem to have lots of issues. For example, some sequences point to an organism, but its taxonomic information is obsolete. It is indeed tricky to work with NCBI data though.

ADD REPLYlink written 4.9 years ago by qiyunzhu130
0
gravatar for Chrispin Chaguza
4.9 years ago by
Wellcome Sanger Institute
Chrispin Chaguza250 wrote:

Here is python script that can also be used to download GenBank files given accession numbers.

Copy the python code below and save it as fetch_accession.py. Create a text file and give it a name say accessions.txt. Write these accession numbers (AP010935.1, AE014074.1, CP001129.1) to the file and save.

Run it as follows: ./fetch_accession.py -a accessions.txt

The output should files with the accession id followed by .gb and .fasta extensions.

***************************

#!/usr/bin/env python
import os, sys, glob
from Bio import SeqIO, Entrez
import argparse

def main():
    opts=argparse.ArgumentParser(sys.argv[0],description="fetch genbank files given accessions",
        prefix_chars="-",add_help=True,epilog="2014")
    opts.add_argument("-a",action="store",nargs=1,metavar="ACCESSION FILE",dest="ACCESSION",help="specify accessions file",required=True)
    options=opts.parse_args()
    input_file=options.ACCESSION[0]
    indata=""
    ncounter=0
    try:
        indata=[str(j).strip() for j in open(str(input_file),"rU")]
    except IOError as inError:
        print "Error: Failed to open input file "+str(input_file)+"..."
        print "\n\nException caught: \n",inError
        sys.exit()

    try:
        for i in list(indata):
            ncounter=ncounter+1
            print "Parsing...\t"+str(i)+"\t"+str(ncounter)+"/"+str(len(indata))
            seqfile=Entrez.efetch(email="example@email.com",db="nucleotide",id=str(i),retmode="text",rettype="gb")
            gb_record=SeqIO.read(seqfile,"genbank")
            SeqIO.write(gb_record,str(i)+".gb","genbank")
            SeqIO.write(gb_record,str(i)+".fasta","fasta")
    except Exception as exError:
        print "Error: Problem encountered while downloading files..."
        sys.exit()
if __name__ == "__main__":
    main()

 

ADD COMMENTlink written 4.9 years ago by Chrispin Chaguza250
0
gravatar for qiyunzhu
4.9 years ago by
qiyunzhu130
United States
qiyunzhu130 wrote:

Thanks guys for your insightful comments and contributions to this issue!

ADD COMMENTlink written 4.9 years ago by qiyunzhu130

Hello every one

I have a list of genes accession number and i want to download corresponding CDS sequence for each from NCBI in fasta format kindly guide me how can i do so ??? I'm new in bioinformatics ..

and the following link is daed .. :(

Here is the download link: https://drive.google.com/uc?export=download&id=0BxcrsszS5z4FMGt4UklSeGo4UWs

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by adeena_hassan40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1604 users visited in the last hour