How To Retrieve Genbank Records With Range Of Accession Numbers
4
6
11.8 years ago

A publication I was reading provided two ranges of GenBank accession numbers for supplementary data.

"The ESTs from GR_Ea and GR_Eb were deposited in GenBank under accession nos. CO069431–CO100583 and CO100584–CO132899."

If I search GenBank by a single accession number I have no problem pulling up a record, but I obviously don't want to do this for thousands of EST records. Is there a way to provide a range of accession numbers (as above) and retrieve all of these records from GenBank at once? I am using GenBank's web interface right now, but I would also like to know how to do this on the command line.

Thanks!

genbank • 36k views
ADD COMMENT
13
11.8 years ago
Rm 8.2k

Try this range query on the accession field:

http://www.ncbi.nlm.nih.gov/nucest?term=CO069431:CO100583[accn]

Alternatively, put the accession numbers in a file and upload it to NCBI Batch Entrez:

http://www.ncbi.nlm.nih.gov/sites/batchentrez?db=Nucleotide

For ESTs, use db=nucest:

http://www.ncbi.nlm.nih.gov/sites/batchentrez?db=Nucest
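
For the two ranges in the question, the upload file does not have to be typed by hand. The following is a minimal sketch, assuming a POSIX shell with awk; the helper name `accession_range` and the file name `accessions.txt` are made up for illustration and are not part of any NCBI tool.

```shell
#!/bin/sh
# accession_range PREFIX FIRST LAST  (hypothetical helper)
# Prints PREFIX followed by each number from FIRST to LAST,
# zero-padded to six digits, one accession per line.
accession_range() {
  # "s + 0" makes awk treat the argument as a decimal number,
  # so a leading zero (069431) is harmless.
  awk -v p="$1" -v s="$2" -v e="$3" \
    'BEGIN { for (n = s + 0; n <= e + 0; n++) printf "%s%06d\n", p, n }'
}

# Both ranges from the question, written to one file for upload
{
  accession_range CO 069431 100583
  accession_range CO 100584 132899
} > accessions.txt
```

The resulting `accessions.txt` has one accession per line, which is the layout Batch Entrez expects for an uploaded list.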

ADD COMMENT
2

Yet another pearl from the sea of NCBI...

ADD REPLY
1

Cool! I didn't know about this 'accn' field!

ADD REPLY
1

Useful link: How To: Download a large, custom set of records from NCBI: http://www.ncbi.nlm.nih.gov/guide/howto/dwn-records/

ADD REPLY
0

Great. This is what I was looking for. The filters are powerful...now I just need a reason to take the time to learn them!

ADD REPLY
2
11.8 years ago

You could try the following shell script (only your first range is shown here):

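# Fetch every accession in the first range (CO069431-CO100583) as FASTA
j=69431
while [ "$j" -le 100583 ]
do
   acn=$(printf "CO%06d" "$j")   # zero-pad to six digits, e.g. CO069431
   curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${acn}&rettype=fasta"
   j=$((j+1))
   sleep 0.4   # stay under NCBI's limit of 3 requests per second without an API key
done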

>gi|48738912|gb|CO069431.1|CO069431 GR__Ea26A01.r GR__Ea Gossypium raimondii cDNA clone GR__Ea26A01 3', mRNA sequence
GTGACCAGAGGCTACTTGATGCTAGCCTCTCGAGACCTCAGGCGTGCTAGAGCCGCAGCTCTCAACATCG
TCCCGACTTCACTGGTGCGGCAAAGGCCGTGGCCCTTGTACTCCCTACTCTCAAAGGCAAACTTAACGGC
ATCGCATTGCGTGTACCAACACCAAATGTGTCGGTGGTGGACCTAGTGGTCCAGGTTTCAAAGAAGACGT
TTGCTGAAGAGGTGAACGCTGCTTTCAAAGAGAGTGCAGAGAAAGAGCTACAGGGTATACTTTCAGTGTG
TGAAGAACCCCTCGTTTCAGTGGACTTCAGGTGCTCTGATGTGTCCTCCACCGTTGATGCATCACTCACC
ATGGTCATGGGAGATGACATGGTTAAGGTGATTGCTTGGTATGACAATGAGTGGGGCTACTCTCAAAGGG
TTGTGGATTTGGCTGACATTGTTGCCAATAGCTGGAAGTGATTTCAATGTGCTATACATACATATATGCA
TAACAATGTCACCGATGGTTGATTTTTGCATGCTCACTTCATTTTTATTCTTTCGGCTTCAGCAATTTCT
CATTTTGTCAAGGCTACTATATAATCTGTAATGTAATGTGGGATACATACATTCTCTAATATGCTTATGG
AATAAA

>gi|48738913|gb|CO069432.1|CO069432 GR__Ea26A02.f GR__Ea Gossypium raimondii cDNA clone GR__Ea26A02 5', mRNA sequence
AAAAAAAATTGGCCCTTTTTTTTAAAAAAAAGAGAAAAAGGGTCTTTGCCCCCAAAAAAAAAACCCCCCA
GGAATTTTTTCCCAAAATTCGGGGGACCCCCAAAAATTAAACAGGGAAATTGGCAATTTTACCCCCCCCC
CCCCCCCGGGGGGGGAAATTTAAGGGGAAAAAACCCAAAACAAAAGGGGGGCCCCCGGGTGGGGGGGGGA
CCCAATTCAGGACCCCCCCCCTCGGGGGGTCAAAAACCCGGGTTAAAAAACTTAAGAAACCCCTTTCCCA
GTTTCAGGGAAAATTTCTCCCCCCTTTTCGGGGGCTTCATTGGCTTTTTCAGCAGGGGGAAAGACATTTT
CCCATTCTTCCCTTCCAAAAAAAAACCCCGGCCCAAATTGGGGGGCCCCCCGCACCTGTCAAGGGGGGCA
CCAGGGGGCGGGCCCAGGGTTTCTTTAAAAAAAATGGGCAAAAAGGGGAAAGCTAATCCGGGCCCCCTAA
ACCCAAAAGCTTGTTTCCCTGGCCCCCC
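
The loop in this answer sends one HTTP request per accession, which is roughly 31,000 requests for the first range. efetch also accepts a comma-separated list in its `id` parameter, so several accessions can share one request. The sketch below only builds the batched URLs; the helper name `efetch_urls` is hypothetical, and the default batch size of 200 is an assumption based on NCBI's guidance to switch to HTTP POST for larger ID lists.

```shell
#!/bin/sh
# efetch_urls [BATCH_SIZE]  (hypothetical helper)
# Reads accessions from stdin, one per line, and prints one efetch URL
# per batch of BATCH_SIZE accessions (default 200).
efetch_urls() {
  base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&id="
  awk -v size="${1:-200}" -v base="$base" '
    NF == 0 { next }                                   # skip blank lines
    { ids = (count == 0 ? $1 : ids "," $1); count++ }  # append to current batch
    count == size { print base ids; ids = ""; count = 0 }
    END { if (count > 0) print base ids }              # flush the last partial batch
  '
}

# Small offline demonstration: three accessions, two per batch
printf 'CO069431\nCO069432\nCO069433\n' | efetch_urls 2

# To download (network access required, not run here):
#   efetch_urls 200 < accessions.txt | while read -r url; do
#     curl -s "$url" >> ests.fasta
#     sleep 0.4
#   done
```

The demonstration prints two URLs, the first ending in `id=CO069431,CO069432` and the second in `id=CO069433`.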
ADD COMMENT
1
5.3 years ago
cmdcolin ★ 2.3k

You can use the NCBI EDirect tools (brew install brewsci/bio/edirect) and run something like

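while read -r p; do echo "$p"; esearch -db nucleotide -query "$p" < /dev/null | efetch -format fasta > "$p.fasta"; done < file_with_ids.txt

or, more simply:

while read -r p; do echo "$p"; efetch -db nucleotide -id "$p" -format fasta < /dev/null > "$p.fasta"; done < file_with_ids.txt

(The < /dev/null redirects keep esearch and efetch from reading the rest of the ID list from the loop's standard input.)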

I mention both only because I have seen esearch piped to efetch elsewhere in the NCBI docs, but if you already have the ID it seems easier to pass it to efetch directly.

Note that you might also need to manually install the CPAN module Mozilla::CA, since the Homebrew formula doesn't seem to handle that properly.
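
If the accessions form contiguous ranges, as in the question, the per-ID loop may not be needed at all: the `[accn]` range syntax from the web search should also work as an esearch query. The following is a sketch under that assumption; `accn_query` is a hypothetical helper that only builds the query string, and the esearch call is left as a comment because it needs EDirect and network access.

```shell
#!/bin/sh
# accn_query FIRST:LAST [FIRST:LAST ...]  (hypothetical helper)
# Builds an Entrez query that matches every given accession range,
# joining the ranges with OR.
accn_query() {
  query=""
  for range in "$@"; do
    # Prefix " OR " only when the query already holds a range
    query="${query:+$query OR }${range}[accn]"
  done
  printf '%s\n' "$query"
}

# Both ranges from the question
accn_query CO069431:CO100583 CO100584:CO132899

# With EDirect installed (network access required, not run here):
#   esearch -db nucleotide -query "$(accn_query CO069431:CO100583)" |
#     efetch -format fasta > GR_Ea.fasta
```

This writes all records of a range to a single FASTA file instead of one file per accession, so it suits bulk retrieval better than per-record inspection.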

ADD COMMENT
1

Thanks for the command. It was very helpful!!

ADD REPLY
0

Hi Colin,

I provided the list of IDs in the text file, but it's not downloading the files. Where do I get the index file?

bash fetch.sh

Missing idxfile for option -i.

EFETCH - retrieve entries from sequence databases.

Synopsis: efetch -options [database:]<query>

Databases: SWissprot/SP, PIR, WOrmpep/WP, EMbl, GEnbank/GB, ProDom, ProSite

Options:
-a            Search with Accession number
-f            Fasta format output
-q            Sequence only output (one line)
-s <#>        Start at position #
-e <#>        Stop at position #
-o            More options and info...

-D <dir>      Specify database directory
-H            Display index header data
-p            Display entrynames in search path
-r            Print sequence in 'raw' format
-m            Fetch from mixed mini database
-M            Mini format output
-b            Do NOT reverse the order of bytes
                          (SunOS, IRIX do reverse, Alpha not)
-d <dbfile>   Specify database file (avoid this)
-i <idxfile>  Specify index file (avoid this)
-l <divfile>  Specify division lookup table (avoid this)
-B <database> Specify database (archaic)
-A            Only return entryname for accession number
-n <name>     Give the sequence this name
-x            Don't require query to match entry's name exactly (avoid)
-w            For Wormpep: also fetch cross-referenced SwissProt entry
-h            shows this help text

Environment:
SWDIR      = SwissProt directory - database and EMBL index files
PIRDIR     = PIR        -- " --
WORMDIR    = Wormpep    -- " --
EMBLDIR    = EMBL       -- " --
GBDIR      = Genbank    -- " --
PRODOMDIR  = ProDom     -- " --
PROSITEDIR = ProSite    -- " --
DBDIR      = User's own -- " -- (fasta format)

SEQDB      database file (default SwissProt)
SEQDBIDX   index file
DIVTABL    division lookup table

Ex. setenv DBDIR /pubseq/seqlibs/embl/

Note that Prodom family consensus seqs can be fetched by PD:_#

by Erik Sonnhammer (esr@sanger.ac.uk) Version 2.1,

ADD REPLY
0

Hi @sunnykevin97, what command did you run (e.g. what is in fetch.sh)? Also, my post doesn't use -i; it uses -id.

ADD REPLY
0
cat file_with_ids.txt | while read p; do echo $p; esearch -db nucleotide -query $p | efetch -format fasta > $p.fasta; done;
ADD REPLY
0

I couldn't tell you without more info. Try to give as much info as possible when asking questions; this saves everyone time. For reference, this works for me:

esearch -db nucleotide -query CO069432.1 | efetch -format fasta

Fundamentally, all my command does is run that in a loop.

ADD REPLY
0
11.8 years ago
Lee Katz ★ 3.1k

Pretty much the same answer as in a previous question, Downloading Fasta Files

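use strict;
use warnings;
use Bio::DB::GenBank;

# you could make an array of IDs you need to fetch
my $gb  = Bio::DB::GenBank->new();
my $seq = $gb->get_Seq_by_id('MUSIGHBA1'); # Unique ID
# regions to extract as [start, end] pairs (BioPerl coordinates are 1-based)
my @seqCoords = (
  [1, 100],
  [1000, 1100],
);
# extract the first region; index the array directly, with no extra dereference
my $subseq = $seq->subseq($seqCoords[0][0], $seqCoords[0][1]);
# then, look at the blast modules and SearchIO to see how to start blasting and parsing
# http://www.bioperl.org/wiki/HOWTOs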
ADD COMMENT
