Question

Downloading plasmid sequences using refseq ids

10.3 years ago

bioinfo ▴ 840

Is there an easy way to download plasmid sequences from the NCBI plasmids site using a list of RefSeq IDs stored in a file?

The RefSeq IDs are listed one per line in a txt file, like this:

NC_017775
NC_017810
NC_017776
NC_017777
.........
NC_017811
NC_017778
NC_017779

perl sequence fasta NCBI plasmids • 5.1k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by bioinfo ▴ 840

score 1 · Answer 1 · 2014-09-10

Easier solution

curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_017775,NC_017810&rettype=fasta'

To get more sequences, just add RefSeq accession numbers, comma-separated, to the list. This works as long as the URL stays under about 2k characters. If you have more sequences than that, break your list of IDs into comma-separated blocks that keep each URL just under 2k, and iterate over the blocks.
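The blocking described above could be sketched roughly as follows. This is only an illustration: the function names `batch_ids` and `efetch_url`, the file names `ids.txt` and `plasmids.fasta`, and the 1800-character default are assumptions made for the example, not part of the answer. The one-second pause in the usage comment is likewise an assumption meant to stay well within NCBI's request-rate guidelines.

```shell
# batch_ids [MAXLEN]: read one accession per line on stdin and print
# comma-separated batches, each at most MAXLEN characters long
# (hypothetical helper; default of 1800 leaves room for the base URL).
batch_ids() {
    local max_len="${1:-1800}" batch="" acc
    while IFS= read -r acc || [ -n "$acc" ]; do
        # skip empty lines in the ID list
        if [ -z "$acc" ]; then
            continue
        fi
        if [ -z "$batch" ]; then
            batch="$acc"
        elif [ $(( ${#batch} + 1 + ${#acc} )) -le "$max_len" ]; then
            batch="$batch,$acc"
        else
            # current batch is full: emit it and start a new one
            printf '%s\n' "$batch"
            batch="$acc"
        fi
    done
    if [ -n "$batch" ]; then
        printf '%s\n' "$batch"
    fi
}

# efetch_url IDS: print the efetch URL for a comma-separated ID list
# (hypothetical helper wrapping the URL used in the answer).
efetch_url() {
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&retmode=text&id=%s\n' "$1"
}

# Possible usage (performs network requests, so it is left commented out):
#   batch_ids 1800 < ids.txt | while IFS= read -r ids; do
#       curl -s "$(efetch_url "$ids")"
#       sleep 1
#   done > plasmids.fasta
```

Batching by character count rather than by a fixed number of IDs keeps each request under the length limit even if the accessions vary in length.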

score 0 · Answer 2 · 2014-07-21

10.3 years ago

Pierre Lindenbaum 164k

Search this site for Efetch:

echo -e "NC_017775\nNC_017810" | while read -r ACN; do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${ACN}&retmode=text&rettype=fasta"; done

ADD COMMENT • link 10.3 years ago by Pierre Lindenbaum 164k

I was thinking that if I first download all the sequences from the FTP site and then extract the required FASTA sequences using the RefSeq ID list in the txt file, that should work as well. I just tried this but got stuck:

cut -c 1- test.id.txt | xargs -n 1 samtools faidx large.dowloaded.plasmid.fasta

(It requires the entire FASTA header as the ID in the test.id.txt file, e.g. gi|386858858|ref|NC_017775.1| Borrelia crocidurae str. Achema plasmid unnamed, complete sequence.)

So I need to match the RefSeq IDs (e.g. NC_017775.1) within the FASTA header instead of matching the entire header. Any suggestions?

ADD REPLY • link 10.3 years ago by bioinfo ▴ 840

To use tabix, the files would have to be indexed with tabix on the NCBI side, and they are not. Search Biostars for how to 'grep' a FASTA file by sequence name.

ADD REPLY • link 10.3 years ago by Pierre Lindenbaum 164k
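One way the 'grep by name' idea from this exchange might look, as a hedged sketch rather than anyone's posted solution: an awk filter that splits each FASTA header on `|`, strips any version suffix such as `.1`, and keeps records whose accession appears in the ID list. The function name `extract_by_accession` and the output file name in the usage comment are made up for this example; `test.id.txt` and `large.dowloaded.plasmid.fasta` are the files named in the comment above.

```shell
# extract_by_accession IDS_FILE FASTA_FILE
# Print the FASTA records whose header contains one of the accessions
# listed (one per line) in IDS_FILE. Version suffixes like ".1" are
# ignored on both sides. Hypothetical helper for illustration only.
extract_by_accession() {
    awk '
        # First file: remember each wanted accession without its version.
        NR == FNR {
            sub(/\.[0-9]+$/, "", $1)
            if ($1 != "") wanted[$1] = 1
            next
        }
        # Header line: check every "|"-separated field of the first word.
        /^>/ {
            keep = 0
            n = split(substr($1, 2), parts, "|")
            for (i = 1; i <= n; i++) {
                acc = parts[i]
                sub(/\.[0-9]+$/, "", acc)
                if (acc in wanted) keep = 1
            }
        }
        # Print header and sequence lines of records that matched.
        keep { print }
    ' "$1" "$2"
}

# Possible usage:
#   extract_by_accession test.id.txt large.dowloaded.plasmid.fasta > selected.fasta
```

Because the header is split into fields, the same filter should also accept plain headers such as `>NC_017775.1 Borrelia ...` that lack the `gi|...|ref|` prefix.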

score 0 · Answer 3 · 2014-09-10

10.2 years ago

Pierre Lindenbaum 164k

A few weeks later, to complete what @elucify said: you can group your input with xargs. Here `seq 25 100` just generates example numeric IDs, and `-n5` sends five IDs per request:

seq 25 100 |\
xargs -n5 -r echo | tr " " "," |\
while read -r F; do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&id=${F}"; done

ADD COMMENT • link 10.2 years ago by Pierre Lindenbaum 164k