Question: Downloading plasmid sequences using refseq ids
4.7 years ago by
bioinfo690
EU
bioinfo690 wrote:

Is there an easy way to download plasmid sequences from the NCBI plasmids FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/Plasmids) using a list of RefSeq IDs stored in a file?

The RefSeq IDs look like this in a text file:

NC_017775
NC_017810
NC_017776
NC_017777
.........
NC_017811
NC_017778
NC_017779

fasta plasmids sequence perl ncbi • 3.1k views
4.6 years ago by
elucify10
elucify10 wrote:

An easier solution:

curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_017775,NC_017810&rettype=fasta'

To get more sequences, just add RefSeq accession numbers, comma-separated, to the id list. This works until the URL reaches about 2k characters in length. If you have more sequences than that, break your list of IDs into comma-separated blocks of just under 2k characters each and iterate over the blocks.
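The block-wise approach described above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the names EFETCH_BASE, MAX_ID_CHARS, batch_ids and fetch_batches are made up for the example, the 1800-character budget is an assumed safety margin under the "about 2k" limit, and the one-second pause between requests is a conservative guess at polite usage.

```shell
#!/bin/sh
# Sketch only: all names below are hypothetical helpers for this example.
EFETCH_BASE='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
MAX_ID_CHARS=1800   # assumed budget for the id list; leaves room for the rest of the URL

# Read one accession per line on stdin and print comma-separated blocks,
# one block per line, each at most $1 (default MAX_ID_CHARS) characters long.
# The body runs in a subshell ( ... ) so its variables stay private.
batch_ids() (
  max="${1:-$MAX_ID_CHARS}"
  block=""
  while read -r id || [ -n "$id" ]; do
    # Keep only lines that look like RefSeq accessions (two capitals, then "_");
    # this skips blank lines and filler such as ".........".
    case "$id" in
      [A-Z][A-Z]_*) ;;
      *) continue ;;
    esac
    if [ -n "$block" ] && [ $(( ${#block} + ${#id} + 1 )) -gt "$max" ]; then
      printf '%s\n' "$block"      # block is full: emit it and start a new one
      block="$id"
    else
      block="${block:+$block,}$id"
    fi
  done
  if [ -n "$block" ]; then
    printf '%s\n' "$block"        # emit the last, partly filled block
  fi
)

# Read blocks on stdin, fetch each one with curl, write FASTA to stdout.
fetch_batches() (
  while read -r block; do
    curl -s "${EFETCH_BASE}?db=nucleotide&rettype=fasta&retmode=text&id=${block}"
    sleep 1   # assumed pause between requests
  done
)
```

With the IDs in a file (ids.txt is a placeholder name), the two helpers would be chained as: batch_ids < ids.txt | fetch_batches > plasmids.fasta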

4.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

Search this site for EFetch:

 printf 'NC_017775\nNC_017810\n' | while read -r ACN; do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${ACN}&retmode=text&rettype=fasta"; done

I was thinking that if I first download all the sequences from the FTP site and then extract the required FASTA sequences using the RefSeq ID list in the text file, that should work as well. I just tried this but got stuck:

cut -c 1- test.id.txt | xargs -n 1 samtools faidx large.dowloaded.plasmid.fasta

This requires the entire FASTA header as the ID in the test.id.txt file, e.g. gi|386858858|ref|NC_017775.1| Borrelia crocidurae str. Achema plasmid unnamed, complete sequence

So I need to match the RefSeq IDs (e.g. NC_017775.1) within the FASTA header instead of matching the entire header. Any suggestions?

Reply written 4.7 years ago by bioinfo690 (modified 4.7 years ago)

To use tabix, the files would have to be indexed with tabix on the NCBI side, and they are not. Search Biostars for how to 'grep' a FASTA file by sequence name.

ADD REPLYlink written 4.7 years ago by Pierre Lindenbaum118k
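For the local-extraction route discussed in the comments above, one possible way to match on the RefSeq ID inside the header, rather than on the whole header line, is a small awk filter. This is a sketch under assumptions and not something posted in the thread: extract_by_accession is a made-up helper name, headers are assumed to be either in the gi|...|ref|ACCESSION| style shown above or to start directly with the accession, and the version suffix (.1, .2, ...) is deliberately ignored so that NC_017775 in the ID list matches NC_017775.1 in the FASTA file.

```shell
#!/bin/sh
# Sketch only: extract_by_accession is a hypothetical helper name.
# Usage: extract_by_accession IDS_FILE FASTA_FILE
# Prints every FASTA record whose accession appears in IDS_FILE.
extract_by_accession() {
  awk '
    # First file (the ID list): remember each accession without its version.
    NR == FNR {
      if (NF) { sub(/\.[0-9]+$/, "", $1); want[$1] = 1 }
      next
    }
    # Header line of the FASTA file: work out the accession of this record.
    /^>/ {
      acc = substr($1, 2)              # first word, without the leading ">"
      n = split(acc, f, "|")
      for (i = 1; i < n; i++)          # old style: take the field after "ref"
        if (f[i] == "ref") acc = f[i + 1]
      sub(/\.[0-9]+$/, "", acc)        # drop the version suffix
      keep = (acc in want)
    }
    # Print header and sequence lines while the current record is wanted.
    keep
  ' "$1" "$2"
}
```

With the file names from the comment above, the call would look like: extract_by_accession test.id.txt large.dowloaded.plasmid.fasta > selected.fasta (selected.fasta is a placeholder output name).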
4.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:
A few weeks later, to complete what @elucify said: you can group your input with xargs:
seq 25 100 |\
xargs -n5 -r echo | tr " " "," |\
while read -r F; do curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&id=${F}"; done