Question: How to batch download SARS-CoV-2 sequence data from NCBI?
2001linana20 wrote:

Hi, I was trying to download SARS-CoV-2 sequence data from NCBI using this link: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049 When I click the empty box, I only get about 200 sequences each time. Is there a way to batch download all the genome sequence data with one click? I thought I had done this before, but I cannot recall how. Many thanks.

sequencing sequence • 267 views
modified 6 weeks ago by vkkodali2.4k • written 6 weeks ago by 2001linana20

You can look up the assembly IDs and download the files from the NCBI FTP site, for example:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/

written 6 weeks ago by Fatima890

Many thanks for your kind reply. Could you be a bit more specific?

written 6 weeks ago by 2001linana20

I clicked on the link you posted, clicked on the RefSeq Genome tab, then clicked on the assembly:

https://www.ncbi.nlm.nih.gov/assembly/GCF_009858895.2

Then I clicked on "FTP directory for GenBank assembly".

You can get the FASTA sequence from:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.fna.gz

And the gene annotation (GFF format) from:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/GCA_009858895.3_ASM985889v3_genomic.gff.gz
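
If you have more than one assembly, the links above suggest a pattern you can script: the directory is built from the accession prefix, the nine accession digits split into groups of three, and then accession_name. Here is a minimal sketch under that assumption. The function names are made up for illustration, and the assembly name has to be given exactly as it appears in the FTP path.

```shell
# Build the FTP directory URL for one assembly.
# $1 = assembly accession, e.g. GCA_009858895.3
# $2 = assembly name as used in the FTP path, e.g. ASM985889v3
# (hypothetical helper; assumes the path pattern seen in the links above)
assembly_ftp_dir() {
    acc=$1
    name=$2
    prefix=${acc%%_*}      # GCA or GCF
    digits=${acc#*_}       # 009858895.3
    digits=${digits%%.*}   # 009858895
    p1=$(printf '%s' "$digits" | cut -c1-3)
    p2=$(printf '%s' "$digits" | cut -c4-6)
    p3=$(printf '%s' "$digits" | cut -c7-9)
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s\n' \
        "$prefix" "$p1" "$p2" "$p3" "$acc" "$name"
}

# Download the genomic FASTA for one assembly into the current directory.
# Requires curl on PATH.
fetch_assembly_fasta() {
    dir=$(assembly_ftp_dir "$1" "$2")
    curl -fsSLO "${dir}/${1}_${2}_genomic.fna.gz"
}

# Example batch use, assuming a file assemblies.txt (hypothetical) with one
# "accession name" pair per line:
#   while read -r acc name; do fetch_assembly_fasta "$acc" "$name"; done < assemblies.txt
```

Nothing is downloaded until you call fetch_assembly_fasta, so you can print the URLs with assembly_ftp_dir first and check a few in the browser.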

modified 6 weeks ago • written 6 weeks ago by Fatima890
vkkodali2.4k wrote:

You can use NCBI Datasets for this. A dedicated page for Coronavirus Datasets is available. If you prefer, a command-line tool is also available. For example, you can use it to download SARS-CoV-2 data as shown below:

datasets download virus genome taxon sars-cov-2 --complete-only --filename virus.zip
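
To check what actually arrived, something like the sketch below may help. It wraps the same command, unpacks the archive, and counts FASTA records. The function names, the default output directory virus_data, and the assumption that sequence files end in .fna are mine, not part of the tool. The layout inside the zip is not assumed, so the script searches for the files instead.

```shell
# Count FASTA records in a file (each record starts with a '>' header line).
count_fasta_records() {
    grep -c '^>' "$1"
}

# Download complete SARS-CoV-2 genomes and report record counts.
# Requires the NCBI datasets CLI and unzip on PATH.
# $1 = zip file name (default virus.zip), $2 = output directory (default virus_data)
download_sars2_genomes() {
    out_zip=${1:-virus.zip}
    out_dir=${2:-virus_data}
    datasets download virus genome taxon sars-cov-2 --complete-only \
        --filename "$out_zip" || return 1
    unzip -q -o "$out_zip" -d "$out_dir" || return 1
    # Assumption: sequence files carry a .fna extension somewhere in the archive.
    find "$out_dir" -type f -name '*.fna' | while read -r fasta; do
        printf '%s\t%s\n' "$fasta" "$(count_fasta_records "$fasta")"
    done
}
```

Running download_sars2_genomes with no arguments should leave virus.zip and a virus_data directory in the current folder and print one line per FASTA file with its record count.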
written 6 weeks ago by vkkodali2.4k

It looks like NCBI has 12 genomes of the original SARS virus (SARS total minus SARS-CoV-2). Can those be separately categorized in a link on the genome page?

Update: If I change the host setting from Human to All hosts, it now shows 30246 SARS genomes but no SARS-CoV-2. Something does not seem right.

modified 6 weeks ago • written 6 weeks ago by GenoMax94k
GenoMax94k wrote:

I click the empty box, I can only get like 200 sequences, each time.

Try this: do not click any boxes. Click the Download button at the top. In step 2, Download All Records should be selected automatically. This downloads ALL sequences. As of today that number stands at 43676 genomes (~1.2 GB file).

modified 6 weeks ago • written 6 weeks ago by GenoMax94k