Question: How to download genome assemblies from NCBI with a list of GCA identifiers?
O.rka220 wrote:

I went to https://www.ncbi.nlm.nih.gov/assembly, typed in my organism, and now I want to download all of the assemblies that come up. If I click [Download Assemblies], it only downloads 1/22 of them, and it has been saying "calculating size..." for about 30 minutes now. I also tried https://github.com/kblin/ncbi-genome-download, but not all of the records were downloaded.

download data assembly ncbi • 2.2k views
modified 2.1 years ago by vkkodali2.2k • written 2.1 years ago by O.rka220

Are you sure you set the right filters in ngd (ncbi-genome-download), such as assembly level? It should download anything that's present in the assembly summary file.

written 2.1 years ago by Joe18k

Apparently the organism I wanted had all of its records in GenBank and not in RefSeq.

written 2.1 years ago by O.rka220

Yep, that is what I expected.

RefSeq is a subset of the total data in GenBank that has been curated to a high degree. Its records are "reference sequences".

written 2.1 years ago by Joe18k

I'm not sure if you can rely on this all the time, but IIRC the accessions starting with "GCA_" are from GenBank. Accessions from RefSeq tend to start with "GCF_".

written 2.1 years ago by kblin10
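
The prefix rule mentioned in the comment above can be sketched as a small shell helper. This is only an illustration: accession_source is a hypothetical function name, the accessions in the loop are made-up placeholders, and the sketch assumes the GCA_/GCF_ convention holds for your accessions.

```shell
# Classify an NCBI assembly accession by its prefix.
# Assumption: GCA_ means GenBank and GCF_ means RefSeq, as described above.
# accession_source is a hypothetical helper, not part of any NCBI tool.
accession_source() {
    case "$1" in
        GCA_*) echo "GenBank" ;;
        GCF_*) echo "RefSeq" ;;
        *)     echo "unknown" ;;
    esac
}

# Label a few placeholder accessions (not real records).
for acc in GCA_000000000.1 GCF_000000000.1 XYZ_1 ; do
    printf '%s\t%s\n' "$acc" "$(accession_source "$acc")"
done
```

A check like this might help explain why a tool restricted to RefSeq returns nothing for a list made up entirely of GCA_ accessions.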
vkkodali2.2k wrote:

UPDATE 2020-11-21

If you have a list of NCBI assembly accessions with GCA or GCF prefixes, the easiest way to download the data is to use the new NCBI Datasets tool. There is a web interface for this if you don't want to bother with the command line. If you do want to use the command line, you can use the datasets CLI tool as follows:

datasets download genome accession --inputfile assm_accs.txt --exclude-gff3 --exclude-protein --exclude-rna

where assm_accs.txt contains NCBI assembly accessions, one per line.
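
Since the input file needs exactly one accession per line, it may be worth cleaning a hand-made list first. Below is a minimal sketch of one way to do that. The function name clean_accessions and the sample values are hypothetical, and the pattern assumes accessions look like GCA_ or GCF_ followed by nine digits, a dot, and a version number.

```shell
# Read a raw list on stdin and print a cleaned, de-duplicated accession list.
# Assumed format: GCA_/GCF_ + nine digits + "." + version (e.g. GCA_000000001.1).
# clean_accessions is a hypothetical helper name.
clean_accessions() {
    tr -d '\r' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -E '^GC[AF]_[0-9]{9}\.[0-9]+$' \
        | sort -u
}

# Demo on made-up input: Windows line ending, padding, a blank line,
# a junk line, and a duplicate.
printf 'GCA_000000001.1\r\n  GCF_000000002.3 \n\nnot_an_accession\nGCA_000000001.1\n' \
    | clean_accessions
```

With a raw list in, say, raw_list.txt (a hypothetical file name), `clean_accessions < raw_list.txt > assm_accs.txt` would produce the file used in the command above.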

Note: Currently, only the latest assembly accessions for a taxon are in the scope of NCBI Datasets. If you want to download older assemblies, you will have to use Entrez Direct as follows:

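# Search the assembly database and print the GenBank FTP path of each hit.
esearch -db assembly -query 'Bos taurus[organism] AND latest[filter]' \
    | esummary \
    | xtract -pattern DocumentSummary -element FtpPath_GenBank \
    | while read -r line ;
    do
        # The last path component (GCA_...) doubles as the file name prefix.
        fname=$(echo "$line" | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/') ;
        wget "$line/$fname" ;
    done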

Here, I first fetch the FTP path for the GenBank assembly using EDirect tools and then use standard Linux commands to download the genomic FASTA file.
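
The file name step can also be written with plain shell parameter expansion instead of grep and sed. The sketch below is one possible variant, not the method used in the answer: genomic_fasta_url is a hypothetical helper name and the path in the demo is a made-up placeholder. It assumes, as the command above does, that the genomic FASTA file is named after the last component of the FTP path.

```shell
# Build the download URL of the genomic FASTA file from an assembly FTP path.
# Assumption: the file is named <last path component>_genomic.fna.gz.
# genomic_fasta_url is a hypothetical helper name.
genomic_fasta_url() {
    dir=${1%/}    # drop one trailing slash, if present
    printf '%s/%s_genomic.fna.gz\n' "$dir" "${dir##*/}"
}

# Demo with a placeholder path (not a real assembly).
genomic_fasta_url 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/000/000/GCA_000000000.1_ExampleAsm'
```

Because it does not search for the literal string GCA_, the same helper should also work on paths that end in a GCF_ directory.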

If your starting point is a file with a list of NCBI assembly accessions, you can wrap the command shown above in a bash loop like this:

while read -r acc ; do
    # </dev/null keeps esearch from consuming the accession list on stdin.
    esearch -db assembly -query "$acc" </dev/null \
        | esummary \
        | xtract -pattern DocumentSummary -element FtpPath_GenBank \
        | while read -r url ; do
            fname=$(echo "$url" | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/') ;
            wget "$url/$fname" ;
        done ;
    done < assm_accs.txt
modified 3 days ago • written 2.1 years ago by vkkodali2.2k

Hi! Great solution!

Do you think it would be possible to extract the list of downloaded GCAs, i.e. to a GCA_list.txt file?

written 9 months ago by agata10
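
One possible way to get such a list is to read the accessions back from the names of the downloaded files. This is a rough sketch under the assumption that the files keep the names produced by the commands in the answer (accession first, then the assembly name). accessions_from_filenames is a hypothetical helper name and the file names in the demo are made up.

```shell
# Read file names on stdin and print the accession each one starts with.
# Assumption: names look like GCA_<digits>.<version>_<assembly name>_genomic.fna.gz.
# accessions_from_filenames is a hypothetical helper name.
accessions_from_filenames() {
    sed 's#.*/##' \
        | grep -oE '^GC[AF]_[0-9]+\.[0-9]+' \
        | sort -u
}

# Demo on placeholder names; the file that is not an assembly is ignored.
printf '%s\n' './GCA_000000001.1_AsmA_genomic.fna.gz' \
              'GCA_000000002.2_AsmB_genomic.fna.gz' \
              'notes.txt' \
    | accessions_from_filenames
```

In the download directory, `printf '%s\n' GCA_*_genomic.fna.gz | accessions_from_filenames > GCA_list.txt` would then write the list to GCA_list.txt.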

Is there any way to read the query from a list.txt?

written 4 days ago by psschlogl30

Yes, I updated my answer to include additional options.

written 3 days ago by vkkodali2.2k

Thank you. It works very well.

written 2 days ago by psschlogl30