Question: Download all assemblies in a bioproject
20 months ago by
dazhudou1122110
dazhudou1122110 wrote:

Dear Biostars community,

I am trying to download all the assemblies in a BioProject: https://www.ncbi.nlm.nih.gov/bioproject/?term=474907

Can anyone tell me how to download them all without manually copying each link and downloading each assembly like this:

wget --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/793/975/GCF_004793975.1_ASM479397v1 ./

Any info will be greatly appreciated!

Best,

Wenhan

assembly • 1.4k views
ADD COMMENTlink modified 20 months ago by GenoMax96k • written 20 months ago by dazhudou1122110
  1. Visit SRA-explorer.
  2. Search with term PRJNA474907[All Fields]
  3. Select all results and add them to the collection.
  4. Go to your shopping cart (button at top of page).
  5. Select one of the convenient options to download.

Edit: Moved to a comment, since the OP wants the assembled data, not the original fastq files.

ADD REPLYlink modified 20 months ago • written 20 months ago by GenoMax96k

Thank you, genomax! That is a really nice way to do it, but I need the translated protein sequence files (.faa files), and this method does not seem to provide them. Am I missing something?

ADD REPLYlink written 20 months ago by dazhudou1122110

Crossposted at: http://seqanswers.com/forums/showthread.php?t=89870

ADD REPLYlink written 20 months ago by GenoMax96k
20 months ago by
lieven.sterck10.0k
VIB, Ghent, Belgium
lieven.sterck10.0k wrote:

Why not set up an FTP session to the NCBI FTP site:

ftp ftp.ncbi.nlm.nih.gov

When asked for a name, type anonymous, then follow the instructions for the password.

After that, cd to the folder you need (cf. the path part of the URL: genomes/all/GCF/004/793/975/GCF_004793975.1_ASM479397v1/), then use mget * to download all the files, or get <file> to download a single one.
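
For a non-interactive version of this session, the same commands can be fed to the ftp client from a script. This is only a sketch: it assumes a classic command-line ftp client that accepts -n (no auto-login), and both the helper name ftp_batch_commands and the e-mail address used as the anonymous password are placeholders.

```shell
# Print the commands for one anonymous FTP session that fetches a whole directory.
# $1 = remote directory, e.g. genomes/all/GCF/004/793/975/GCF_004793975.1_ASM479397v1
# The e-mail address below is a placeholder password (an assumption, not an NCBI requirement).
ftp_batch_commands() {
  cat <<EOF
user anonymous anonymous@example.com
binary
prompt
cd $1
mget *
bye
EOF
}

# Usage (needs network access):
#   ftp_batch_commands genomes/all/GCF/004/793/975/GCF_004793975.1_ASM479397v1 | ftp -n ftp.ncbi.nlm.nih.gov
```

Here "binary" keeps the .gz files intact and "prompt" toggles off the per-file confirmation that mget would otherwise ask for.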

ADD COMMENTlink modified 20 months ago • written 20 months ago by lieven.sterck10.0k
20 months ago by
AK2.0k
Taipei
AK2.0k wrote:

Hi dazhudou1122,

You can also use Entrez Direct:

(1) If you want to download all the files in each assembly directory:

esearch -db bioproject -query 474907 \
  | elink -target assembly \
  | esummary \
  | grep "FtpPath_RefSeq" \
  | sed -r 's|.+>(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.+/GCF_.+)<.+|\1|' \
  > list_asm.txt

wget -r -R "index.html" --no-host-directories --cut-dirs=6 -i list_asm.txt

Where list_asm.txt contains:

$ head -n3 list_asm.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/794/035/GCF_004794035.2_ASM479403v2
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/005/046/025/GCF_005046025.1_ASM504602v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/801/635/GCF_004801635.1_ASM480163v1

(2) If you want to download the assembly sequences only:

esearch -db bioproject -query 474907 \
  | elink -target assembly \
  | esummary \
  | grep "FtpPath_RefSeq" \
  | sed -r 's|.+>(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.+/)(GCF_.+)<.+|\1\2/\2_genomic.fna.gz|' \
  > list_fna.txt

wget -i list_fna.txt

Where list_fna.txt contains:

$ head -n3 list_fna.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/794/035/GCF_004794035.2_ASM479403v2/GCF_004794035.2_ASM479403v2_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/005/046/025/GCF_005046025.1_ASM504602v1/GCF_005046025.1_ASM504602v1_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/801/635/GCF_004801635.1_ASM480163v1/GCF_004801635.1_ASM480163v1_genomic.fna.gz
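
For the translated protein sequences asked about earlier in the thread, the directory list from (1) can be turned into per-file URLs without querying NCBI again. A minimal sketch, assuming list_asm.txt holds one assembly directory URL per line as shown above and that each directory follows the usual <assembly>_protein.faa.gz naming; asm_file_urls is a made-up helper name.

```shell
# Turn each assembly directory URL into the URL of one file inside that directory.
# $1 = file with one directory URL per line (e.g. list_asm.txt)
# $2 = file suffix after the assembly name (e.g. protein.faa.gz or genomic.fna.gz)
asm_file_urls() {
  # First strip any trailing slash, then repeat the last path component as the file prefix.
  sed -E "s|/*\$||; s|^(.*/)([^/]+)\$|\1\2/\2_$2|" "$1"
}

# Usage (needs network access):
#   asm_file_urls list_asm.txt protein.faa.gz > list_faa.txt
#   wget -i list_faa.txt
```

Assemblies without annotation will not have a protein file, so wget may report a few missing files.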
ADD COMMENTlink modified 20 months ago • written 20 months ago by AK2.0k

Thank you so much! I tried to download everything using method (1), but I ran into the following error (running up to the "> list_asm.txt" part):

Can't locate JSON/PP.pm in @INC (@INC contains: /home2/s154806/edirect/aux/lib/perl5 /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /home2/s154806/edirect/edirect.pl line 72.
BEGIN failed--compilation aborted at /home2/s154806/edirect/edirect.pl line 72.

Any ideas?

ADD REPLYlink written 20 months ago by dazhudou1122110

Hi dazhudou1122,

You need to install the Perl modules that EDirect depends on. For example, to install JSON::PP, run sudo perl -MCPAN -e shell and then, at the CPAN prompt, type install JSON::PP. Once the installation is done, you can test it with: perl -e 'use JSON::PP'.

ADD REPLYlink modified 20 months ago • written 20 months ago by AK2.0k
20 months ago by
GenoMax96k
United States
GenoMax96k wrote:

Since the OP wants to download the genome sequences along with the protein sequences, here is another option.

  1. Go to the Assembly links page for this Bioproject.
  2. Change the "Items per page" setting to 50.
  3. Select all assemblies by clicking the check box.
  4. Hit the Download Assemblies button.
  5. Select the source database and then choose the file type (you probably want Protein FASTA and Genomic FASTA, so two separate downloads).
  6. Download.
ADD COMMENTlink modified 20 months ago • written 20 months ago by GenoMax96k

Thank you! That's very helpful!

ADD REPLYlink written 20 months ago by dazhudou1122110