Download all assemblies in a bioproject
4.8 years ago
dazhudou1122 ▴ 140

Dear Biostars community,

I am trying to download all the assemblies in a BioProject: https://www.ncbi.nlm.nih.gov/bioproject/?term=474907 Can anyone tell me how to download them all without manually copying each link and downloading the assemblies one at a time, like this:

wget --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/793/975/GCF_004793975.1_ASM479397v1 ./

Any info will be greatly appreciated!

Best,

Wenhan

assembly • 5.2k views
  1. Visit SRA-explorer.
  2. Search with term PRJNA474907[All Fields]
  3. Select all results and add them to the collection.
  4. Go to your shopping cart (button at top of page).
  5. Select one of the convenient options to download.

Edit: Moved to comment since OP wants to get assembled data not original fastq files.


Thank you, genomax! It is a really nice way to do it! But I need the translated protein sequence files (.faa files), and this method does not seem to provide them... am I missing something?

4.8 years ago

Why not set up an FTP session to the NCBI FTP site?

ftp ftp.ncbi.nlm.nih.gov

When asked for a name, type anonymous, then follow the instructions for the password.

After that, cd to the folder you need (cf. the path part of the URL: genomes/all/GCF/004/793/975/GCF_004793975.1_ASM479397v1/), then run mget * to download all the files, or get with a file name to download a single file.
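
If you know the accession and the assembly name, the folder to cd into can be worked out rather than copied from the browser. This is only a sketch: the function name asm_ftp_dir is made up for illustration, and the layout (prefix, then the nine accession digits in groups of three) is inferred from the URLs in this thread.

```shell
# Build the FTP folder for an assembly from its accession and assembly name.
# Layout assumed from the URLs in this thread:
#   genomes/all/<GCF|GCA>/<3 digits>/<3 digits>/<3 digits>/<accession>_<name>
# asm_ftp_dir is a hypothetical helper; the ( ) body keeps its variables local.
asm_ftp_dir() (
  acc="$1"
  name="$2"
  prefix="${acc%%_*}"      # e.g. GCF
  digits="${acc#*_}"       # e.g. 004793975.1
  digits="${digits%%.*}"   # e.g. 004793975 (version suffix dropped)
  a=$(printf '%s' "$digits" | cut -c1-3)
  b=$(printf '%s' "$digits" | cut -c4-6)
  c=$(printf '%s' "$digits" | cut -c7-9)
  printf 'genomes/all/%s/%s/%s/%s/%s_%s\n' "$prefix" "$a" "$b" "$c" "$acc" "$name"
)

# Prints the folder used in the example above.
asm_ftp_dir GCF_004793975.1 ASM479397v1
```

The printed path is what you would pass to cd inside the ftp session.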

4.8 years ago
AK ★ 2.2k

Hi dazhudou1122,

Also Entrez Direct:

(1) If you want to download all the files in each assembly directory:

esearch -db bioproject -query 474907 \
  | elink -target assembly \
  | esummary \
  | grep "FtpPath_RefSeq" \
  | sed -r 's|.+>(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.+/GCF_.+)<.+|\1|' \
  > list_asm.txt

wget -r -R "index.html" --no-host-directories --cut-dirs=6 -i list_asm.txt

Where list_asm.txt contains:

$ head -n3 list_asm.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/794/035/GCF_004794035.2_ASM479403v2
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/005/046/025/GCF_005046025.1_ASM504602v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/801/635/GCF_004801635.1_ASM480163v1
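
To see what the sed step does without querying NCBI, it can be tried on a sample line. The sketch below is illustrative only: the XML line is reconstructed from the first URL in list_asm.txt, the empty element is an assumption about how an assembly without a RefSeq copy might look, and extract_refseq_path is a made-up name. It uses -E (equivalent to -r in GNU sed) and adds -n with a trailing p so that lines where the substitution fails are dropped instead of being passed through unchanged.

```shell
# Illustrative esummary line, reconstructed from the first URL in list_asm.txt.
line='  <FtpPath_RefSeq>ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/794/035/GCF_004794035.2_ASM479403v2</FtpPath_RefSeq>'
# Assumed shape of an entry with no RefSeq path (not taken from real output).
empty='  <FtpPath_RefSeq></FtpPath_RefSeq>'

# Hypothetical helper: same expression as in the pipeline, but -n and the
# trailing p print only the lines where the substitution succeeded.
extract_refseq_path() {
  sed -n -E 's|.+>(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.+/GCF_.+)<.+|\1|p'
}

# Only the line that actually carries a URL is printed.
printf '%s\n%s\n' "$line" "$empty" | extract_refseq_path
```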

(2) If you want to download the assembly sequences only:

esearch -db bioproject -query 474907 \
  | elink -target assembly \
  | esummary \
  | grep "FtpPath_RefSeq" \
  | sed -r 's|.+>(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.+/)(GCF_.+)<.+|\1\2/\2_genomic.fna.gz|' \
  > list_fna.txt

wget -i list_fna.txt

Where list_fna.txt contains:

$ head -n3 list_fna.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/794/035/GCF_004794035.2_ASM479403v2/GCF_004794035.2_ASM479403v2_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/005/046/025/GCF_005046025.1_ASM504602v1/GCF_005046025.1_ASM504602v1_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/801/635/GCF_004801635.1_ASM480163v1/GCF_004801635.1_ASM480163v1_genomic.fna.gz
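
Since the translated protein sequences (.faa) were also asked about, the same idea might be extended to those files. This is a sketch, not a tested recipe: to_protein_url is a made-up helper, and it assumes each assembly directory holds a file named after the directory with a _protein.faa.gz suffix, which will only exist for annotated assemblies.

```shell
# Hypothetical helper: turn each assembly directory URL (one per line, as in
# list_asm.txt) into the URL of its protein FASTA.
# Assumes the naming <directory>/<directory name>_protein.faa.gz.
to_protein_url() {
  while IFS= read -r dir; do
    [ -n "$dir" ] || continue     # skip blank lines
    dir="${dir%/}"                # tolerate a trailing slash
    printf '%s/%s_protein.faa.gz\n' "$dir" "${dir##*/}"
  done
}

# Demonstration on the first entry of list_asm.txt.
printf '%s\n' 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/794/035/GCF_004794035.2_ASM479403v2' \
  | to_protein_url
```

With list_asm.txt from (1) in place, something like to_protein_url < list_asm.txt > list_faa.txt followed by wget -i list_faa.txt should fetch the files.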

Thank you so much! I was trying to download everything using method (1), but I encountered the following error (running the pipeline up to the "> list_asm.txt" part):

Can't locate JSON/PP.pm in @INC (@INC contains: /home2/s154806/edirect/aux/lib/perl5 /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /home2/s154806/edirect/edirect.pl line 72.
BEGIN failed--compilation aborted at /home2/s154806/edirect/edirect.pl line 72.

Any ideas?


Hi dazhudou1122,

You need to install the Perl modules that EDirect depends on. For example, to install JSON::PP, run sudo perl -MCPAN -e shell, then type install JSON::PP at the prompt. Once the installation is done, you can test it with: perl -e 'use JSON::PP'.
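
If you would rather check before and after installing, a small wrapper around the same perl test could look like the sketch below. The function name have_perl_module is made up for illustration; it only reports whether the perl on your PATH can load the module.

```shell
# Hypothetical helper: succeeds if the perl on PATH can load the named module.
have_perl_module() {
  perl -M"$1" -e 1 >/dev/null 2>&1
}

# Report the status of the module named in the error message.
if have_perl_module JSON::PP; then
  echo "JSON::PP is available"
else
  echo "JSON::PP is missing; install it before running EDirect"
fi
```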

4.8 years ago
GenoMax 141k

Since the OP wants to download the genome sequence along with the protein sequence, here is another option.

  1. Go to the Assembly links page for this Bioproject.
  2. Change the "Items per page" setting to 50.
  3. Select all assemblies by clicking "Check box".
  4. Hit Download Assemblies button.
  5. Select the source database and then choose the file type (you probably want Protein FASTA and Genomic FASTA, so two separate downloads).
  6. Download.

Thank you! That's very helpful!
