Question

Download all assemblies in a bioproject

4.8 years ago

dazhudou1122 ▴ 140

Dear Biostars community,

I am trying to download all the assemblies in a BioProject: https://www.ncbi.nlm.nih.gov/bioproject/?term=474907

Can anyone tell me how to download them all at once, without manually copying each link and downloading the assemblies one by one like this?

wget --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/793/975/GCF_004793975.1_ASM479397v1 ./

Any info will be greatly appreciated!

Best,

Wenhan

assembly • 5.2k views

Updated 4.8 years ago by GenoMax 141k • written 4.8 years ago by dazhudou1122 ▴ 140

Visit SRA-explorer.
Search with term PRJNA474907[All Fields]
Select all results and add them to the collection.
Go to your shopping cart (button at top of page).
Select one of the convenient options to download.

Edit: Moved to comment since OP wants to get assembled data not original fastq files.

4.8 years ago by GenoMax 141k

Thank you, GenoMax! It is a really nice way to do it! But I need the translated protein sequence files (.faa files), and this method does not seem to provide them... am I missing something?

4.8 years ago by dazhudou1122 ▴ 140

Crossposted at: http://seqanswers.com/forums/showthread.php?t=89870

4.8 years ago by GenoMax 141k

score 1 · Answer 1 · 2019-06-21

Why not set up an FTP session to the NCBI FTP site?

ftp ftp.ncbi.nlm.nih.gov

When asked for a name, type anonymous, then follow the instructions for the password.

After that, cd to the folder you need (cf. the path part of the URL: genomes/all/GCF/004/793/975/GCF_004793975.1_ASM479397v1/), then use mget * to download every file in it, or get <file> to download a single file.
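
The interactive session above can also be scripted. The following is a minimal sketch, assuming a classic command-line ftp client that accepts -n (no auto-login) and -i (no per-file prompting during mget) and reads its commands from standard input. The function name ftp_batch and the e-mail placeholder used as the anonymous password are illustrative and not part of the original answer.

```shell
#!/usr/bin/env bash
# Print the commands for a non-interactive anonymous FTP session that
# downloads every file in one remote directory.
# ftp_batch is a hypothetical helper name; the password is a placeholder.
ftp_batch() {
  local remote_dir="$1"
  printf '%s\n' \
    'user anonymous anonymous@example.com' \
    'binary' \
    "cd ${remote_dir}" \
    'mget *' \
    'bye'
}

# Usage (needs network access, so it is left commented out):
# ftp_batch genomes/all/GCF/004/793/975/GCF_004793975.1_ASM479397v1 \
#   | ftp -n -i ftp.ncbi.nlm.nih.gov
```

Generating the command list in a function keeps the directory as the only thing that changes, so the same call could be repeated in a loop over several assembly directories.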

score 1 · Answer 2 · 2019-06-21

Hi dazhudou1122,

You can also use Entrez Direct:

(1) If you want to download all the files in each assembly directory:

esearch -db bioproject -query 474907 \
  | elink -target assembly \
  | esummary \
  | grep "FtpPath_RefSeq" \
  | sed -r 's|.+>(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.+/GCF_.+)<.+|\1|' \
  > list_asm.txt

wget -r -R "index.html" --no-host-directories --cut-dirs=6 -i list_asm.txt

Where list_asm.txt contains:

$ head -n3 list_asm.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/794/035/GCF_004794035.2_ASM479403v2
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/005/046/025/GCF_005046025.1_ASM504602v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/801/635/GCF_004801635.1_ASM480163v1
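
To make the grep and sed stages easier to follow, here is a small offline sketch that runs the same substitution on sample lines instead of live esummary output. The sample XML lines and the function name extract_refseq_paths are assumptions for illustration. The sketch uses sed -E (accepted by both GNU and BSD sed) together with -n and the p flag, so lines where the pattern does not match, such as an empty FtpPath_RefSeq element, are dropped rather than passed through unchanged.

```shell
#!/usr/bin/env bash
# Keep only the RefSeq FTP directory from esummary-style XML lines on stdin.
# extract_refseq_paths is a hypothetical helper name.
extract_refseq_paths() {
  grep 'FtpPath_RefSeq' \
    | sed -nE 's|.+>(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.+/GCF_.+)<.+|\1|p'
}

# Sample input (assumed shape of two esummary lines; the second one is empty).
sample='    <FtpPath_RefSeq>ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/794/035/GCF_004794035.2_ASM479403v2</FtpPath_RefSeq>
    <FtpPath_RefSeq></FtpPath_RefSeq>'

# Only the line with a real path survives.
printf '%s\n' "$sample" | extract_refseq_paths
```

In the real pipeline the function would take the place of the grep and sed stages, reading directly from the output of esummary.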

(2) If you want to download the assembly sequences only:

esearch -db bioproject -query 474907 \
  | elink -target assembly \
  | esummary \
  | grep "FtpPath_RefSeq" \
  | sed -r 's|.+>(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.+/)(GCF_.+)<.+|\1\2/\2_genomic.fna.gz|' \
  > list_fna.txt

wget -i list_fna.txt

Where list_fna.txt contains:

$ head -n3 list_fna.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/794/035/GCF_004794035.2_ASM479403v2/GCF_004794035.2_ASM479403v2_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/005/046/025/GCF_005046025.1_ASM504602v1/GCF_005046025.1_ASM504602v1_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/801/635/GCF_004801635.1_ASM480163v1/GCF_004801635.1_ASM480163v1_genomic.fna.gz
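
Since the original poster also asked for the translated protein files, the same URL pattern can be extended to them. The sketch below assumes that each assembly directory contains a file named after the directory with the suffix _protein.faa.gz, in the same way as _genomic.fna.gz above; assemblies without annotation may not have one. The function name asm_file_url and the output file list_faa.txt are hypothetical.

```shell
#!/usr/bin/env bash
# Read assembly directory URLs on stdin and print, for each one, the URL of
# the file "<directory name>_<suffix>" inside that directory.
# asm_file_url is a hypothetical helper name.
asm_file_url() {
  local suffix="$1" url
  while IFS= read -r url; do
    [ -n "$url" ] || continue          # skip blank lines
    url="${url%/}"                     # drop one trailing slash, if any
    printf '%s/%s_%s\n' "$url" "${url##*/}" "$suffix"
  done
}

# Offline example with one directory URL taken from list_asm.txt:
printf '%s\n' \
  'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/794/035/GCF_004794035.2_ASM479403v2' \
  | asm_file_url protein.faa.gz

# With network access (left commented out):
# asm_file_url protein.faa.gz < list_asm.txt > list_faa.txt
# wget -i list_faa.txt
```

Passing the suffix as an argument means the same helper can also rebuild list_fna.txt by calling it with genomic.fna.gz.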

score 1 · Answer 3 · 2019-06-21

4.8 years ago

GenoMax 141k

Since OP wants to download the genome sequence along with the protein sequence, here is another option.

Go to the Assembly links page for this BioProject.
Change the "Items per page" setting to 50.
Select all assemblies by ticking their check boxes.
Hit the Download Assemblies button.
Select the source database and then choose the file type (you probably want Protein FASTA and Genomic FASTA, so two separate downloads).
Download.