NCBI datasets bulk protein fasta download
2.4 years ago
emawhitt ▴ 20

Hi,

I want to download protein fasta files for a set of bird species. I have the genome assembly accessions in a file. I feel like every time I need to bulk download fasta files I've forgotten how I did it last time and the databases have all changed their websites/interfaces. I used the NCBI databases command line to download the files. However, datasets gives each accession its own folder containing "protein.faa". What I want is a single folder with fasta files so I can then use this in Orthofinder and other programmes. It's essentially useless to have a few hundred folders containing a file with the same name. Does anyone know the best way to download these files (and a way that will remain the best way and I can use it again in the future) or figure out how to use the downloads from datasets? Thank you.

NCBI datasets • 2.5k views

Assuming you are on Linux, if all your downloaded directories are in /home/emawhitt/fasdls then simply run:

mv /home/emawhitt/fasdls/*/*.faa /home/emawhitt/fasdls

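This will move all the .faa files to /home/emawhitt/fasdls. Be aware that if every file has the same name (as with protein.faa), each move overwrites the previous one and only the last file survives, so the files need to be renamed as they are moved.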


Thank you. Is there some way to rename the files too? All the files are called protein.faa.


This moves and renames the files in one step (it might be a bit slow if you have many files). Replace /home/emawhitt/fasdls, the value currently assigned to MYPATH, with the path to the directory that contains the sub-directories holding the protein.faa files.

MYPATH="/home/emawhitt/fasdls"; cd "${MYPATH}" && find . -mindepth 2 -maxdepth 2 -type f -name "*.faa" -exec sh -c 'DIR=$(basename "$(dirname "$1")"); mv "$1" "./${DIR}_protein.faa"' _ {} \;

MirianT_NCBI's solution below might be a bit faster, though.
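
Before moving a few hundred files, it can help to preview the renames first. The following is only a sketch, assuming the layout described in this thread (one sub-directory per accession, each holding a protein.faa); preview_renames is a hypothetical helper name, not part of any NCBI tool. It prints each source and the name it would get, and changes nothing on disk.

```shell
# preview_renames (hypothetical helper): list the renames without doing them.
# Assumes TOP/<sub-directory>/protein.faa, as described in this thread.
preview_renames() {
    top=$1
    for f in "$top"/*/protein.faa; do
        # If the glob matched nothing, $f is the literal pattern: skip it.
        [ -e "$f" ] || continue
        # The sub-directory name becomes the prefix of the new file name.
        sub=$(basename "$(dirname "$f")")
        printf '%s -> %s\n' "$f" "$top/${sub}_protein.faa"
    done
}
```

Called as preview_renames /home/emawhitt/fasdls, it prints one "source -> target" line per file, which makes it easy to spot an unexpected path before running the real move.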

2.4 years ago
MirianT_NCBI ▴ 720

From the ncbi_dataset folder, you can run this one-liner:

mkdir -p proteins; for f in data/*/protein.faa; do out=$(echo "$f" | cut -f2 -d'/'); cp "$f" "proteins/${out}.faa"; done

This command will create a folder proteins, and copy each protein.faa file to the folder proteins while renaming them with the respective genome accession number. Let me know if that's helpful. If you need something different, I'll be happy to help with that too.
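
For repeated use, the same idea can be wrapped in a small function. This is a minimal sketch, assuming the data/<accession>/protein.faa layout described above; collect_proteins is a hypothetical name chosen for illustration. It copies rather than moves, skips accession folders that have no protein.faa, and reports how many files were copied so the count can be compared against the accession list.

```shell
# collect_proteins (hypothetical helper): copy SRC/<accession>/protein.faa
# to DEST/<accession>.faa. The originals are left in place.
collect_proteins() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    n=0
    for f in "$src"/*/protein.faa; do
        # Skip the literal pattern when no file matches.
        [ -e "$f" ] || continue
        # The folder name (the accession) becomes the output file name.
        acc=$(basename "$(dirname "$f")")
        cp "$f" "$dest/$acc.faa" || return 1
        n=$((n + 1))
    done
    echo "copied $n file(s) to $dest"
}
```

From the ncbi_dataset folder, this would be run as collect_proteins data proteins. If the reported count is lower than the number of accessions requested, some folders did not contain a protein.faa file.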


This worked perfectly. Thank you!

2.4 years ago
GenoMax 141k

Not sure what "databases command line" you used, but consider downloading the data with the ncbi-genome-download tool (LINK) or genome_updater (LINK). This should avoid the "all files named protein.faa" issue.


Thank you, I will take a look at those links. I used the NCBI datasets command line tool (LINK).
