Downloading Reference Proteomes Using OTU identifiers
2.3 years ago

Hi,

I am trying to download reference proteomes for a list of amplicon OTUs in order to generate pseudo-metagenomes. I have tried a couple of tools, ncbi_datasets and edirect, but I have not managed to download only the reference proteome sequences. I will eventually need to run this in parallel to extract and piece together many FASTA files.

With ncbi_datasets (installed via conda), how would I filter the download so that I only obtain the protein.faa file? Is there an --exclude option that covers this, or some other way to obtain only the protein sequences?

Secondly, is RefSeq (--refseq) the correct database to search, and how does it compare to UniParc? Would it be better to obtain the taxids with ncbi_datasets and then download the proteomes through the UniProt API? I do not know how to proceed. Below is my code and output.


Here is the example datasets code I have been using, starting with the proteome(s) for a single OTU:

datasets download genome taxon "Bacteroides thetaiotaomicron" --exclude-gff3 --exclude-rna --exclude-seq --exclude-genomic-cds --refseq --reference --dehydrated --filename Bacteriodesthetaiomicron.zip
unzip Bacteriodesthetaiomicron.zip -d Bacteriodesthetaiomicron
datasets rehydrate --directory Bacteriodesthetaiomicron

Here is the result:

Found 2 files for rehydration
Completed 2 of 2 [================================================] 100%
Downloading: Bacteriodesthetaiomicron/ncbi_dataset/data/GCF_014131755.1/protein.faa    2.23MB done
Downloading: Bacteriodesthetaiomicron/ncbi_dataset/data/GCF_014131755.1/sequence_report.jsonl    424B done

(ncbi_datasets) % ls

Bacteriodesthetaiomicron/     Bacteriodesthetaiomicron.zip

(ncbi_datasets) % cd Bacteriodesthetaiomicron/

(ncbi_datasets)  Bacteriodesthetaiomicron % ls

README.md     ncbi_dataset/

(ncbi_datasets) Bacteriodesthetaiomicron % cd ncbi_dataset 

(ncbi_datasets) ncbi_dataset % ls

data/      fetch.txt

(ncbi_datasets) ncbi_dataset % cd data 

(ncbi_datasets) data % ls

GCF_014131755.1/            assembly_data_report.jsonl  dataset_catalog.json

(ncbi_datasets) data % cd GCF_014131755.1 

(ncbi_datasets) GCF_014131755.1 % ls

protein.faa            sequence_report.jsonl
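
For a list of OTUs, the three commands above could be wrapped in a small function and called once per taxon name. This is only a sketch: the flags are copied unchanged from my command above (other releases of datasets may spell these options differently), and the function name `download_proteome`, the underscore naming scheme, and the input file `taxa.txt` are hypothetical.

```shell
#!/bin/sh
# Sketch: run the download / unzip / rehydrate steps for one taxon name.
# Flags are copied from the command in this post, not checked against
# other datasets releases.
download_proteome() {
    local taxon="$1"
    local name
    # Hypothetical naming scheme: spaces in the taxon name become underscores.
    name="$(printf '%s' "$taxon" | tr ' ' '_')"

    datasets download genome taxon "$taxon" \
        --exclude-gff3 --exclude-rna --exclude-seq --exclude-genomic-cds \
        --refseq --reference --dehydrated --filename "${name}.zip" &&
    unzip -o "${name}.zip" -d "$name" &&
    datasets rehydrate --directory "$name"
}

# Usage, assuming a hypothetical taxa.txt with one taxon name per line:
#   while IFS= read -r taxon; do download_proteome "$taxon"; done < taxa.txt
```

Each step only runs if the previous one succeeded, so a failed download does not leave a half-rehydrated folder behind.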
2.3 years ago
GenoMax 142k

how would I filter to only obtain the protein.faa file?

Isn't that what you obtained above? The folder hierarchy that the data is downloaded into is part of how datasets works.
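
Because the layout is predictable (ncbi_dataset/data/<accession>/protein.faa, as in your listing), the protein files can be picked up with a glob instead of cd-ing through each level. Below is a rough sketch that merges them into one FASTA file and tags each header with its accession. The function name `merge_proteins` and the `accession|` header prefix are my own assumptions, not something datasets provides.

```shell
#!/bin/sh
# Sketch: merge every protein.faa under a rehydrated datasets folder
# into a single FASTA file, prefixing headers with the accession.
merge_proteins() {
    local src_dir="$1"     # folder created by unzip + rehydrate
    local out_fasta="$2"   # combined output file (hypothetical name)
    local faa accession

    : > "$out_fasta"       # start from an empty output file
    for faa in "$src_dir"/ncbi_dataset/data/*/protein.faa; do
        [ -e "$faa" ] || continue   # glob matched nothing
        # The accession is the name of the directory holding protein.faa.
        accession="$(basename "$(dirname "$faa")")"
        # Rewrite ">ID ..." as ">accession|ID ..." so sequences stay traceable.
        awk -v acc="$accession" \
            '/^>/ { sub(/^>/, ">" acc "|") } { print }' "$faa" >> "$out_fasta"
    done
}
```

With the download from the question this would be called as `merge_proteins Bacteriodesthetaiomicron all_proteins.faa`, where `all_proteins.faa` is just an example output name.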

Because the datasets tool saves every protein file under the generic name protein.faa, you will want to rename the files using the instructions in this thread: NCBI datasets bulk protein fasta download
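
As one possible way to do the renaming (a sketch of my own, not necessarily the approach in the linked thread), each protein.faa can be copied into a single folder and named after the accession directory it came from. The function name `collect_proteins` and the destination folder are hypothetical.

```shell
#!/bin/sh
# Sketch: copy every protein.faa found under a datasets download into
# one folder, naming each copy <accession>.faa.
collect_proteins() {
    local src_dir="$1"   # top-level folder created by unzip + rehydrate
    local out_dir="$2"   # destination folder (hypothetical name)

    mkdir -p "$out_dir"
    # Each file lives at .../ncbi_dataset/data/<accession>/protein.faa,
    # so the parent directory name is the accession.
    find "$src_dir" -type f -name protein.faa | while IFS= read -r faa; do
        accession="$(basename "$(dirname "$faa")")"
        cp "$faa" "$out_dir/${accession}.faa"
    done
}
```

For the example in the question, `collect_proteins Bacteriodesthetaiomicron proteomes` would leave you with proteomes/GCF_014131755.1.faa.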

The thread NCBI datasets bulk protein fasta download also lists alternative tools, in case you would rather not use NCBI datasets.
