Question

Extract protein files from nucleotide accession numbers

4.1 years ago

genomes_and_MGEs ▴ 10

Hey everyone,

I would like to extract all proteins, using a list of nucleotide accession numbers as input. For example, given a file list.txt with the following accessions:

NC_008803.1
NCVQ01000001.1
NC_039364.1
NC_005101.4

I would like to extract the protein files for all coding sequences within each one of these nucleotide accession numbers.

Thanks!

sequence • 1.3k views

ADD COMMENT • link updated 4.1 years ago by vkkodali_ncbi ★ 3.7k • written 4.1 years ago by genomes_and_MGEs ▴ 10

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

ADD REPLY • link 4.1 years ago by Ram 43k

see Retrieve The Fasta Nucleic Sequences Of A List Of Ncbi Accession Number Of Proteins

ADD REPLY • link 4.1 years ago by Pierre Lindenbaum 161k

If you get the assembly accession number and download the protein.faa, and extract the relevant ones, would it work for you?

For example:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/295/GCF_000002295.2_MonDom5/

GCF_000002295.2_MonDom5_protein.faa.gz

ADD REPLY • link 4.1 years ago by Fatima ▴ 1000

score 0 · Answer 1 · 2020-03-27

4.1 years ago

vkkodali_ncbi ★ 3.7k

You can use Entrez Direct for this as shown below:

$ cat accs.txt
NC_008803.1
NCVQ01000001.1
NC_039364.1
NC_005101.4

$ epost -db nuccore -input accs.txt -format acc \
| elink -target protein \
| efetch -format fasta 
>NP_001008767.1 thioredoxin-interacting protein [Rattus norvegicus]
MVMFKKIKSFEVVFNDPEKVYGSGEKVAGRVTVEVCEVTRVKAVRILACGVAKVLWMQGSQQCKQTLDYL

However, for the four accessions in your list, nearly 24,000 proteins are returned. Downloading that many proteins using efetch can quickly become a time-consuming process. If you are doing this for entire chromosomes, you may be better off with the following three-step approach:

1. Use efetch with the parameter -format acc to download a list of protein accessions.
2. Download the entire protein datasets for the organisms of interest from the NCBI FTP site.
3. Use a different program such as seqkit to extract the specific protein accessions of interest.
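
As a rough sketch of the extraction step: seqkit can do this directly (`seqkit grep -f` takes a file of IDs), but if it is not installed, a plain awk filter does the same job. The function name `filter_fasta_by_ids` and the file names in the comments are placeholders I made up for illustration, not anything NCBI provides.

```shell
# filter_fasta_by_ids IDS_FILE FASTA_FILE
# Print only the FASTA records whose ID (the text between ">" and the first
# whitespace) is listed in IDS_FILE, one accession per line.
# This is an awk stand-in for `seqkit grep -f IDS_FILE FASTA_FILE`.
filter_fasta_by_ids() {
    awk '
        # First file: remember every accession we want to keep.
        FILENAME == ARGV[1] { if ($1 != "") wanted[$1] = 1; next }
        # Header line: decide whether this record is wanted.
        /^>/ { id = substr($1, 2); keep = (id in wanted) }
        # Print the header and sequence lines of wanted records.
        keep
    ' "$1" "$2"
}

# Usage sketch (file names are placeholders):
#   gunzip -c GCF_000002295.2_MonDom5_protein.faa.gz > protein.faa
#   filter_fasta_by_ids protein_accs.txt protein.faa > selected.faa
```

Here protein_accs.txt would be the list saved from the `efetch -format acc` step, and the FASTA is the decompressed protein.faa from the FTP directory.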

ADD COMMENT • link 4.1 years ago by vkkodali_ncbi ★ 3.7k

Thanks for the reply. Your solution works well, but it outputs tons of proteins with no link to the chromosome. I would like to link the extracted proteins to the chromosome. Any solution?

ADD REPLY • link 4.1 years ago by genomes_and_MGEs ▴ 10

Which solution are you talking about? The one using efetch or the one where you download from the FTP path?

If you download the entire protein.faa.gz file(s) from FTP, there is another file ending in feature_table.txt.gz in the same path. It should have information about which chromosome each protein is annotated on.
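
A minimal sketch of pulling that mapping out of the feature table, assuming the file is tab-delimited with a header line beginning `# feature` and columns named `genomic_accession` and `product_accession`, and that protein accessions sit on the rows whose feature is `CDS`. Check these assumptions against the header of your own file; the function name `chrom_protein_pairs` is just a placeholder.

```shell
# chrom_protein_pairs FEATURE_TABLE
# Print "<genomic_accession><TAB><product_accession>" for every CDS row that
# has a product accession. Columns are located by name in the header line,
# so no column positions are hard-coded.
chrom_protein_pairs() {
    awk -F '\t' -v OFS='\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                name = $i
                sub(/^# */, "", name)      # header starts with "# feature"
                col[name] = i
            }
            if (!("feature" in col) || !("genomic_accession" in col) || !("product_accession" in col)) {
                print "expected columns not found in header" > "/dev/stderr"
                exit 1
            }
            next
        }
        $col["feature"] == "CDS" && $col["product_accession"] != "" {
            print $col["genomic_accession"], $col["product_accession"]
        }
    ' "$1"
}

# Usage sketch (file name is a placeholder for the decompressed table):
#   chrom_protein_pairs feature_table.txt | sort -u > chrom2prot.tsv
```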

If you want to do this using the esearch/efetch method, you'd have to skip the epost step and run the query for each accession in a bash loop, as shown below:

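# Query one nucleotide accession at a time so each protein can be tagged with its source.
for acc in $(cat accs.txt); do
    esearch -db nuccore -query "${acc}" \
        | elink -target protein \
        | efetch -format acc \
        | awk -v acc="${acc}" '{ print acc "\t" $0 }'   # prefix each protein accession; portable, unlike \t in sed
done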

This will print tab-delimited output with <chromosome> <tab> <protein_acc> fields; redirect it to a file if you want to keep it.
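
As a quick sanity check on that two-column output, something like the following counts how many proteins were linked to each chromosome. It assumes the loop's output was saved to a file (chrom2prot.tsv is a placeholder name), and the function name is made up for illustration.

```shell
# count_proteins_per_chrom MAPPING_TSV
# Given "<chromosome><TAB><protein_acc>" lines, print
# "<chromosome><TAB><number of proteins>" for each chromosome.
count_proteins_per_chrom() {
    cut -f 1 "$1" \
        | sort \
        | uniq -c \
        | awk -v OFS='\t' '{ print $2, $1 }'   # uniq -c prints "count value"; swap the order
}

# Usage sketch:
#   count_proteins_per_chrom chrom2prot.tsv
```

If a chromosome from accs.txt is missing from the result, its elink step returned no proteins.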

ADD REPLY • link 4.1 years ago by vkkodali_ncbi ★ 3.7k