Extract protein files from nucleotide accession numbers
4.1 years ago

Hey everyone,

I would like to extract all proteins, using a list of nucleotide accession numbers as input. For example, considering the list.txt with the following accessions:

NC_008803.1
NCVQ01000001.1
NC_039364.1
NC_005101.4

I would like to extract the protein files for all coding sequences within each one of these nucleotide accession numbers.

Thanks!

sequence • 1.3k views

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.


If you get the assembly accession number and download the protein.faa, and extract the relevant ones, would it work for you?

For example:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/295/GCF_000002295.2_MonDom5/

GCF_000002295.2_MonDom5_protein.faa.gz

4.1 years ago
vkkodali_ncbi ★ 3.7k

You can use Entrez Direct for this as shown below:

$ cat accs.txt
NC_008803.1
NCVQ01000001.1
NC_039364.1
NC_005101.4

$ epost -db nuccore -input accs.txt -format acc \
| elink -target protein \
| efetch -format fasta 
>NP_001008767.1 thioredoxin-interacting protein [Rattus norvegicus]
MVMFKKIKSFEVVFNDPEKVYGSGEKVAGRVTVEVCEVTRVKAVRILACGVAKVLWMQGSQQCKQTLDYL

However, for the four accessions in your list, nearly 24000 proteins are returned. Downloading that many proteins using efetch can quickly become a time-consuming process. If you are doing this for entire chromosomes, you may be better off with the following three-step approach:

  1. use efetch with the parameter -format acc to download a list of protein accessions
  2. download the entire protein datasets for the organisms of interest from NCBI FTP
  3. use a different program such as seqkit to extract the specific protein accessions of interest
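
For what it's worth, here is a rough sketch of those three steps as shell functions. The function names and file names are made up for illustration, the FTP URL is left for you to supply, and the awk fallback is my own stand-in for when seqkit is not installed (it matches on the first word of each FASTA header, which is what `seqkit grep -f` does by default).

```shell
#!/usr/bin/env bash
# Hypothetical helper names; adapt to your own setup.

# Step 1: list the protein accessions linked to the nucleotide accessions
# in the file "$1" (requires Entrez Direct and network access).
list_protein_accs() {
    epost -db nuccore -input "$1" -format acc \
        | elink -target protein \
        | efetch -format acc
}

# Step 2: download a protein FASTA from NCBI FTP.
# "$1" is the URL of a *_protein.faa.gz file (you supply it), "$2" the output path.
download_protein_faa() {
    curl -fsSL -o "$2" "$1"
}

# Step 3: keep only the FASTA records whose ID appears in the list "$1".
# "$2" is the protein FASTA, plain or gzipped.
extract_proteins() {
    local ids=$1 faa=$2
    if command -v seqkit >/dev/null 2>&1; then
        seqkit grep -f "$ids" "$faa"
    else
        # Fallback (assumption: no seqkit available): filter with awk.
        gzip -cdf "$faa" | awk '
            NR == FNR { if (NF) want[$1]; next }              # load wanted IDs
            /^>/      { keep = (substr($1, 2) in want) }      # test each header
            keep' "$ids" -
    fi
}
```

Usage would be something like `list_protein_accs accs.txt > prot_accs.txt`, then `download_protein_faa <URL> protein.faa.gz`, then `extract_proteins prot_accs.txt protein.faa.gz > subset.faa`.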

Thanks for the reply. Your solution works well, but it outputs tons of proteins with no link to the chromosome. I would like to link the extracted proteins to the chromosome. Any solution?


Which solution are you talking about? The one using efetch or the one where you download from the FTP path?

If you download the entire protein.faa.gz file(s) from FTP, there is another file ending in feature_table.txt.gz in the same path. It should have information about which chromosome each protein is annotated on.
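
A minimal sketch of pulling that mapping out of the feature table, assuming the file is tab-delimited with a header line starting with `# feature` and columns named `feature`, `genomic_accession` and `product_accession` (that is the layout as I remember it, so check the header of your own file first). The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Print "<genomic_accession> TAB <product_accession>" for every CDS row
# of a feature table ("$1", plain or gzipped).
# Assumption: column names are taken from the header line, not hard-coded positions.
protein_to_chromosome() {
    gzip -cdf "$1" | awk -F'\t' '
        NR == 1 {
            sub(/^# */, "", $1)                      # strip the leading "# "
            for (i = 1; i <= NF; i++) col[$i] = i    # map column name -> index
            next
        }
        $(col["feature"]) == "CDS" && $(col["product_accession"]) != "" {
            print $(col["genomic_accession"]) "\t" $(col["product_accession"])
        }'
}
```

For example, `protein_to_chromosome GCF_000002295.2_MonDom5_feature_table.txt.gz > chrom2prot.tsv` (file name assumed from the FTP path mentioned earlier in the thread).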

If you want to do this using the esearch/efetch method, then you'd have to skip the epost step and run the query for each accession in a bash loop, as shown below:

for acc in `cat accs.txt`; do 
esearch -db nuccore -query ${acc} \
    | elink -target protein \
    | efetch -format acc \
    | sed "s/^/${acc}\t/g" ; 
done

This will produce a tab-delimited file with <chromosome> <tab> <protein_acc> fields.
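
If the end goal is one protein file per chromosome, a table in that two-column layout can be split into one ID list per nucleotide accession, roughly like this. The function name, the `.ids` suffix and the file names are hypothetical, and the last step assumes you already have the protein FASTA and seqkit.

```shell
#!/usr/bin/env bash
# Split a "<chromosome> TAB <protein_acc>" table ("$1") into one ID list
# per chromosome, written to "$2"/<chromosome>.ids
split_by_chromosome() {
    local map=$1 outdir=$2
    mkdir -p "$outdir"
    awk -F'\t' -v dir="$outdir" 'NF >= 2 { print $2 > (dir "/" $1 ".ids") }' "$map"
}

# Afterwards, each list can drive one extraction, e.g. (not run here):
#   for f in ids/*.ids; do
#       seqkit grep -f "$f" protein.faa.gz > "${f%.ids}.faa"
#   done
```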
