Question

how to retrieve protein.fasta from genome in NCBI


23 months ago

beginner123 ▴ 30

Hi all, I'm trying to get a FASTA file of the CDSs (with protein sequences) for a genome, but I don't know which data to use. What is the difference between the data obtained from 1 and 2? I want to use RefSeq data.

Thank you in advance.

NCBI • 1.3k views



Which genome? What is 1 and 2? Please provide accession number example.

23 months ago by GenoMax 142k


enter image description here

Sorry, I forgot to attach the image.

23 months ago by beginner123 ▴ 30

score 1 · Accepted Answer · 2022-06-01


23 months ago

GenoMax 142k

There are multiple ways to do this:

Using EntrezDirect

If you want the protein sequences:

$ esearch -db nuccore -query NC_018025 | elink -target protein | efetch -format fasta > protein.fa

Using NCBI FTP site for this genome (which you can see by hovering over the genome or protein links in the screenshot you posted above)

Visit: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/266/945/GCF_000266945.1_ASM26694v1/

Get Nucleotide CDS: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/266/945/GCF_000266945.1_ASM26694v1/GCF_000266945.1_ASM26694v1_cds_from_genomic.fna.gz

Get Protein Sequence: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/266/945/GCF_000266945.1_ASM26694v1/GCF_000266945.1_ASM26694v1_protein.faa.gz
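
The FTP links above follow a regular layout, so the path can be derived from the assembly accession and the assembly name. Here is a minimal sketch, assuming that layout also holds for other assemblies. The function names `assembly_ftp_dir` and `assembly_file_url` are made up for illustration and are not part of any NCBI tool. Assembly names containing spaces or other special characters may not map onto the directory name this directly.

```shell
# Hypothetical helpers (not part of any NCBI tool).
# Assumes the layout seen in the links above:
#   genomes/all/<GCF|GCA>/<3 digits>/<3 digits>/<3 digits>/<accession>_<name>/

# Usage: assembly_ftp_dir GCF_000266945.1 ASM26694v1
# The body runs in a subshell, so its variables do not leak into the caller.
assembly_ftp_dir() (
    acc="$1"; name="$2"
    prefix="${acc%%_*}"      # e.g. "GCF"
    digits="${acc#*_}"       # e.g. "000266945.1"
    digits="${digits%%.*}"   # e.g. "000266945" (version suffix removed)
    # Split the nine digits into three directory levels: 000/266/945
    path=$(printf '%s\n' "$digits" | sed 's|^\(...\)\(...\)\(...\)$|\1/\2/\3|')
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s_%s\n' \
        "$prefix" "$path" "$acc" "$name"
)

# Usage: assembly_file_url GCF_000266945.1 ASM26694v1 protein.faa.gz
assembly_file_url() (
    printf '%s/%s_%s_%s\n' "$(assembly_ftp_dir "$1" "$2")" "$1" "$2" "$3"
)

# Example download (needs network access, so it is left commented out):
# curl -fLO "$(assembly_file_url GCF_000266945.1 ASM26694v1 protein.faa.gz)"
```

For the assembly in this thread, passing `protein.faa.gz` or `cds_from_genomic.fna.gz` as the third argument reproduces the two file links given above.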

NCBI Datasets tool

You can also use the NCBI Datasets tool. Direct link for this organism: https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/706587/

To search for other genomes, start at: https://www.ncbi.nlm.nih.gov/datasets/



Thank you very much. If I download the faa.gz file for this genome from the NCBI FTP site, does it contain the protein sequences from both NC_018025.1 and NC_018026.1?

23 months ago by beginner123 ▴ 30


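Looks like it. The header count for the whole file equals the sum of the two per-accession counts (5176 + 23 = 5199):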

$ zgrep "^>" GCF_000266945.1_ASM26694v1_protein.faa.gz | wc -l
5199

Using EntrezDirect on individual accessions yields:

$ esearch -db nuccore -query NC_018025 | elink -target protein | efetch -format fasta | grep ">" | wc -l
    5176

$ esearch -db nuccore -query NC_018026 | elink -target protein | efetch -format fasta | grep ">" | wc -l
      23
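
The same comparison can be scripted. This is a rough sketch, not part of the original commands: `count_fasta_records` and `counts_match` are hypothetical helper names, and it assumes a gzip-compressed FASTA file on disk. It matches `^>` rather than a bare `>` so that only header lines are counted.

```shell
# Hypothetical helpers for illustration only.

# Count FASTA records (lines starting with ">") in a gzip-compressed file.
# Usage: count_fasta_records GCF_000266945.1_ASM26694v1_protein.faa.gz
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>'
}

# Succeed if the first argument equals the sum of the remaining arguments.
# Runs in a subshell so the loop variables stay local.
# Usage: counts_match 5199 5176 23
counts_match() (
    expected="$1"; shift
    sum=0
    for n in "$@"; do
        sum=$((sum + n))
    done
    [ "$sum" -eq "$expected" ]
)

# Example with the numbers reported above:
# counts_match 5199 5176 23 && echo "per-accession counts add up to the file total"
```

Matching totals are consistent with the file covering both accessions, but a matching count alone does not prove that the sets of proteins are identical.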

23 months ago by GenoMax 142k


Thank you so much! Now I understand how it works.

23 months ago by beginner123 ▴ 30

score 1 · Accepted Answer · 2022-06-01

You already found some files, but here is a general approach to finding genomic assemblies. From the main NCBI site:

https://ncbi.nlm.nih.gov/

In the search box, type the organism of interest, and in the search results select Assembly. In your case there are 6 assemblies, so pick whichever suits you best. I am going with Desulfomonile tiedjei DSM 6799.

https://www.ncbi.nlm.nih.gov/assembly/?term=desulfomonile%20tiedjei

On the right-hand side, click "FTP for RefSeq assembly". From that directory you want the file that ends in .faa.gz, which holds the protein translations.

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/266/945/GCF_000266945.1_ASM26694v1/GCF_000266945.1_ASM26694v1_protein.faa.gz

The README file explains what the others are.