How to retrieve a protein FASTA file for a genome from NCBI
23 months ago
beginner123 ▴ 30

Hi all, I'm trying to get a FASTA file of the CDSs (with protein sequences) for a genome, but I don't know which data to use. What is the difference between the data obtained from 1 and 2? I want to use RefSeq data.

Thank you in advance.

NCBI • 1.3k views

Which genome? What are 1 and 2? Please provide an example accession number.


[Image: screenshot]

Sorry, I forgot to attach the image.

23 months ago
GenoMax 142k

There are multiple ways to do this:

  • Using EntrezDirect

If you want the protein sequences:

$ esearch -db nuccore -query NC_018025 | elink -target protein | efetch -format fasta > protein.fa


  • Using the NCBI FTP site for this genome (you can see the link by hovering over the genome or protein links in the screenshot you posted above)

Visit: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/266/945/GCF_000266945.1_ASM26694v1/

Get Nucleotide CDS: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/266/945/GCF_000266945.1_ASM26694v1/GCF_000266945.1_ASM26694v1_cds_from_genomic.fna.gz

Get Protein Sequence: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/266/945/GCF_000266945.1_ASM26694v1/GCF_000266945.1_ASM26694v1_protein.faa.gz


  • NCBI Datasets tool

You can also use the NCBI Datasets tool. Direct link for this organism: https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/706587/

To search for any other genome, start at: https://www.ncbi.nlm.nih.gov/datasets/
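The FTP links in the list above follow a regular pattern, so they can be built from the assembly accession and assembly name. The following is a minimal sketch, not an official NCBI tool: the function names `assembly_dir_url` and `assembly_file_url` are made up for illustration, and the sketch assumes the accession has nine digits and that the assembly name contains no spaces or other characters that NCBI rewrites in directory names.

```shell
#!/bin/sh
# Hypothetical helper: build the NCBI genomes FTP directory URL for an assembly.
# $1 = assembly accession (e.g. GCF_000266945.1), $2 = assembly name (e.g. ASM26694v1)
assembly_dir_url() {
    acc=$1
    name=$2
    prefix=${acc%%_*}                 # GCF (RefSeq) or GCA (GenBank)
    digits=${acc#*_}                  # 000266945.1
    digits=${digits%%.*}              # 000266945
    # The nine digits are split into three directory levels of three digits each.
    d1=$(printf '%s' "$digits" | cut -c1-3)
    d2=$(printf '%s' "$digits" | cut -c4-6)
    d3=$(printf '%s' "$digits" | cut -c7-9)
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s/\n' \
        "$prefix" "$d1" "$d2" "$d3" "$acc" "$name"
}

# Hypothetical helper: URL of one file in that directory.
# $3 = file suffix, e.g. protein.faa.gz or cds_from_genomic.fna.gz
assembly_file_url() {
    printf '%s%s_%s_%s\n' "$(assembly_dir_url "$1" "$2")" "$1" "$2" "$3"
}
```

For the genome in this thread, `curl -O "$(assembly_file_url GCF_000266945.1 ASM26694v1 protein.faa.gz)"` should fetch the protein file linked above, and the suffix `cds_from_genomic.fna.gz` gives the nucleotide CDS file.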


Thank you very much. If I download the .faa.gz file for this genome from the NCBI FTP site, does it contain the protein sequences from both NC_018025.1 and NC_018026.1?


Looks like it.

$ zgrep "^>" GCF_000266945.1_ASM26694v1_protein.faa.gz | wc -l
5199

Using EntrezDirect on the individual accessions gives counts that add up to the same total (5176 + 23 = 5199):

$ esearch -db nuccore -query NC_018025 | elink -target protein | efetch -format fasta | grep ">" | wc -l
    5176

$ esearch -db nuccore -query NC_018026 | elink -target protein | efetch -format fasta | grep ">" | wc -l
      23
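
As a possible cross-check without EntrezDirect, the nucleotide CDS file from the same FTP directory can be tallied per replicon, because its headers carry the replicon accession. This is only a sketch: the function name `cds_per_replicon` is made up, and it assumes headers of the form `>lcl|<replicon>_cds_<protein>_<n> ...`. CDS counts need not equal the protein counts above (for example, pseudogenes and identical proteins encoded at several loci are handled differently), so treat the numbers as a sanity check only.

```shell
#!/bin/sh
# Hypothetical helper: count CDS records per replicon in a gzipped
# *_cds_from_genomic.fna.gz file.
# Assumption: each header starts with ">lcl|<replicon>_cds_..."
cds_per_replicon() {
    gzip -dc "$1" | awk '
        /^>/ {
            id = substr($1, 2)          # drop the leading ">"
            sub(/^lcl[|]/, "", id)      # drop the "lcl|" prefix
            sub(/_cds_.*/, "", id)      # keep only the replicon accession
            n[id]++
        }
        END { for (r in n) print r "\t" n[r] }
    ' | sort
}
```

Usage would be `cds_per_replicon GCF_000266945.1_ASM26694v1_cds_from_genomic.fna.gz`, which prints one tab-separated line per replicon.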

Thank you so much! Now I understand how it works.

23 months ago
Mensur Dlakic ★ 27k

You already found some files, but here is a general approach to finding genomic assemblies. From the main NCBI site:

https://ncbi.nlm.nih.gov/

Type the organism of interest in the search box, then select Assembly in the search results. In your case there are 6 assemblies, so pick whichever suits you best. I am going with Desulfomonile tiedjei DSM 6799.

https://www.ncbi.nlm.nih.gov/assembly/?term=desulfomonile%20tiedjei

On the right-hand side, click on "FTP for RefSeq assembly". From that directory you want the file that ends in _protein.faa.gz, which holds the protein translations.

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/266/945/GCF_000266945.1_ASM26694v1/GCF_000266945.1_ASM26694v1_protein.faa.gz

The README file explains what the others are.
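
The same lookup can probably be scripted with EntrezDirect instead of clicking through the website. The sketch below assumes EDirect is installed and that the assembly summary record exposes the `AssemblyAccession` and `FtpPath_RefSeq` fields. The function names `refseq_ftp_paths` and `protein_faa_url` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list accession and RefSeq FTP directory for every
# assembly matching a query. Needs EDirect and network access.
refseq_ftp_paths() {
    esearch -db assembly -query "$1" |
        esummary |
        xtract -pattern DocumentSummary -element AssemblyAccession FtpPath_RefSeq
}

# Hypothetical helper: turn an FTP directory path into the URL of its
# protein FASTA file. The file name repeats the directory name.
protein_faa_url() {
    dir=${1%/}            # strip a trailing slash, if any
    base=${dir##*/}       # e.g. GCF_000266945.1_ASM26694v1
    printf '%s/%s_protein.faa.gz\n' "$dir" "$base" | sed 's|^ftp://|https://|'
}
```

For example, `refseq_ftp_paths "Desulfomonile tiedjei"` should list the candidate assemblies, and passing the chosen directory path to `protein_faa_url` gives a link of the same form as the one above.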
