Hi all, I'm trying to get a FASTA file of the CDSs (with protein sequences) for a genome, but I don't know which data to use. What is the difference between the data obtained from 1 and 2? I want to use RefSeq data.
Thank you in advance.
There are multiple ways to do this.
If you want the protein sequences:
$ esearch -db nuccore -query NC_018025 | elink -target protein | efetch -format fasta > protein.fa
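If you want the CDS features themselves rather than the linked protein records, efetch can also return the coding regions annotated on the nucleotide record. A minimal sketch, assuming EntrezDirect is installed and that the `fasta_cds_na` and `fasta_cds_aa` formats are available for nuccore; the function names and output file names are only illustrative:

```shell
# Nucleotide sequences of the CDS features annotated on a nuccore record.
# $1: nucleotide accession, e.g. NC_018025
cds_nucleotide_fasta() {
    efetch -db nuccore -id "$1" -format fasta_cds_na
}

# Protein translations of the same CDS features.
cds_protein_fasta() {
    efetch -db nuccore -id "$1" -format fasta_cds_aa
}

# Example calls (need network access), left commented out:
# cds_nucleotide_fasta NC_018025 > NC_018025_cds.fna
# cds_protein_fasta NC_018025 > NC_018025_cds.faa
```

Repeat the call for each accession in the assembly if you want every replicon covered.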
Alternatively, go to the FTP directory for the assembly (via the genome or protein links in the screenshot you posted above). Visit: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/266/945/GCF_000266945.1_ASM26694v1/
Get Nucleotide CDS: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/266/945/GCF_000266945.1_ASM26694v1/GCF_000266945.1_ASM26694v1_cds_from_genomic.fna.gz
Get Protein Sequence: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/266/945/GCF_000266945.1_ASM26694v1/GCF_000266945.1_ASM26694v1_protein.faa.gz
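The two links above follow the same pattern: the nine digits of the assembly accession are split into three directories, and each file name is the assembly directory name plus a suffix. A small sketch of that pattern, inferred from the URLs in this post; the helper name `refseq_url` is hypothetical, and it assumes the directory name starts with a three-letter prefix, an underscore, and nine digits:

```shell
# Build the download URL for one file in an NCBI assembly directory.
# $1: assembly directory name, e.g. GCF_000266945.1_ASM26694v1
# $2: file suffix, e.g. protein.faa.gz or cds_from_genomic.fna.gz
refseq_url() {
    asm="$1"
    suffix="$2"
    prefix=$(printf '%s' "$asm" | cut -c1-3)   # GCF
    d1=$(printf '%s' "$asm" | cut -c5-7)       # first three digits
    d2=$(printf '%s' "$asm" | cut -c8-10)      # middle three digits
    d3=$(printf '%s' "$asm" | cut -c11-13)     # last three digits
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s/%s_%s\n' \
        "$prefix" "$d1" "$d2" "$d3" "$asm" "$asm" "$suffix"
}

# Example download (needs network access), left commented out:
# curl -O "$(refseq_url GCF_000266945.1_ASM26694v1 cds_from_genomic.fna.gz)"
# curl -O "$(refseq_url GCF_000266945.1_ASM26694v1 protein.faa.gz)"
```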
You can also use the NCBI Datasets tool: https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/706587/
Search for your genome at https://www.ncbi.nlm.nih.gov/datasets/ (I provided a direct link above).
Looks like it.
$ zgrep "^>" GCF_000266945.1_ASM26694v1_protein.faa.gz | wc -l
5199
Using EntrezDirect on the individual accessions yields:
$ esearch -db nuccore -query NC_018025 | elink -target protein | efetch -format fasta | grep ">" | wc -l
5176
$ esearch -db nuccore -query NC_018026 | elink -target protein | efetch -format fasta | grep ">" | wc -l
23
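The two per-accession counts add up to the number of records in the assembly's protein file. A minimal sketch of that check; the function name `count_fasta_records` is illustrative, and `gzip -dc` is used in place of zgrep only as a portable equivalent:

```shell
# Count FASTA records (header lines starting with ">") in a gzipped file.
# $1: path to a .faa.gz or .fna.gz file
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>'
}

# Counts reported above for the two accessions.
nc_018025=5176
nc_018026=23

# Their sum should equal the count from the assembly protein file (5199).
echo "$((nc_018025 + nc_018026))"

# Example on the downloaded file, left commented out:
# count_fasta_records GCF_000266945.1_ASM26694v1_protein.faa.gz
```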
You already found some files, but here is a general approach to finding genomic assemblies. From the main NCBI site:
In the search box we type the organism of interest, and in search results we select Assembly. In your case there are 6 of them, so you pick whichever you like best. I am going with Desulfomonile tiedjei DSM 6799.
https://www.ncbi.nlm.nih.gov/assembly/?term=desulfomonile%20tiedjei
On the right-hand side, click on "FTP for RefSeq assembly". From that directory you want the file that ends in .faa.gz for the protein translations.
The README file explains what the others are.
Which genome? What are 1 and 2? Please provide an example accession number.
Sorry, I forgot to attach the image.