Question: Getting feature annotations for all NCBI RefSeq sequences
6 days ago by
bitpir70
bitpir70 wrote:

Hi, is there a way to download all feature annotations (gene, CDS, rRNA, tRNA, ncRNA, repeat_region) for all the RefSeq sequences (ftp://ftp.ncbi.nih.gov/refseq/release/) from NCBI? I can't find this anywhere on the web or on the NCBI site. Something like GFF for GenBank sequences would be great. Thanks!

cds refseq annotation ncbi • 103 views
6 days ago by
genomax47k
United States
genomax47k wrote:

Get the assembly summary file for NCBI RefSeq genomes.

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

Grab the FTP path (and/or any other fields you need) from this file for each genome. The FTP path is column 20; the lines starting with # are header lines and are skipped here.

awk -F '\t' '!/^#/ {print $20}' assembly_summary_refseq.txt > ftp_paths

You should get something like this:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/545/GCF_000599545.1_ASM59954v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/565/GCF_000599565.1_TruePyoMS2391.0
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/605/GCF_000599605.1_HJM029

From each FTP path directory you should be able to get the *genomic.gff.gz file for that genome.
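
As a rough sketch of that last step, the loop below turns each directory in ftp_paths into a full GFF URL. It assumes the GFF file is named after the last component of the directory path plus _genomic.gff.gz, which is the pattern the *genomic.gff.gz glob is meant to match. gff_url and gff_urls are names made up here for illustration.

```shell
# Sketch: build one GFF URL per assembly directory listed in ftp_paths.
# Assumption: the file is named <last path component>_genomic.gff.gz.
# gff_url and gff_urls are illustrative names, not NCBI conventions.
gff_url() {
    dir=${1%/}                      # drop a trailing slash, if any
    printf '%s/%s_genomic.gff.gz\n' "$dir" "${dir##*/}"
}

if [ -f ftp_paths ]; then
    while IFS= read -r dir; do
        case $dir in
            # Only real URLs; this skips header text and "na" entries.
            ftp://*|https://*) gff_url "$dir" ;;
        esac
    done < ftp_paths > gff_urls
fi

# Then fetch everything (a large download for all of RefSeq):
# wget -i gff_urls
```

Keeping the URL list in its own file makes it easy to check a few entries by hand, or to filter the list down to a subset of genomes, before starting the download.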


Awesome! Thank you so much for your answer, this is super helpful! Are the sequences from the RefSeq release directory all represented in assembly_summary_refseq.txt? And for viruses, e.g. https://www.ncbi.nlm.nih.gov/nuccore/LC340031.1, is there a similar assembly_summary file from which I can get all the annotations? Is there also an NC_-to-assembly mapping file somewhere? I can see the assembly when I search for the NC_ number, but I can never find the file list easily... Thanks a lot for your help!

written 5 days ago by bitpir70

There is a similar summary file for viruses. You can find that here.

What exactly do you mean by NC_ to assembly type? Can you give an example?

written 5 days ago by genomax47k

Great, thank you! My virus db includes the "View all RefSeq and Neighbor nucleotide records" set; will all of those viruses be captured in the summary file? By NC_ to assembly I mean mapping each NC_ accession number to its assembly accession number (e.g. NC_011750.1 --> GCF_000026345.1). I have downloaded the RefSeq genomic sequences, and I'm working backwards to see which NC_ accession is associated with which assembly accession, so that I can get the respective annotation.

written 5 days ago by bitpir70

Ah! Just found the answer to my question, for the second part at least :)

written 5 days ago by bitpir70
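
For the NC_ to assembly mapping, one possible approach (a sketch, not necessarily the answer found above) is to read the sequence IDs straight out of each downloaded *genomic.gff.gz file. In GFF3 the first column of every feature line is the sequence ID, and the file name is assumed here to start with the assembly accession (GCF_ followed by the number and version). seq_to_assembly and seq2assembly.tsv are hypothetical names.

```shell
# Sketch: print "sequence_accession<TAB>assembly_accession" for one GFF file.
# Assumptions: the file name begins with the assembly accession, e.g.
# GCF_000026345.1_..._genomic.gff.gz, and column 1 of each feature line is
# the sequence accession (e.g. NC_011750.1).
seq_to_assembly() {
    gff=$1
    name=${gff##*/}                                   # strip the directory part
    asm=$(printf '%s\n' "$name" | cut -d_ -f1,2)      # "GCF" + "_" + number.version
    # Skip comment/directive lines, keep each sequence ID once.
    gzip -dc "$gff" |
        awk -F '\t' -v asm="$asm" '!/^#/ && NF >= 9 && !seen[$1]++ {print $1 "\t" asm}'
}

# Run over every GFF in the current directory, if there are any.
for f in *_genomic.gff.gz; do
    [ -e "$f" ] || continue
    seq_to_assembly "$f"
done > seq2assembly.tsv
```

The resulting table can then be searched with grep for a given NC_ accession to find the assembly, and therefore the annotation file, it belongs to.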