Question: Getting feature annotations for all NCBI RefSeq sequences
bitpir wrote:

Hi, I was wondering if there's a way to download the feature annotations (gene, CDS, rRNA, tRNA, ncRNA, repeat_region) for all of the RefSeq sequences (ftp://ftp.ncbi.nih.gov/refseq/release/) from NCBI? I can't seem to find them anywhere on the web or on the NCBI site. Something like the GFF files available for GenBank sequences would be great. Thanks!

cds refseq annotation ncbi
written 5 months ago by bitpir
genomax wrote:

Get the assembly summary file for NCBI RefSeq genomes.

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

Grab the FTP path (and/or any other fields you need) from this file for each genome. The FTP path is in column 20; skip the header lines, which start with "#".

awk -F '\t' '!/^#/ {print $20}' assembly_summary_refseq.txt > ftp_paths

You should get something like this:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/545/GCF_000599545.1_ASM59954v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/565/GCF_000599565.1_TruePyoMS2391.0
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/605/GCF_000599605.1_HJM029
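If you need more than the path, the same awk call can carry extra columns along. This is only a minimal sketch: it assumes that column 1 is the assembly accession, column 8 the organism name and column 20 the FTP path, that header lines start with "#", and that a missing path is written as "na". Check the header of your own copy of the file before relying on it. The function name summary_fields is made up for illustration.

```shell
# Print assembly accession, organism name and FTP path as a tab-separated table.
# Assumption: header lines start with '#', and columns 1, 8 and 20 hold
# these three fields. Rows whose path is "na" are skipped.
summary_fields() {
    awk -F '\t' -v OFS='\t' '!/^#/ && $20 != "na" { print $1, $8, $20 }' "$1"
}

# Usage (after downloading the summary file):
# summary_fields assembly_summary_refseq.txt > accession_organism_path.tsv
```

Keeping the accession next to the path makes it easier to match a downloaded file back to its genome later.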

From each of these FTP directories you should be able to get the *genomic.gff.gz file for that genome.
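The download step could look something like the sketch below. It assumes the annotation file is named after its directory, i.e. <directory name>_genomic.gff.gz, and that ftp_paths holds one path per line; the function names gff_url and fetch_gffs are invented for this example.

```shell
# Build the URL of the annotation file from one FTP directory path.
# Assumption: the file is named <directory name>_genomic.gff.gz.
gff_url() {
    dir=${1%/}    # drop a trailing slash, if any
    printf '%s/%s_genomic.gff.gz\n' "$dir" "${dir##*/}"
}

# Download the GFF for every path listed in a file (one path per line).
# wget -nc skips files that already exist locally.
fetch_gffs() {
    while IFS= read -r path; do
        [ -n "$path" ] || continue    # ignore blank lines
        wget -nc "$(gff_url "$path")"
    done < "$1"
}

# Usage: fetch_gffs ftp_paths
```

For the full list this means a very large number of downloads, so it may be worth trying it on the first few lines of ftp_paths first.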

written 5 months ago by genomax

Awesome! Thank you so much for your answer, this is super helpful! Are the sequences from refseq/release represented in assembly_summary_refseq.txt? And in the case of viruses, e.g. https://www.ncbi.nlm.nih.gov/nuccore/LC340031.1, is there a similar assembly_summary file where I can get all the annotations? Is there also an NC_-to-assembly type of file somewhere? I can see it when I search for the NC number, but I can never find the file list easily... Thanks a lot for your help!

written 5 months ago by bitpir

There is a similar summary file for viruses. You can find that here.

What exactly do you mean by NC_ to assembly type? Can you give an example?

written 5 months ago by genomax

Great, thank you! My virus database includes the records listed under "View all RefSeq and Neighbor nucleotide records"; will all of those viruses be captured in the summary file? As for the NC_ question, I was thinking of mapping each NC_ accession number to its assembly accession number (e.g. NC_011750.1 --> GCF_000026345.1). I have downloaded the RefSeq genomic sequences, and I'm working backwards to see which NC_ accession is associated with which assembly accession, so that I can get the respective annotation.

written 5 months ago by bitpir

Ah! Just found the answer to my question, for the second part at least :)

written 5 months ago by bitpir
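The thread does not say which solution was found for the NC_-to-assembly question. One possible way, if the *genomic.gff.gz files have already been downloaded, is to read the mapping out of the files themselves. The sketch below is not necessarily what the poster used: it assumes each file name starts with the assembly accession (GCF_ plus digits and version) and that every sequence is declared in a "##sequence-region" line, as in GFF3. The function name seq_to_assembly is hypothetical.

```shell
# Print "sequence accession <TAB> assembly accession" for one downloaded GFF.
# Assumptions: the file name starts with the assembly accession
# (e.g. GCF_000026345.1_...), and each sequence has a "##sequence-region" line.
seq_to_assembly() {
    base=${1##*/}                                    # strip the directory part
    asm=$(printf '%s\n' "$base" | cut -d_ -f1,2)     # keep e.g. GCF_000026345.1
    gzip -dc "$1" |
        awk -v asm="$asm" -v OFS='\t' '$1 == "##sequence-region" { print $2, asm }'
}

# Usage:
# for f in *_genomic.gff.gz; do seq_to_assembly "$f"; done > seq2assembly.tsv
```

The resulting table can then be searched for an accession such as NC_011750.1 to find the assembly it belongs to.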