Question: Getting feature annotations for all NCBI RefSeq sequences
3 months ago, bitpir wrote:

Hi, I was wondering if there's a way to download all feature annotations (gene, CDS, rRNA, tRNA, ncRNA, repeat_region) for all the RefSeq sequences (ftp://ftp.ncbi.nih.gov/refseq/release/) from NCBI? I can't seem to find them anywhere on the web or on the NCBI site. Something like GFF for GenBank sequences would be great. Thanks!

cds refseq annotation ncbi • 204 views
3 months ago, genomax (United States) wrote:

Get the assembly summary file for NCBI RefSeq genomes.

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

Grab the FTP path (and/or any other fields you need) from this file for each genome. The path is field 20; the `!/^#/` pattern skips the header lines that start with `#`.

awk -F '\t' '!/^#/ {print $20}' assembly_summary_refseq.txt > ftp_paths

You should get something like this:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/545/GCF_000599545.1_ASM59954v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/565/GCF_000599565.1_TruePyoMS2391.0
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/605/GCF_000599605.1_HJM029
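
If you would rather not rely on the path being field 20, here is a minimal sketch that looks the column up by its header name. It assumes the header line starts with `#` and contains a field called `ftp_path`, and that rows without a directory carry `na` in that field. The function name `extract_ftp_paths` is made up for this example.

```shell
#!/usr/bin/env bash
# Print the ftp_path column of an assembly summary file, locating the column
# by header name rather than by a fixed field number.
# Assumption: a "#"-prefixed header line contains a field named "ftp_path".
extract_ftp_paths() {
    awk -F '\t' '
        /^#/ {
            # Header or comment line: remember where ftp_path lives, print nothing.
            for (i = 1; i <= NF; i++) if ($i == "ftp_path") col = i
            next
        }
        # Data line: print the path unless it is empty or the "na" placeholder.
        col && $col != "" && $col != "na" { print $col }
    ' "$1"
}
```

Usage would be `extract_ftp_paths assembly_summary_refseq.txt > ftp_paths`. If no `ftp_path` header is found, nothing is printed, which is easier to notice than a silently wrong column.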

From each FTP path directory you should be able to get the *_genomic.gff.gz file for that genome.
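
A rough sketch of that last step, assuming the annotation file is named after the last component of the directory path plus `_genomic.gff.gz` (for example `GCF_000599545.1_ASM59954v1_genomic.gff.gz`). The helper names `gff_url` and `fetch_gffs` are hypothetical, and not every assembly directory is guaranteed to contain a GFF, so expect some failed downloads.

```shell
#!/usr/bin/env bash
# Build the annotation URL for one assembly directory.
# Assumption: file name = <last path component>_genomic.gff.gz
gff_url() {
    local dir="${1%/}"          # drop a trailing slash, if any
    local base="${dir##*/}"     # e.g. GCF_000599545.1_ASM59954v1
    printf '%s/%s_genomic.gff.gz\n' "$dir" "$base"
}

# Download the GFF for every path listed in a file (one path per line).
fetch_gffs() {
    local path
    while IFS= read -r path; do
        [ -n "$path" ] || continue          # ignore blank lines
        wget -nc "$(gff_url "$path")"       # -nc: skip files already downloaded
    done < "$1"
}
```

With the list from the awk step, `fetch_gffs ftp_paths` would then pull the files one by one into the current directory.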

written 3 months ago by genomax

Awesome! Thank you so much for your answer, this is super helpful! Are the sequences from the RefSeq release represented in assembly_summary_refseq.txt? And in the case of viruses, e.g. https://www.ncbi.nlm.nih.gov/nuccore/LC340031.1, is there a similar assembly_summary file where I can get all the annotations? Is there also an NC_ to assembly type of file somewhere? I can see it when I search for the NC number but can never find the file list easily... Thanks a lot for your help!

written 3 months ago by bitpir

There is a similar summary file for viruses. You can find that here.

What exactly do you mean by NC_ to assembly type? Can you give an example?

written 3 months ago by genomax

Great, thank you! My virus db includes the "View all RefSeq and Neighbor nucleotide records" set; will all of those viruses be captured in the summary file? By NC_ to assembly I meant mapping every NC_ accession number to its assembly accession number (e.g. NC_011750.1 --> GCF_000026345.1). I have downloaded the RefSeq genomic sequences, and I'm kind of working backwards to see which NC_ is associated with which assembly number and to get their respective annotations.

written 3 months ago by bitpir

Ah! Just found the answer to my question, for the second part at least :)

written 3 months ago by bitpir
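
The thread does not say which answer bitpir found. One possible way to get the NC_ to assembly mapping, sketched here under assumptions, is to read it back out of the downloaded GFF files: column 1 of each feature line is the sequence accession, and the file name is assumed to start with the assembly accession (GCF_ plus digits and version). The function name `seq_to_assembly` is made up for illustration.

```shell
#!/usr/bin/env bash
# Print "sequence_accession<TAB>assembly_accession" for one *_genomic.gff.gz file.
# Assumptions: the file name begins with the assembly accession
# (e.g. GCF_000026345.1_...), and GFF column 1 holds the sequence accession.
seq_to_assembly() {
    local file="$1"
    local name="${file##*/}"                        # strip the directory part
    local asm
    asm=$(printf '%s\n' "$name" | cut -d_ -f1,2)    # keep "GCF_<digits>.<version>"
    gzip -dc "$file" | awk -F '\t' -v asm="$asm" '
        /^#/ || NF < 9 { next }             # skip directives and non-feature lines
        !seen[$1]++ { print $1 "\t" asm }   # report each sequence accession once
    '
}
```

Looping it over all files, e.g. `for f in *_genomic.gff.gz; do seq_to_assembly "$f"; done > seq2assembly.tsv`, would give a two-column lookup table that can be searched by NC_ number.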