Question

RefSeq Annotation Report - Gene and Feature Statistics

3

Entering edit mode

3.0 years ago

mglasena ▴ 40

Hello,

I have a question about RefSeq annotation reports, particularly the one for the purple urchin. The RefSeq Annotation Report indicates that there are 258,355 exons. enter image description here However, the gff file for this assembly has 442,528 lines in which column three has the value of "exon." What would explain this discrepancy?

For example, the following code returns "442528."

wget -O - -o /dev/null https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7668/102/GCF_000002235.5_Spur_5.0/GCF_000002235.5_Spur_5.0_genomic.gff.gz | gunzip --stdout | awk '$3 == "exon"' | wc -l

I see that there is a note on the annotation report indicating that the counts do not include pseudogenes. There is also an additional note next to the exons row that states:

"Exons in mRNAs, misc_RNAs and ncRNAs of class lncRNA. Does not include tRNAs, rRNAs or ncRNAs of class other than lncRNA. Exons shared by multiple transcripts are counted once."

I'm not sure that this could account for the 184,173 exon difference. I am hoping to compute sequencing coverage statistics across exons.

gff ncbi gff3 annotation refseq • 1.2k views

ADD COMMENT • link 3.0 years ago by mglasena ▴ 40

score 1 · Answer 1 · 2021-05-13

1

Entering edit mode

3.0 years ago

Jeremy Leipzig 22k

I don't think you can rely on exon identifiers because they are attached to redundant transcripts. If you just count unique exon-start-stop I think you can get pretty close, though not sure why I can't nail 258,355

wget -O - -o /dev/null https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7668/102/GCF_000002235.5_Spur_5.0/GCF_000002235.5_Spur_5.0_genomic.gff.gz | gunzip --stdout | grep -v tRNA | grep -v rRNA | grep -v misc_RNA | cut -f 3,4,5 | grep exon | sort | uniq | wc -l
  257571

ADD COMMENT • link 3.0 years ago by Jeremy Leipzig 22k

1

Entering edit mode

RefSeq version of the genome is in a different location but produces identical results:

 wget -O - -o /dev/null https://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Strongylocentrotus_purpuratus/annotation_releases/current/GCF_000002235.5_Spur_5.0/GCF_000002235.5_Spur_5.0_genomic.gff.gz | gunzip --stdout | grep -v tRNA | grep -v rRNA | grep -v misc_RNA | cut -f 3,4,5 | grep exon | sort | uniq | wc -l
257571

RefSeq annotation report says

Annot

Excluding just tRNA and rRNA gets you to the closest number.

$ wget -O - -o /dev/null https://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Strongylocentrotus_purpuratus/annotation_releases/current/GCF_000002235.5_Spur_5.0/GCF_000002235.5_Spur_5.0_genomic.gff.gz | gunzip --stdout | grep -v tRNA | grep -v rRNA | cut -f 3,4,5 | grep exon | sort | uniq | wc -l
**258710**

ADD REPLY • link 3.0 years ago by GenoMax 141k

0

Entering edit mode

Thanks. This was helpful. I realized what I probably want is just the exons in coding transcripts. The closest I was able to get to 245,575 was 246,202

wget -O - -o /dev/null https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7668/102/GCF_000002235.5_Spur_5.0/GCF_000002235.5_Spur_5.0_genomic.gff.gz| gunzip --stdout | awk '$3 == "exon"' | grep 'ID=exon-XM_\|ID=exon-NM_' | cut -f 1,3,4,5 | sort | uniq | wc -l
246202

I'm not sure if grep -v rRNA is appropriate because it would exclude too many cases. For example, this pre-rRNA-processing protein would be excluded

NW_022145612.1  Gnomon  exon    39921083    39921227    .   -   .   ID=exon-XM_030999899.1-14;Parent=rna-XM_030999899.1;Dbxref=GeneID:594000,Genbank:XM_030999899.1;gbkey=mRNA;gene=LOC594000;product=pre-rRNA-processing protein TSR1 homolog;transcript_id=XM_030999899.1

I believe the tRNAs and rRNAs can be filtered with:

grep -v gbkey=rRNA
grep -v gbkey=tRNA

I'm not sure how to distinguish between ncRNAs of class lncRNA and ncRNAs of class other than lncRNA.

ADD REPLY • link 3.0 years ago by mglasena ▴ 40