Question: Where can I get a breakdown of RefSeq annotation statistics?
b10hazard wrote:

GENCODE has a wonderful breakdown of the number of coding and non-coding genes and transcripts here:

http://www.gencodegenes.org/stats.html

Does a similar breakdown exist for RefSeq anywhere? Thanks!

gencode refseq annotation • 351 views
modified 4 months ago by genomax • written 4 months ago by b10hazard
genomax wrote:

If you take a look at this hg19/GRCh37 GFF file from NCBI, here is what is in there (counts per feature type, i.e. column 3):

495245 CDS
     1 D_loop
 33314 Genomic
     1 RNase_MRP_RNA
     1 RNase_P_RNA
     2 SRP_RNA
     4 Y_RNA
    22 antisense_RNA
 19086 cDNA_match
618452 exon
 30661 gene
  6405 lnc_RNA
 49861 mRNA
 22621 match
  3097 miRNA
    31 ncRNA
  2046 primary_transcript
    23 rRNA
   297 region
     1 sequence_feature
    72 snRNA
   393 snoRNA
   698 tRNA
     1 telomerase_RNA
  5732 transcript
     3 vault_RNA

For GRCh38, a similar file can be found here, with the following statistics:

1413848 CDS
      1 D_loop
  33893 Genomic
     26 Genomic%2CXM/XP/XR
      1 RNase_MRP_RNA
      1 RNase_P_RNA
      2 SRP_RNA
     15 V_gene_segment
      4 Y_RNA
     22 antisense_RNA
  13572 cDNA_match
     24 centromere
1856502 exon
  43504 gene
  28117 lnc_RNA
 114575 mRNA
  22834 match
   3038 miRNA
     31 ncRNA
   2025 primary_transcript
     23 rRNA
    558 region
    304 sequence_feature
     62 snRNA
    389 snoRNA
    629 tRNA
      1 telomerase_RNA
  16011 transcript
      3 vault_RNA
modified 4 months ago • written 4 months ago by genomax

Good answer - I was not aware that RefSeq had been compiling this info

written 4 months ago by Kevin Blighe

Strictly speaking this is not RefSeq, but the files come from NCBI's human genome resource, so the numbers should be close enough.

written 4 months ago by genomax

How did you parse that information from the GFF3 file?

written 3 months ago by b10hazard
cat interim_GRCh38.p10_top_level_2017-01-13.gff3 | awk '{print $3}' | sort | uniq -c
written 3 months ago by genomax
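
Note on the one-liner above: awk splits on any whitespace by default, so a source column that contains a space (NCBI GFF3 files use values such as `Curated Genomic`) shifts the fields by one. That is a likely explanation for the `Genomic` rows in the tables. Below is a minimal Python sketch that splits on tabs only and skips comment lines. The function name `count_feature_types` and the sample records are made up for illustration.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 feature types (column 3), splitting on tabs only."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and directive/comment lines such as "##gff-version 3".
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A well-formed GFF3 record has nine tab-separated columns.
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


# Illustrative records only; the third one has a space in the source column.
sample = [
    "##gff-version 3",
    "NC_000001.11\tBestRefSeq\tgene\t11874\t14409\t.\t+\t.\tID=gene0",
    "NC_000001.11\tBestRefSeq\texon\t11874\t12227\t.\t+\t.\tID=id1",
    "NC_000001.11\tCurated Genomic\texon\t131068\t131129\t.\t+\t.\tID=id2",
]

for feature_type, n in sorted(count_feature_types(sample).items()):
    print(f"{n:>7} {feature_type}")
```

To run it on a real file, pass an open file handle, for example `count_feature_types(open("interim_GRCh38.p10_top_level_2017-01-13.gff3"))`. Totals for some feature types may then differ from the tables above, since rows previously counted as `Genomic` would be assigned to their actual type.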

That works! Thanks!

written 3 months ago by b10hazard
Kevin Blighe wrote:

I don't believe RefSeq does as good a breakdown of transcript types as GENCODE. However, you may be interested in the following resources:

Generally, I think that you'll find that GENCODE is more comprehensive for non-coding genes; however, for the majority of these, the exact function is entirely unknown. Most people filter them out of, for example, RNA-seq analyses, in part to reduce the number of tests and so soften the multiple-testing (false discovery rate) correction. On the other hand, RefSeq has the feel of a well-curated resource.

Kevin

written 4 months ago by Kevin Blighe

What I was really hoping to get was the number of full-length non-coding transcripts that RefSeq has for hg19. Is there any way to get this information?

written 4 months ago by b10hazard

The first link above is to a published manuscript in which GENCODE transcripts were compared to those of RefSeq, using GRCh37 / hg19 transcripts. Additional File 3 of that paper contains a table comparing GENCODE to RefSeq NR, which according to RefSeq are non-protein-coding transcripts (or transcripts unlikely to have protein-coding potential).

Additional files can be found here: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2#MOESM6

written 4 months ago by Kevin Blighe
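
As a rough sketch of how a non-coding transcript count could be pulled from a GFF3 file directly: RefSeq accession prefixes distinguish curated mRNAs (`NM_`), curated non-coding RNAs (`NR_`), and model mRNAs and non-coding RNAs (`XM_`, `XR_`). The Python below tallies distinct transcript accessions by prefix. It assumes the accession is stored in a `transcript_id` or `Name` attribute in column 9, which should be checked against the actual file. The function names and sample records are hypothetical, and the prefix alone says nothing about whether a transcript is full length.

```python
from collections import defaultdict

# RefSeq RNA accession prefixes and what they denote.
PREFIXES = {
    "NM_": "curated mRNA",
    "NR_": "curated non-coding RNA",
    "XM_": "model mRNA",
    "XR_": "model non-coding RNA",
}


def parse_attributes(column9):
    """Turn 'key=value;key=value' from GFF3 column 9 into a dict."""
    attrs = {}
    for pair in column9.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def count_transcripts_by_prefix(lines):
    """Count distinct transcript accessions per RefSeq prefix category."""
    seen = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = parse_attributes(fields[8])
        # Assumption: the accession lives in transcript_id, else in Name.
        accession = attrs.get("transcript_id") or attrs.get("Name", "")
        label = PREFIXES.get(accession[:3])
        if label:
            # A set keeps each accession once, even if many exons carry it.
            seen[label].add(accession)
    return {label: len(accessions) for label, accessions in seen.items()}


# Illustrative records only.
sample = [
    "NC_000001.10\tBestRefSeq\ttranscript\t11874\t14409\t.\t+\t.\tID=rna0;Name=NR_046018.2",
    "NC_000001.10\tBestRefSeq\texon\t11874\t12227\t.\t+\t.\tID=id1;transcript_id=NR_046018.2",
    "NC_000001.10\tBestRefSeq\tmRNA\t69091\t70008\t.\t+\t.\tID=rna1;Name=NM_001005484.1",
]

print(count_transcripts_by_prefix(sample))
```

Counting distinct accessions rather than rows avoids inflating the totals when the same transcript is placed on several exons or on alternate loci.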

Thanks for your help with this, Kevin. Those papers were excellent resources and they covered a lot of ground. I accepted GenoMax's answer mostly because that output file provided a very flexible way for me to extract the metrics I was looking for.

written 3 months ago by b10hazard

You are able to "accept" more than one answer so feel free to accept @Kevin's too.

written 3 months ago by genomax

No problem, b10hazard. That's the nature of the game here: it's not a competition to see who can have the most accepted answers. I was actually about to say that you should accept GenoMax's answer because it was a better fit for your question. GenoMax is also much more experienced than I.

Thanks for the diplomacy GenoMax :)

written 3 months ago by Kevin Blighe