Comprehensive Gene Sets
10.9 years ago
Sequencegeek ▴ 740

I often find myself checking whether a mapped region overlaps with known regions of the genome. To do this, I use a set of genes that includes merged transcripts from UCSC, Ensembl, RefSeq, Gencode, and Vegagene.
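As a rough illustration of the overlap check described above, here is a minimal Python sketch. It assumes transcripts have already been loaded as half-open (start, end) pairs for a single chromosome; the function names and coordinates are made up for the example and do not come from any particular annotation file.

```python
from bisect import bisect_right

def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def overlaps(merged, start, end):
    """Return True if [start, end) overlaps any interval in a merged list."""
    starts = [s for s, _ in merged]
    # Index of the last merged interval that begins before the query ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and merged[i][1] > start

# Hypothetical transcripts pooled from several annotation sources.
transcripts = [(100, 200), (150, 300), (500, 600)]
merged = merge_intervals(transcripts)
print(merged)
print(overlaps(merged, 250, 260))
print(overlaps(merged, 300, 500))
```

In practice you would keep one merged list per chromosome (and per strand if that matters), or use an existing interval tool instead of hand-rolled code.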

Usually this works just fine, but now I am looking for atypical types of transcripts such as siRNAs, lincRNAs, and all small RNA types. I'm not sure if the above annotations are comprehensive enough.

My questions to you are:

Can we (as a community) create a list of resources/websites where we can gather these genes?

How do you create comprehensive gene sets?

Here is a working list:

General Annotations:
UCSC knownGene
Refseq (via UCSC)
Gencode (via UCSC)
Vegagene (via UCSC)

gene annotation database • 4.8k views
6.8 years ago

For RefSeq, I would go to NCBI (the creator of RefSeq) and download it from their FTP site instead of from UCSC. The thing is that UCSC re-aligns RefSeqs, and the resulting models differ from the original ones.

The original RefSeq alignments are produced by manual curation of automatic Gnomon models. These come from a very powerful Genome Annotation Pipeline, aka Gpipe, which is used for eukaryotic and now prokaryotic annotation, and its Splign aligner. Gpipe takes different sorts of data into consideration, including curation. Therefore both the Gnomon models and the manually curated RefSeq models are of good quality.

UCSC takes RefSeq sequences and re-aligns them to the genomes using BLAT, which is not as powerful as Gpipe/Gnomon/Splign. Most of the problems arise because exons with indels are converted into two exons with a micro-intron in the middle.
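One way to spot the micro-intron artifact is to scan each transcript model for implausibly short gaps between consecutive exons. The sketch below assumes exons are half-open (start, end) pairs on one strand; the 10 bp cutoff and the coordinates are illustrative assumptions, not values used by UCSC or NCBI.

```python
def find_micro_introns(exons, max_len=10):
    """Return (intron_start, intron_end) for gaps of at most max_len bases.

    exons: iterable of half-open (start, end) genomic coordinates.
    max_len: assumed cutoff for calling a gap a micro-intron.
    """
    exons = sorted(exons)
    micro = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        gap = next_start - prev_end
        if 0 < gap <= max_len:
            micro.append((prev_end, next_start))
    return micro

# Hypothetical model: the first two exons are separated by only 2 bp,
# which looks like an indel that was turned into an intron.
model = [(1000, 1100), (1102, 1200), (5000, 5100)]
suspicious = find_micro_introns(model)
print(suspicious)
```

Transcripts that return a non-empty list are candidates for comparison against the original NCBI alignment.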


Re: "UCSC re-aligns RefSeqs and these models differ from the original ones."... I investigated this pretty thoroughly last year. The two essential results are: 1) ~886 transcripts have *significant* genome coordinate discrepancies between RefSeq (splign) and UCSC (BLAT); and 2) When there's a discrepancy, splign's alignments are more often more parsimonious than BLAT's based purely on sequence identity (roughly 30:1 bias). Details in this slideshare deck: Data current as of Feb 2014.


Yes, there is a difference due to obvious reasons. I agree with your conclusions, Reece. Also, thank you for sharing your Slideshare link and HGVS code, very interesting indeed!

10.9 years ago

Ensembl also contains information on small RNAs in addition to transcripts. For instance, this BioMart example query retrieves locations for several small RNA types:

Ensembl Transcript ID  Chromosome Name  Transcript Start (bp)  Transcript End (bp)  Transcript Biotype
ENST00000508921        17               74730199               74733413             lincRNA
ENST00000508979        17               75464643               75468852             lincRNA
ENST00000510620        17               75543023               75559325             lincRNA
ENST00000510484        17               75554224               75559074             lincRNA
ENST00000504504        17               75718954               75724641             lincRNA
ENST00000507040        17               77889984               77900524             lincRNA
ENST00000505044        17               78313698               79329659             lincRNA
ENST00000501711        17               78775440               78779420             lincRNA
ENST00000499078        17               79604197               79606203             lincRNA
ENST00000500627        17               79885705               79888628             lincRNA
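If the BioMart result is exported as tab-separated text with the header shown above, a few lines of Python can pull out the transcripts of a given biotype. This is only a sketch: it assumes a TSV export with exactly these column names, and the sample reuses the first two rows of the table.

```python
import csv
import io

# First two rows of the table above, as a tab-separated export would look.
SAMPLE = (
    "Ensembl Transcript ID\tChromosome Name\tTranscript Start (bp)\t"
    "Transcript End (bp)\tTranscript Biotype\n"
    "ENST00000508921\t17\t74730199\t74733413\tlincRNA\n"
    "ENST00000508979\t17\t75464643\t75468852\tlincRNA\n"
)

def read_transcripts(handle, biotypes):
    """Return (id, chrom, start, end) tuples for rows with a wanted biotype."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        if row["Transcript Biotype"] in biotypes:
            rows.append((
                row["Ensembl Transcript ID"],
                row["Chromosome Name"],
                int(row["Transcript Start (bp)"]),
                int(row["Transcript End (bp)"]),
            ))
    return rows

lincrnas = read_transcripts(io.StringIO(SAMPLE), {"lincRNA"})
for record in lincrnas:
    print(record)
```

For a real export, replace `io.StringIO(SAMPLE)` with an open file handle and pass whichever set of biotype labels your Ensembl release uses.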

Thanks for the response. Ensembl definitely makes it easy to retrieve a list of a specific type of transcript.

However, it's my understanding that Ensembl and UCSC are incomplete. I'm not entirely sure how UCSC and Ensembl construct their annotations, but I believe that, for instance, the genes from are not all annotated in Ensembl. Does anyone have an idea how many of them are, and why?


Ensembl does have documentation about the techniques they use. Here are the details for small RNAs: For pseudogenes, look under the 'Other transcripts' section here: This is different than the Gerstein approach: In general, I wouldn't characterize either as incomplete; they are just different approaches to an unsolved problem.

10.7 years ago
Felix ▴ 50

One thing that many people might not realize is the relation between Ensembl, Vega/Havana and Gencode.

Through the Gencode project, Ensembl now incorporates the manual gene annotation provided by Vega/Havana into the automatic annotation. For most cases the data is the same between Ensembl (fetched via API, database or BioMart) and Gencode (fetched from the FTP site or from UCSC). Current differences are that Gencode excludes the haplotype annotation and adds pseudogene models from the Yale and UCSC ENCODE groups. The UCSC "2way Pseudogenes" track provides those additional models where these two sets agree.

RefSeq models are incorporated in the Ensembl and Havana gene build processes. The different small RNA gene types are included in the Ensembl set.

Access to the gene set is also described here, but I find it most convenient to use the Ensembl Perl API.

