I often find myself checking whether a mapped region overlaps with known regions of the genome. To do this, I use a gene set that includes merged transcripts from UCSC, Ensembl, RefSeq, Gencode, and Vegagene.
Usually this works just fine, but now I am looking for less typical transcript classes such as siRNAs, lincRNAs, and all small RNA types. I'm not sure the annotations above are comprehensive enough.
My questions to you are:
Can we (as a community) create a list of resources/websites where we can gather these genes?
For RefSeq, I would go to NCBI (the creator of RefSeq) and download it from their FTP site instead of from UCSC. The thing is that UCSC re-aligns the RefSeq sequences, and the resulting models differ from the original ones.
The original RefSeq models come from manual curation of automatic Gnomon models. Those are produced by a very powerful Genome Annotation Pipeline, aka Gpipe, and its Splign aligner; the pipeline is used for eukaryotic and now prokaryotic annotation: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/ and http://www.ncbi.nlm.nih.gov/genome/annotation_prok/. Gpipe takes different sorts of data into consideration, including curation. Therefore both the Gnomon models and the manually curated RefSeq models are of good quality.
UCSC takes the RefSeq sequences and re-aligns them to the genomes using BLAT, which is not as powerful as Gpipe/Gnomon/Splign. The most common cause of problems is that exons containing indels are converted into two exons with a micro-intron in the middle.
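If you want to see how many transcripts in a downloaded set are affected by this, a rough check is to look for implausibly short gaps between consecutive exons. This is only a minimal sketch: the function name, the 1-based inclusive coordinates, and the 10 bp cutoff are my own assumptions, not anything UCSC or NCBI defines, so adjust them to your data.

```python
def find_micro_introns(exons, max_intron_len=10):
    """Report suspiciously short introns in one transcript model.

    exons: list of (start, end) tuples, 1-based inclusive coordinates.
    max_intron_len: assumed cutoff in bp; gaps this short or shorter
    are reported as likely alignment artifacts.
    Returns a list of (intron_start, intron_end, length).
    """
    ordered = sorted(exons)
    hits = []
    # Walk over consecutive exon pairs and measure the gap between them.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end - 1
        if 0 < gap <= max_intron_len:
            hits.append((prev_end + 1, next_start - 1, gap))
    return hits


# Made-up example: the 2 bp gap between the first two exons is flagged,
# the 699 bp intron before the last exon is not.
example_exons = [(100, 200), (203, 300), (1000, 1100)]
micro = find_micro_introns(example_exons)
```

Transcripts that show such gaps in the UCSC version but not in the NCBI version of the same RefSeq are good candidates for the indel problem described above.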
Through the Gencode project, Ensembl now incorporates the manual gene annotation provided by Vega/Havana into its automatic annotation. In most cases the data is the same between Ensembl (fetched via the API, the database, or BioMart) and Gencode (fetched from the FTP site or from UCSC). The current differences are that Gencode excludes the haplotype annotation and adds pseudogene models from the Yale and UCSC ENCODE groups. The UCSC "2way Pseudogenes" track provides those additional models where these two sets agree.
RefSeq models are incorporated in the Ensembl and Havana gene build processes. The different small RNA gene types are included in the Ensembl set.
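If you work from a downloaded GTF file rather than the API, the small RNA genes can be pulled out by biotype. The following is a minimal sketch, not an official recipe. It assumes the gene-level biotype is stored in an attribute called gene_biotype (Ensembl GTF) or gene_type (Gencode GTF), and the set of biotype labels is only illustrative, since the labels change between releases. The helper names and the two sample lines are made up.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

# Illustrative labels only; check the biotypes used in your release.
SMALL_RNA_BIOTYPES = {"miRNA", "snoRNA", "snRNA"}


def gene_biotype(attribute_field):
    """Return the gene biotype from a GTF attribute column, or None."""
    attrs = dict(ATTR_RE.findall(attribute_field))
    # Assumed attribute names: gene_biotype (Ensembl), gene_type (Gencode).
    return attrs.get("gene_biotype") or attrs.get("gene_type")


def select_genes(lines, wanted=SMALL_RNA_BIOTYPES):
    """Yield (chrom, start, end, strand) for gene records of wanted biotypes."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep gene-level records only
        if gene_biotype(fields[8]) in wanted:
            yield (fields[0], int(fields[3]), int(fields[4]), fields[6])


# Made-up records, only to show the expected input shape.
sample = [
    '#!comment line\n',
    '1\tensembl\tgene\t100\t180\t.\t+\t.\tgene_id "G1"; gene_biotype "miRNA";\n',
    '1\tensembl\tgene\t500\t900\t.\t-\t.\tgene_id "G2"; gene_biotype "protein_coding";\n',
]
small_rna_genes = list(select_genes(sample))
```

For a real file you would pass an open file handle instead of the sample list, and the resulting intervals can go straight into whatever overlap check you already use.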
Access to the gene set is also described here, but I find it most convenient to use the Ensembl Perl API.