I'm interested in analyzing gene and mRNA annotations for the human genome. In addition to the exon/intron structure and CDS for each mRNA, I would like to know which mRNAs correspond to the same gene.
I first used genePredToGtf from Jim Kent's tools to obtain the annotations in GTF format, and then tried working directly with the knownGene table data (knownGene.txt.gz). However, I've encountered a few issues that make it difficult to resolve gene-mRNA relationships in these data.
- It seems the gene_id attribute is not unique: some distinct gene loci share the same gene_id value. Consider Q15005. If you run grep Q15005 knownGene.txt, you'll find a gene on chr1 and a different one on chr11 that both use this gene_id. A little digging revealed that the one on chr11 is a legitimate protein-coding gene and the one on chr1 is a pseudogene, but the knownGene data does not include this information.
- I'm only interested in protein-coding genes. Many entries in knownGene have CDS start = CDS end, and in the GTF files these simply have no CDS features. I assume I can safely ignore all of these as pseudogenes or other non-protein-coding genes. However, as the previous example shows, some pseudogenes do have a CDS listed, so the presence of a CDS alone does not identify a protein-coding gene. There seems to be no other way to discriminate genes that code for proteins from genes that do not.
- Some transcripts that appear to belong to the same gene/locus have distinct gene_id values. Most (if not all) of these seem to belong to non-protein-coding genes, however. Since I'm only interested in protein-coding genes, this may not be a problem.
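To make the first two checks concrete, here is a rough Python sketch of how I've been inspecting the table. It rests on some assumptions: that knownGene.txt is tab-separated with the column order listed in COLUMNS (please verify this against the table's schema), and that the Q15005-style identifier sits in the proteinID column. The function names are my own and not part of any UCSC tool.

```python
from collections import defaultdict

# Assumed column order of knownGene.txt; check it against the schema
# (.sql) file that accompanies the download before relying on it.
COLUMNS = ["name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
           "cdsEnd", "exonCount", "exonStarts", "exonEnds", "proteinID",
           "alignID"]


def parse_known_gene(lines):
    """Yield one dict per transcript row from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        for key in ("txStart", "txEnd", "cdsStart", "cdsEnd"):
            row[key] = int(row[key])
        yield row


def has_cds(row):
    """True when the row has a non-empty CDS (cdsStart != cdsEnd)."""
    return row["cdsStart"] != row["cdsEnd"]


def ids_on_multiple_chroms(rows, id_field="proteinID"):
    """Map each ID seen on more than one chromosome to those chromosomes."""
    chroms = defaultdict(set)
    for row in rows:
        if row[id_field]:  # skip rows with an empty identifier
            chroms[row[id_field]].add(row["chrom"])
    return {k: sorted(v) for k, v in chroms.items() if len(v) > 1}


# Example use on the decompressed file:
#   with open("knownGene.txt") as handle:
#       rows = list(parse_known_gene(handle))
#   coding = [r for r in rows if has_cds(r)]
#   print(ids_on_multiple_chroms(coding))
```

Note that this only catches identifiers shared across chromosomes; two distinct loci on the same chromosome that share an identifier would need a coordinate-based check, and nothing here can tell a pseudogene with a listed CDS apart from a real protein-coding gene.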
I must admit I'm quite surprised that this information isn't more easily accessible for the human genome. It seems like resolving gene-mRNA relationships should be an elementary task. Am I looking in the wrong place and/or using the wrong tools to look for this information?