Question: How To Resolve The Gene To Mrnas Relationships For The Human Genome?
7.4 years ago, Daniel Standage (3.9k), Davis, California, USA, wrote:

I'm interested in analyzing gene and mRNA annotations for the human genome. In addition to the exon/intron structure and CDS for each mRNA, I would like to know which mRNAs correspond to the same gene.

I first tried using genePredToGtf from Jim Kent's tools to download the annotations in GTF format, and then tried working directly with the knownGene table data (knownGene.txt.gz). However, I've encountered a few issues that make it difficult to resolve gene-mRNA relationships in these data.

  • It seems the gene_id attribute is not unique--that is, some distinct gene loci share the same gene_id value. Consider Q15005--if you run grep Q15005 knownGene.txt, you'll find a gene on chr1 and a different one on chr11 that both use this gene_id. A little digging revealed that the one on chr11 is a legitimate protein-coding gene, and the one on chr1 is a pseudogene, but the knownGene data does not include this information.
  • I'm only interested in protein-coding genes. Many entries in knownGene have CDS start = CDS end, and in the GTF files these simply have no CDS features. I assume I can safely ignore all of these as pseudogenes or other non-protein-coding genes. However, as the previous example showed, some pseudogenes do have a CDS listed. There seems to be no other way to discriminate genes that code for proteins from genes that do not.
  • Some transcripts that appear to belong to the same gene/locus have distinct gene_id values. Most (if not all) of these seem to belong to non-protein-coding genes, however. Since I'm only interested in protein-coding genes, this may not be a problem.
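
The first two points can be checked by scanning the table directly. The following is a minimal sketch, assuming knownGene.txt uses the usual 12 tab-separated columns (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, proteinID, alignID) and that values like Q15005 sit in the proteinID column. The function names and the sample rows are invented for illustration; the coordinates are not real knownGene records.

```python
from collections import defaultdict

# Assumed knownGene column order; verify against the table's schema before use.
COLUMNS = ["name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
           "cdsEnd", "exonCount", "exonStarts", "exonEnds", "proteinID", "alignID"]

def parse_known_gene(lines):
    """Yield one dict per non-empty knownGene row."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield dict(zip(COLUMNS, line.split("\t")))

def has_cds(row):
    """Rows with cdsStart == cdsEnd carry no CDS annotation."""
    return int(row["cdsStart"]) != int(row["cdsEnd"])

def ids_on_multiple_chroms(rows):
    """Map each proteinID seen on more than one chromosome to those chromosomes."""
    chroms = defaultdict(set)
    for row in rows:
        if row["proteinID"]:
            chroms[row["proteinID"]].add(row["chrom"])
    return {pid: sorted(c) for pid, c in chroms.items() if len(c) > 1}

# Hypothetical rows with invented coordinates.
SAMPLE = [
    "tx1\tchr1\t+\t100\t900\t150\t800\t2\t100,500,\t300,900,\tQ15005\ttx1",
    "tx2\tchr11\t-\t2000\t5000\t2100\t4900\t2\t2000,4000,\t3000,5000,\tQ15005\ttx2",
    "tx3\tchr2\t+\t10\t400\t10\t10\t1\t10,\t400,\t\ttx3",
]
rows = list(parse_known_gene(SAMPLE))
coding = [r["name"] for r in rows if has_cds(r)]
ambiguous = ids_on_multiple_chroms(rows)
print(coding)
print(ambiguous)
```

This only catches IDs shared across chromosomes; two distinct loci on the same chromosome would need a coordinate-based check, and a pseudogene with a listed CDS still passes the `has_cds` filter.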

I must admit I'm quite surprised that this information isn't more easily accessible for the human genome. It seems like resolving gene-mRNA relationships should be an elementary task. Am I looking in the wrong place and/or using the wrong tools to look for this information?

annotation human • 2.3k views
modified 7.4 years ago by Devonj (90) • written 7.4 years ago by Daniel Standage (3.9k)

Part of the solution to this problem is to realize that identifiers such as Q15005 are UniProt accession numbers and therefore not suitable as a "gene_id". Ensembl/BioMart is a good solution as outlined in Michael's answer.

written 7.3 years ago by Neilfws (49k)

7.4 years ago, Michael Dondrup (47k), Bergen, Norway, wrote:

How about Ensembl BioMart? Something like this query, maybe?

By the way, it seems the exported URL contains a session ID, so I can see what you are selecting ;) I think this is a bug. Just remove the session ID between martview/ and ?. The following link is the real starting point and should also work after the session expires:

modified 7.4 years ago • written 7.4 years ago by Michael Dondrup (47k)

+1. Adding gene and transcript biotype helps to discriminate between protein-coding and non-protein-coding genes/transcripts. I can also add exon ID, but it doesn't look like I can select exon coordinates, nor is any CDS information available. So although this query seems to solve the main problem I've been having (gene-mRNA relationships), it doesn't provide some of the other basic information.
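
For what it's worth, once the query is exported as a tab-separated file, grouping transcripts by gene and filtering on biotype takes only a few lines. This is a rough sketch: the column headers (`gene_id`, `transcript_id`, `transcript_biotype`) and the IDs are placeholders I made up, since the real header names depend on the attributes selected and on the Ensembl release.

```python
import csv
import io
from collections import defaultdict

def transcripts_by_gene(handle, gene_col, tx_col, biotype_col,
                        biotype="protein_coding"):
    """Group transcript IDs by gene ID, keeping only rows whose
    transcript biotype matches `biotype`."""
    genes = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row[biotype_col] == biotype:
            genes[row[gene_col]].append(row[tx_col])
    return dict(genes)

# Hypothetical export; headers and IDs are invented for illustration.
EXPORT = (
    "gene_id\ttranscript_id\ttranscript_biotype\n"
    "G1\tT1\tprotein_coding\n"
    "G1\tT2\tprotein_coding\n"
    "G1\tT3\tretained_intron\n"
    "G2\tT4\tpseudogene\n"
)
mapping = transcripts_by_gene(io.StringIO(EXPORT), "gene_id",
                              "transcript_id", "transcript_biotype")
print(mapping)
```

Genes with no protein-coding transcript (G2 here) simply drop out of the result.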

written 7.4 years ago by Daniel Standage (3.9k)

If you choose "Sequences" instead of "Structures", you can select various kinds of sequences, including "Coding sequence", as FASTA files.

modified 7.4 years ago • written 7.4 years ago by Michael Dondrup (47k)

But this provides the actual sequences, not the annotation/coordinates of those sequences with respect to the genomic sequence.

written 7.4 years ago by Daniel Standage (3.9k)

That being said, if I look for "Structures" attributes instead of "Features" attributes, available exon information includes exon and CDS coordinates. So I would have to download and process two separate files, but it seems like everything I need is there. Thanks!

written 7.4 years ago by Daniel Standage (3.9k)

7.4 years ago, Devonj (90), Berkeley, CA, wrote:

At our company, we were encountering many of the same issues, so we recently switched from UCSC data to the NCBI data found here:

Or there is a more recently updated file here:

I believe this is the same data that is used to create the embedded chromosome maps you see at dbSNP and NCBI Gene (e.g. )

It covers more than just protein-coding genes and requires a bit of parsing to get it into a format like UCSC's, but for the most part it is quite good. (I could expound on a few issues we found if needed.)

written 7.4 years ago by Devonj (90)

+1. I was intending to generate GFF3 files anyway, so that's a bonus. Discriminating between protein-coding genes and other genes looks straightforward. There are 4 or 5 mRNAs with issues (overlapping or adjacent exons), but these are easy to ignore.
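
Roughly what I have in mind, as a sketch rather than a finished parser: it assumes standard GFF3 `ID` and `Parent` attributes in column 9 and feature types named `mRNA` and `exon`. The function names and sample records are made up, and real NCBI files may need extra handling.

```python
from collections import defaultdict

def parse_attributes(field):
    """Turn a GFF3 column-9 string such as 'ID=x;Parent=y' into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}

def index_gff3(lines):
    """Return (gene ID -> mRNA IDs, mRNA ID -> exon intervals) using Parent links."""
    mrnas = defaultdict(list)
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if ftype == "mRNA":
            mrnas[attrs["Parent"]].append(attrs["ID"])
        elif ftype == "exon":
            # An exon may list several parents, separated by commas.
            for parent in attrs["Parent"].split(","):
                exons[parent].append((start, end))
    return dict(mrnas), dict(exons)

def has_touching_exons(exon_list):
    """True if any two exons overlap or are directly adjacent
    (GFF3 coordinates are 1-based and inclusive)."""
    ordered = sorted(exon_list)
    return any(nxt[0] <= prev[1] + 1 for prev, nxt in zip(ordered, ordered[1:]))

# Hypothetical records; IDs and coordinates are invented.
SAMPLE = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=rna1;Parent=gene1",
    "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=rna2;Parent=gene1",
    "chr1\tsrc\texon\t100\t300\t.\t+\t.\tID=e1;Parent=rna1",
    "chr1\tsrc\texon\t500\t900\t.\t+\t.\tID=e2;Parent=rna1",
    "chr1\tsrc\texon\t100\t300\t.\t+\t.\tID=e3;Parent=rna2",
    "chr1\tsrc\texon\t301\t900\t.\t+\t.\tID=e4;Parent=rna2",
]
mrnas, exons = index_gff3(SAMPLE)
flagged = [m for m in mrnas["gene1"] if has_touching_exons(exons[m])]
print(mrnas)
print(flagged)
```

Genes that end up with at least one mRNA child are the candidates for protein-coding genes, and the flagged mRNAs are the ones to set aside.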

written 7.4 years ago by Daniel Standage (3.9k)

7.4 years ago, Devon Ryan (97k), Freiburg, Germany, wrote:

Don't use the UCSC GTF file; download the one from Ensembl, as it doesn't have the non-unique gene_id issue. I understand why the UCSC annotation is the way it is, but that makes things annoying in situations like yours (or when one tries to use DEXSeq on RNA-seq reads, for the same reasons).

written 7.4 years ago by Devon Ryan (97k)
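
If you want to confirm this for whichever GTF you end up with, a quick check is to see whether any gene_id turns up on more than one chromosome/strand pair. A minimal sketch, assuming the usual GTF column-9 layout of `key "value";` pairs; the function names and sample lines are invented, not taken from a real Ensembl or UCSC file.

```python
import re
from collections import defaultdict

ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def gtf_attributes(field):
    """Parse GTF column 9 (key "value"; pairs) into a dict;
    repeated keys keep the last value."""
    return dict(ATTR_RE.findall(field))

def summarize_gtf(lines):
    """Return (gene_id -> set of transcript_ids, gene_id -> set of (chrom, strand))."""
    transcripts = defaultdict(set)
    loci = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = gtf_attributes(cols[8])
        gene_id = attrs["gene_id"]
        loci[gene_id].add((cols[0], cols[6]))
        if "transcript_id" in attrs:
            transcripts[gene_id].add(attrs["transcript_id"])
    return transcripts, loci

# Hypothetical lines; Q_DUP mimics a gene_id reused at two loci.
SAMPLE = [
    '1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";',
    '1\tsrc\texon\t100\t250\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A2";',
    '1\tsrc\texon\t5000\t5200\t.\t+\t.\tgene_id "Q_DUP"; transcript_id "TX_B1";',
    '11\tsrc\texon\t700\t900\t.\t-\t.\tgene_id "Q_DUP"; transcript_id "TX_C1";',
]
transcripts, loci = summarize_gtf(SAMPLE)
shared = sorted(g for g, places in loci.items() if len(places) > 1)
print(shared)
```

An empty `shared` list means every gene_id is confined to one chromosome and strand, which is what tools that group features by gene_id generally expect.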

I would suggest that this would be much more appropriate as a comment on Michael Dondrup's answer rather than an answer on its own.

written 7.4 years ago by Daniel Standage (3.9k)