Question

How To Resolve Gene-to-mRNA Relationships For The Human Genome?

10.8 years ago

Daniel Standage 4.1k

I'm interested in analyzing gene and mRNA annotations for the human genome. In addition to the exon/intron structure and CDS for each mRNA, I would like to know which mRNAs correspond to the same gene.

I first tried using genePredToGtf from Jim Kent's tools to download the annotations in GTF format, and then tried working directly with the knownGene table data (knownGene.txt.gz). However, I've encountered several issues that make it difficult to resolve gene-mRNA relationships in these data.

1. It seems the gene_id attribute is not unique: some distinct gene loci share the same gene_id value. Consider Q15005. If you run grep Q15005 knownGene.txt, you'll find a gene on chr1 and a different one on chr11 that both use this gene_id. A little digging revealed that the one on chr11 is a legitimate protein-coding gene and the one on chr1 is a pseudogene, but the knownGene data does not include this information.
2. I'm only interested in protein-coding genes. Many entries in knownGene have CDS start = CDS end, and in the GTF files these simply have no CDS features. I assume I can safely ignore all of these as pseudogenes or other non-protein-coding genes. However, as the previous example shows, some pseudogenes do have a CDS listed, and there seems to be no other way to discriminate genes that code for proteins from genes that do not.
3. Some transcripts that appear to belong to the same gene/locus have distinct gene_id values. Most (if not all) of these seem to belong to non-protein-coding genes, however. Since I'm only interested in protein-coding genes, this may not be a problem.
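To make the first two issues concrete, here is a minimal Python sketch of how they could be checked against knownGene.txt. It assumes the usual 12-column, header-less layout of the UCSC knownGene table and that the shared identifier sits in the proteinID column. The function names are made up for illustration, and the example rows (transcript names and coordinates) are invented, not real annotations.

```python
import csv
import io
from collections import defaultdict

# Column order of the UCSC knownGene table (tab-separated, no header line).
KNOWNGENE_COLUMNS = [
    "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
    "exonCount", "exonStarts", "exonEnds", "proteinID", "alignID",
]


def read_known_gene(handle):
    """Yield one dict per knownGene row."""
    reader = csv.DictReader(handle, fieldnames=KNOWNGENE_COLUMNS, delimiter="\t")
    for row in reader:
        yield row


def ids_on_multiple_chromosomes(rows):
    """Return {proteinID: [chromosomes]} for IDs seen on more than one chromosome."""
    chroms = defaultdict(set)
    for row in rows:
        if row["proteinID"]:
            chroms[row["proteinID"]].add(row["chrom"])
    return {pid: sorted(c) for pid, c in chroms.items() if len(c) > 1}


def has_cds(row):
    """A row with cdsStart == cdsEnd has no annotated coding region."""
    return int(row["cdsStart"]) != int(row["cdsEnd"])


# Invented rows for illustration only (coordinates are not real).
EXAMPLE_ROWS = [
    ["tx_a", "chr11", "+", "100", "900", "200", "800", "2", "100,500,", "300,900,", "Q15005", "tx_a"],
    ["tx_b", "chr1", "-", "100", "900", "200", "800", "1", "100,", "900,", "Q15005", "tx_b"],
    ["tx_c", "chr2", "+", "100", "900", "900", "900", "1", "100,", "900,", "", "tx_c"],
]
example_text = "\n".join("\t".join(r) for r in EXAMPLE_ROWS) + "\n"

rows = list(read_known_gene(io.StringIO(example_text)))
shared_ids = ids_on_multiple_chromosomes(rows)
no_cds = [r["name"] for r in rows if not has_cds(r)]
print(shared_ids)
print(no_cds)
```

For the real file, the same functions could be fed from gzip.open("knownGene.txt.gz", "rt"). Note that this only detects the symptoms; it cannot tell a pseudogene with a listed CDS from a real protein-coding gene.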

I must admit I'm quite surprised that this information isn't more easily accessible for the human genome. It seems like resolving gene-mRNA relationships should be an elementary task. Am I looking in the wrong place and/or using the wrong tools to look for this information?

human annotation • 3.2k views

ADD COMMENT • link updated 10.8 years ago by Devonj ▴ 90 • written 10.8 years ago by Daniel Standage 4.1k

Part of the solution to this problem is to realize that identifiers such as Q15005 are UniProt accession numbers and therefore not suitable as a "gene_id". Ensembl/BioMart is a good solution as outlined in Michael's answer.

ADD REPLY • link 10.8 years ago by Neilfws 49k

score 3 · Answer 1 · 2013-06-19

10.8 years ago

Michael 54k

How about Ensembl BioMart? Something like this query, maybe?

Btw, it seems the exported URL contains a session ID, so I can see what you are selecting ;) I think this is a bug. Just remove the session ID between martview/ and ?. The following link is the real starting point and should also work after the session expires: http://www.ensembl.org/biomart/martview?VIRTUALSCHEMANAME=default&ATTRIBUTES=....

ADD COMMENT • link 10.8 years ago by Michael 54k

+1. Adding gene and transcript biotype helps to discriminate between protein-coding and non-protein-coding genes/transcripts. I can also add exon ID, but it doesn't look like I can select exon coordinates, nor is any CDS information available. So although this query seems to solve the main problem I've been having (gene-mRNA relationships), it doesn't provide some of the other basic information.

ADD REPLY • link 10.8 years ago by Daniel Standage 4.1k

If you choose "Sequences" instead of "Structures" you can select various kinds of sequences, including "Coding sequence", as FASTA files.

ADD REPLY • link 10.8 years ago by Michael 54k

But this provides the actual sequences, not the annotation/coordinates of those sequences with respect to the genomic sequence.

ADD REPLY • link 10.8 years ago by Daniel Standage 4.1k

That being said, if I look for "Structures" attributes instead of "Features" attributes, available exon information includes exon and CDS coordinates. So I would have to download and process two separate files, but it seems like everything I need is there. Thanks!
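As a rough illustration of processing the "Features" export, the Python sketch below groups transcript IDs under their gene ID and keeps only one biotype. It assumes a tab-separated export with a header line. The column names (gene_id, transcript_id, transcript_biotype) and the IDs in the example are placeholders, since the actual header text depends on the attributes selected in BioMart.

```python
import csv
import io
from collections import defaultdict

# Placeholder column names; adjust to match the header of the real export.
GENE_COL = "gene_id"
TRANSCRIPT_COL = "transcript_id"
BIOTYPE_COL = "transcript_biotype"


def transcripts_by_gene(handle, biotype="protein_coding"):
    """Group transcript IDs by gene ID, keeping only rows of the given biotype."""
    genes = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row[BIOTYPE_COL] == biotype:
            genes[row[GENE_COL]].append(row[TRANSCRIPT_COL])
    return dict(genes)


# Invented export for illustration only.
EXAMPLE_EXPORT = "\n".join([
    "\t".join([GENE_COL, TRANSCRIPT_COL, BIOTYPE_COL]),
    "\t".join(["GENE_A", "TX_A1", "protein_coding"]),
    "\t".join(["GENE_A", "TX_A2", "protein_coding"]),
    "\t".join(["GENE_A", "TX_A3", "retained_intron"]),
    "\t".join(["GENE_B", "TX_B1", "pseudogene"]),
]) + "\n"

coding = transcripts_by_gene(io.StringIO(EXAMPLE_EXPORT))
print(coding)
```

The "Structures" export with exon and CDS coordinates could then be joined to this mapping on the transcript ID.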

ADD REPLY • link 10.8 years ago by Daniel Standage 4.1k

score 2 · Answer 2 · 2013-06-19

At our company, we were encountering many of the same issues, so we recently switched from UCSC data to the NCBI data found here: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh37.p10_top_level.gff3.gz

Or there is a more recently updated file here: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF_interim/interim_GRCh37.p12_top_level.gff3_2013-04-10.gz

I believe this is the same data that is used to create the embedded chromosome maps you see at dbSNP and NCBI Gene (e.g. http://www.ncbi.nlm.nih.gov/gene/1080 )

It contains more than just protein-coding genes and requires a bit of parsing to get it into a format like UCSC's, but for the most part it is quite good. (I could expound on a few issues we found if needed.)
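The kind of parsing involved might look like the Python sketch below, which relies only on the generic GFF3 convention that an mRNA feature points at its gene through the Parent attribute. It is not the parser we use, the function names are made up, and the example lines (IDs and coordinates) are invented rather than taken from the NCBI file.

```python
from collections import defaultdict


def parse_attributes(field):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def mrnas_by_gene(lines):
    """Return {gene ID: [mRNA IDs]} using the ID and Parent attributes."""
    gene_ids = set()
    children = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attrs = parse_attributes(cols[8])
        if cols[2] == "gene" and "ID" in attrs:
            gene_ids.add(attrs["ID"])
        elif cols[2] == "mRNA" and "ID" in attrs and "Parent" in attrs:
            # GFF3 allows several comma-separated parents.
            for parent in attrs["Parent"].split(","):
                children[parent].append(attrs["ID"])
    # Keep only parents that were themselves declared as gene features.
    return {g: children[g] for g in sorted(gene_ids) if g in children}


# Invented feature lines for illustration only.
EXAMPLE_GFF3 = [
    "##gff-version 3\n",
    "seq1\tsrc\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=EXAMPLE\n",
    "seq1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0\n",
    "seq1\tsrc\tmRNA\t150\t900\t.\t+\t.\tID=rna1;Parent=gene0\n",
    "seq1\tsrc\tgene\t2000\t2500\t.\t-\t.\tID=gene1;pseudo=true\n",
]

mapping = mrnas_by_gene(EXAMPLE_GFF3)
print(mapping)
```

For the real file, the lines could come from gzip.open(path, "rt"). Genes without any mRNA child, such as the second gene in the example, simply drop out of the mapping.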

score 1 · Answer 3 · 2013-06-19

10.8 years ago

Devon Ryan 104k

Don't use the UCSC GTF file. Download the one from Ensembl instead, as it doesn't have the non-unique gene_id issue. I understand why the UCSC annotation is the way it is, but that makes things annoying in situations like yours (or when one tries to use DEXSeq on RNAseq reads, for the same reasons).
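A minimal Python sketch of reading the gene-to-transcript relationships out of a GTF file, assuming the standard GTF attribute syntax (gene_id "..."; transcript_id "...";). The function name is made up and the example lines use invented IDs and coordinates.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def transcripts_by_gene_gtf(lines):
    """Return {gene_id: sorted transcript_ids} from GTF feature lines."""
    genes = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "gene_id" in attrs and "transcript_id" in attrs:
            genes[attrs["gene_id"]].add(attrs["transcript_id"])
    return {gene: sorted(txs) for gene, txs in genes.items()}


# Invented feature lines for illustration only.
EXAMPLE_GTF = [
    'seq1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'seq1\tsrc\tCDS\t150\t300\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n',
    'seq1\tsrc\texon\t100\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n',
    'seq2\tsrc\texon\t500\t700\t.\t-\t.\tgene_id "G2"; transcript_id "T3";\n',
]

gtf_mapping = transcripts_by_gene_gtf(EXAMPLE_GTF)
print(gtf_mapping)
```

A set is used per gene because each transcript appears on many lines (one per exon, CDS and so on).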

ADD COMMENT • link 10.8 years ago by Devon Ryan 104k

I would suggest that this would be much more appropriate as a comment on Michael Dondrup's answer rather than an answer on its own.

ADD REPLY • link 10.8 years ago by Daniel Standage 4.1k