How To Resolve The Gene To mRNAs Relationships For The Human Genome?
3
2
8.9 years ago

I'm interested in analyzing gene and mRNA annotations for the human genome. In addition to the exon/intron structure and CDS for each mRNA, I would like to know which mRNAs correspond to the same gene.

I first tried using genePredToGtf from Jim Kent's tools to download the annotations in GTF format, and then I subsequently tried working directly with the knownGene table data (knownGene.txt.gz). However, I've encountered a couple of issues that make it difficult to resolve gene-mRNA relationships in these data.

• It seems the gene_id attribute is not unique--that is, some distinct gene loci share the same gene_id value. Consider Q15005--if you run grep Q15005 knownGene.txt, you'll find a gene on chr1 and a different one on chr11 that both use this gene_id. A little digging revealed that the one on chr11 is a legitimate protein-coding gene, and the one on chr1 is a pseudogene, but the knownGene data does not include this information.
• I'm only interested in protein-coding genes. Many entries in knownGene have CDS start = CDS end, and in the GTF files these simply have no CDS features. I assume I can safely ignore all of these as pseudogenes or other non-protein-coding genes. However, as the previous example showed, some pseudogenes do have a CDS listed, and there seems to be no other way to discriminate genes that code for proteins from genes that do not.
• Some transcripts that appear to belong to the same gene/locus have distinct gene_id values. Most (if not all) of these seem to belong to non-protein-coding genes, however. Since I'm only interested in protein-coding genes, this may not be a problem.
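
For illustration, the two checks described in the list above (dropping entries with CDS start = CDS end, and spotting an identifier that appears on more than one chromosome) could be sketched roughly as follows. This assumes the usual knownGene column order (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, proteinID, alignID); the function names are made up for this sketch and are not part of any UCSC tool.

```python
from collections import defaultdict


def coding_transcripts(lines):
    """Yield (name, chrom, protein_id) for rows with a non-empty CDS.

    Assumes tab-separated knownGene rows where columns 6 and 7 (1-based)
    are cdsStart and cdsEnd, and column 11 is proteinID.
    """
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) < 7:
            continue  # skip malformed or truncated rows
        cds_start, cds_end = int(row[5]), int(row[6])
        if cds_start == cds_end:
            continue  # no CDS annotated: treated here as non-coding
        protein_id = row[10] if len(row) > 10 else ""
        yield row[0], row[1], protein_id


def ids_on_multiple_chroms(records):
    """Return {protein_id: [chroms]} for IDs seen on more than one chromosome."""
    chroms = defaultdict(set)
    for _name, chrom, protein_id in records:
        if protein_id:
            chroms[protein_id].add(chrom)
    return {pid: sorted(c) for pid, c in chroms.items() if len(c) > 1}
```

Passing an open handle on knownGene.txt to coding_transcripts and feeding the result to ids_on_multiple_chroms would list identifiers like the Q15005 case. Note that this only detects the ambiguity; it cannot tell which locus is the pseudogene.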

I must admit I'm quite surprised that this information isn't more easily accessible for the human genome. It seems like resolving gene-mRNA relationships should be an elementary task. Am I looking in the wrong place and/or using the wrong tools to look for this information?

human annotation • 2.7k views
0

Part of the solution to this problem is to realize that identifiers such as Q15005 are UniProt accession numbers and therefore not suitable as a "gene_id". Ensembl/BioMart is a good solution as outlined in Michael's answer.

3
8.9 years ago

How about Ensembl BioMart? Something like this query, maybe?

Btw, it seems the exported URL contains a session ID, so I can see what you are selecting ;) I think this is a bug. Just remove the session ID between martview/ and ?. The following link is the real starting point and should also work after the session expires: http://www.ensembl.org/biomart/martview?VIRTUALSCHEMANAME=default&ATTRIBUTES=....

0

+1. Adding gene and transcript biotype helps to discriminate between protein-coding and non-protein-coding genes/transcripts. I can also add exon ID, but it doesn't look like I can select exon coordinates, nor is any CDS information available. So although this query seems to solve the main problem I've been having (gene-mRNA relationships), it doesn't provide some of the other basic information.

0

If you choose "Sequences" instead of "Structures" you can select various kinds of sequences, including "Coding sequence", as FASTA files.

0

But this provides the actual sequences, not the annotation/coordinates of those sequences with respect to the genomic sequence.

0

That being said, if I look for "Structures" attributes instead of "Features" attributes, available exon information includes exon and CDS coordinates. So I would have to download and process two separate files, but it seems like everything I need is there. Thanks!

2
8.9 years ago
Devonj ▴ 90

At our company, we were encountering many of the same issues, so we recently switched from UCSC data to the NCBI data found here: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh37.p10_top_level.gff3.gz

Or there is a more recently updated file here: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF_interim/interim_GRCh37.p12_top_level.gff3_2013-04-10.gz

I believe this is the same data that is used to create the embedded chromosome maps you see at dbSNP and NCBI Gene (e.g. http://www.ncbi.nlm.nih.gov/gene/1080 )

It covers more than just protein-coding genes, and requires a bit of parsing to get it into a format like UCSC's, but for the most part it is quite good. (I could expound on a few issues we found if needed.)
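
As a rough idea of the parsing involved, here is a minimal sketch that groups mRNA features under their parent gene using the ID and Parent attributes defined by the GFF3 format. It assumes each mRNA row's Parent points directly at a gene ID, which is how I would expect these files to be laid out but is worth checking; the function names are only for illustration.

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_attributes(field):
    """Parse GFF3 column 9 into {key: [values]}.

    Values are split on commas before URL-decoding, as GFF3 escapes
    literal commas inside values.
    """
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = [unquote(v) for v in value.strip().split(",")]
    return attrs


def gene_to_mrnas(lines):
    """Return {parent_id: [mRNA IDs]} for every mRNA feature in the input."""
    mrnas = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        mrna_id = attrs.get("ID", [""])[0]
        for parent in attrs.get("Parent", []):
            mrnas[parent].append(mrna_id)
    return dict(mrnas)
```

Since only features of type mRNA are collected, genes with no mRNA children (e.g. those with only non-coding transcripts) simply do not appear in the result.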

0

+1 I was intending to generate GFF3 files anyway, so that's a bonus. Discriminating between protein-coding genes and other genes looks straightforward. There are 4 or 5 mRNAs with issues (overlapping or adjacent exons), but these are easy to ignore.

1
8.9 years ago

Don't use the UCSC GTF file; download the one from Ensembl instead, as it doesn't have the non-unique gene_id issue. I understand why the UCSC annotation is the way it is, but that makes things annoying in situations like yours (or when one tries to use DEXSeq on RNA-seq reads, for the same reasons).
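
A minimal sketch of pulling the gene-to-transcript mapping out of such a GTF, assuming the attribute column carries quoted gene_id and transcript_id values and that the biotype is available as a gene_biotype attribute (this depends on the Ensembl release, so treat it as an assumption; pass biotype=None to skip the filter). The function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def transcripts_by_gene(lines, biotype="protein_coding"):
    """Return {gene_id: sorted transcript_ids} from GTF lines."""
    genes = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "gene_id" not in attrs or "transcript_id" not in attrs:
            continue
        # Assumption: biotype is stored in a gene_biotype attribute.
        if biotype and attrs.get("gene_biotype") != biotype:
            continue
        genes[attrs["gene_id"]].add(attrs["transcript_id"])
    return {gene: sorted(tx) for gene, tx in genes.items()}
```

Because every exon and CDS row repeats both IDs, collecting them into a set per gene is enough to recover the mapping without tracking feature types.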

0

I would suggest that this would be much more appropriate as a comment on Michael Dondrup's answer rather than an answer on its own.