Datasource For Human Generic Exons?
13.2 years ago

I wish to compare how sets of values differ on average along human genes. My fear is that I may get artefacts based on splice variants, namely exons not present in all transcripts.

Does anyone know of a dataset of human exons present in all currently known transcripts of their associated gene, or have code already written to calculate this? Ideally I need their NCBI36 genomic coordinates as well.

EDIT:

I got a response back from the nice folks at the Ensembl help desk. It seems that this data is both computed and available.

[From BioMart] Attributes will be on the "Structures" page. Expand EXON and choose "constitutive exon". In the results table, "1" means yes and "0" means no (not constitutive).

They also confirmed that "constitutive" in this instance:

"is based on transcripts for one particular gene (in a species)."

exon transcript • 4.4k views
ADD COMMENT
0

Hmm...I tried this with some well-studied genes but am not getting any constitutive exons--I get all zeros. Can you show me a sample gene that has what you are looking for? I must be structuring the query wrong. This is an interesting question, btw.

ADD REPLY
0


From the BioMart web site I downloaded the latest human data, getting the attributes Ensembl ID, transcript ID and constitutive exon. This is not computed for NCBI36 in the archive, so I will use liftOver to convert coordinates. Some numbers: 1071242 exons, of which 73891 are constitutive. Those that have any constitutive exon: 16643 genes and 20743 transcripts.
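
Counts like these can be tallied from a BioMart export with a few lines of Python. This is only a sketch: it assumes a tab-separated file with a header row, and the column names `gene_id`, `transcript_id`, `exon_id` and `is_constitutive` are placeholders of mine, not the headers BioMart actually writes. The exon ID column is also an assumption, since the download described above only lists gene, transcript and the constitutive flag.

```python
import csv
import io

def summarize(handle):
    """Count exons, constitutive exons, and the genes/transcripts containing one.

    Assumes a tab-separated export with the (hypothetical) columns
    gene_id, transcript_id, exon_id and is_constitutive ("1" or "0").
    """
    exons, constitutive, genes, transcripts = set(), set(), set(), set()
    for row in csv.DictReader(handle, delimiter="\t"):
        exons.add(row["exon_id"])
        if row["is_constitutive"] == "1":
            constitutive.add(row["exon_id"])
            genes.add(row["gene_id"])
            transcripts.add(row["transcript_id"])
    return {
        "exons": len(exons),
        "constitutive": len(constitutive),
        "genes": len(genes),
        "transcripts": len(transcripts),
    }

# Tiny made-up export, just to show the call.
demo = io.StringIO(
    "gene_id\ttranscript_id\texon_id\tis_constitutive\n"
    "G1\tT1\tE1\t1\n"
    "G1\tT2\tE1\t1\n"
    "G1\tT2\tE2\t0\n"
    "G2\tT3\tE3\t0\n"
)
print(summarize(demo))
```

For a real download, open the file with `open(path, newline="")` and rename the keys to match its header line.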

ADD REPLY
4
13.2 years ago

Your answer cannot be calculated from a list of exon start-stop positions for a canonical sequence. What you are asking is, "if there are 5 known exons for a gene, which exons are always transcribed?". "Always" is the sticking point. If exons 1,2,3,5 are usually transcribed, but 4 is a cassette exon which is often excised, there are alternate transcription start sites in 1, and an exon 6 is reported between 2 and 3 in testes but not liver, the answer starts getting pretty complicated. This is a realistic example.

I don't have a complete answer for you, but you might want to start at the EBI Alternative Splicing Database: http://www.ebi.ac.uk/asd.

ADD COMMENT
0

Thanks, that is a much better phrasing of my problem... and it opens the can of worms I hoped to gloss over. I am primarily interested in differences between 5' and 3' expression ratios. Maybe I can still get away with just using the Ensembl exon definitions and ignoring exons with a low transcript count. I think Ensembl uses ASD in their pipeline.
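
The "ignore exons with a low transcript count" idea could be sketched roughly as below. The `min_fraction` cutoff and the function names are illustrative choices of mine, and the input is assumed to be a mapping from transcript ID to exon IDs for a single gene.

```python
from collections import Counter

def exon_usage(transcripts):
    """Fraction of a gene's transcripts that contain each exon.

    transcripts: {transcript_id: iterable of exon ids} for ONE gene.
    """
    counts = Counter(
        exon for exons in transcripts.values() for exon in set(exons)
    )
    total = len(transcripts)
    return {exon: n / total for exon, n in counts.items()}

def well_supported(transcripts, min_fraction=0.8):
    """Exons present in at least min_fraction of the transcripts (arbitrary cutoff)."""
    usage = exon_usage(transcripts)
    return {exon for exon, frac in usage.items() if frac >= min_fraction}

# Toy gene modelled on the example above: exon 4 is a cassette exon.
gene = {
    "t1": [1, 2, 3, 4, 5],
    "t2": [1, 2, 3, 5],
    "t3": [1, 2, 3, 5],
}
print(well_supported(gene))
```

Setting `min_fraction=1.0` reduces this to the strict "present in every known transcript" definition.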

ADD REPLY
4
13.2 years ago

David is on the right track. It looks like the (now-defunct) successor to ASD called ASTD may have what you need: ftp://ftp.ebi.ac.uk/pub/databases/astd/current_release/human/9606_events.gff.gz

In this GFF file, each exon from Ensembl Human v41 on NCBI 36 is classified as:

  • Cassette Exon
  • Intron Isoform
  • Exon Isoform
  • Mutually Exclusive

It does not seem to have constitutive exons, which is what you are looking for, but subtracting the exon IDs in this file from the Ensembl Human v41 exon IDs may give you what you want.
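
A rough sketch of that subtraction in Python follows. I have not checked how the ASTD file stores exon identifiers, so this simply assumes that Ensembl exon IDs (`ENSE` plus 11 digits) appear somewhere in the ninth GFF column; the sample line and the helper names are made up for illustration.

```python
import re

# Ensembl human exon stable IDs look like ENSE00000000001.
EXON_ID = re.compile(r"ENSE\d{11}")

def exon_ids_in_gff(lines):
    """Collect every Ensembl exon ID found in column 9 of GFF lines.

    ASSUMPTION: the events file mentions exon IDs in the attributes column.
    """
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:
            ids.update(EXON_ID.findall(fields[8]))
    return ids

def candidate_constitutive(all_exon_ids, gff_lines):
    """Exon IDs that are NOT involved in any annotated splicing event."""
    return set(all_exon_ids) - exon_ids_in_gff(gff_lines)

# Made-up data, only to show the call.
all_ids = ["ENSE00000000001", "ENSE00000000002", "ENSE00000000003"]
gff = ["chr1\tASTD\tcassette_exon\t100\t200\t.\t+\t.\texon_id=ENSE00000000002\n"]
print(sorted(candidate_constitutive(all_ids, gff)))
```

For the real file, the lines could come from `gzip.open(path, "rt")`, and the full exon ID list would have to be taken from the matching Ensembl v41 release.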

More information on the procedures used to generate this dataset can be found here. Interestingly, a section on AEdb in this doc mentions something that supports your concern:

Statistical analysis of the dataset shows the length of constitutive exons follows a normal distribution; the distribution of alternative exons is skewed toward smaller ones.

ADD COMMENT
0

Thanks. This looks like what I was after. I'll ask the Ensembl help desk whether there is a non-defunct equivalent.

ADD REPLY
2
13.2 years ago
Mary 11k

Hmmm...constitutive exons. I don't know that anyone has done that. I also don't know if it's a valid assumption. As much data as we have today (at least for human, even less for other species), we don't have a grasp of that many developmental time points (especially in human), or of other situations like wound repair, gender-specific issues, etc. Also, what we have in the databases may be from cultured cells, often cancer-based or cancer-like to enable continuous growth.

But: I know the Map Viewer tool at NCBI has a display of the gene that I sometimes like to consider. They compress all known exons to a single display--like a summary. I actually think seeing the values across a summary diagram like that might be informative. But maybe not what you want.

Another exon assessment tool I always really liked was Model Maker at NCBI to get a sense of what the exon patterns are. I hope this link works, but not sure: http://www.ncbi.nlm.nih.gov/projects/mapview/modelmaker.cgi?taxid=9606&contig=NT_010718.16&gene=TP53 If you can see that, you'll see that some are partial too.

I'll keep thinking about that.

EDIT: this is an interesting discussion of constitutive exons: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2077895/ So they had a dataset that might be informative for you.

ADD COMMENT
1
13.2 years ago

This information is available from the UCSC MySQL database (available for download here):

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18  -e 'select * from knownGene as K ,kgXref as X where  K.name=X.kgId limit 5\G'

*************************** 1. row ***************************
       name: uc001aaa.2
      chrom: chr1
     strand: +
    txStart: 1115
      txEnd: 4121
   cdsStart: 1115
     cdsEnd: 1115
  exonCount: 3
 exonStarts: 1115,2475,3083,
   exonEnds: 2090,2584,4121,
  proteinID: 
    alignID: uc001aaa.2
       kgID: uc001aaa.2
       mRNA: BC032353
       spID: 
spDisplayID: 
 geneSymbol: BC032353
     refseq: 
    protAcc: 
description: Homo sapiens cDNA FLJ36366 fis, clone THYMU2007824.
*************************** 2. row ***************************
       name: uc009vip.1
      chrom: chr1
     strand: +
    txStart: 1115
      txEnd: 4272
   cdsStart: 1115
     cdsEnd: 1115
  exonCount: 2
 exonStarts: 1115,2475,
   exonEnds: 2090,4272,
  proteinID: 
    alignID: uc009vip.1
       kgID: uc009vip.1
       mRNA: AX748260
       spID: 
spDisplayID: 
 geneSymbol: AX748260
     refseq: 
    protAcc: 
description: Homo sapiens cDNA FLJ36366 fis, clone THYMU2007824.
*************************** 3. row ***************************
       name: uc009vjg.1
      chrom: chr1
     strand: +
    txStart: 19417
      txEnd: 20957
   cdsStart: 19417
     cdsEnd: 19417
  exonCount: 3
 exonStarts: 19417,20426,20838,
   exonEnds: 19902,20530,20957,
  proteinID: 
    alignID: uc009vjg.1
       kgID: uc009vjg.1
       mRNA: BC048429
       spID: 
spDisplayID: 
 geneSymbol: BC048429
     refseq: 
    protAcc: 
description: Homo sapiens cDNA clone IMAGE:5275617, **** WARNING: chimeric clone ****.
*************************** 4. row ***************************
       name: uc001aal.1
      chrom: chr1
     strand: +
    txStart: 58953
      txEnd: 59871
   cdsStart: 58953
     cdsEnd: 59871
  exonCount: 1
 exonStarts: 58953,
   exonEnds: 59871,
  proteinID: Q8NH21
    alignID: uc001aal.1
       kgID: uc001aal.1
       mRNA: NM_001005484
       spID: Q8NH21
spDisplayID: OR4F5_HUMAN
 geneSymbol: OR4F5
     refseq: NM_001005484
    protAcc: NP_001005484
description: olfactory receptor, family 4, subfamily F,
*************************** 5. row ***************************
       name: uc009vjh.1
      chrom: chr1
     strand: +
    txStart: 55424
      txEnd: 59692
   cdsStart: 58953
     cdsEnd: 59691
  exonCount: 3
 exonStarts: 55424,55751,58899,
   exonEnds: 55436,55834,59692,
  proteinID: Q52R92
    alignID: uc009vjh.1
       kgID: uc009vjh.1
       mRNA: AY972817
       spID: Q52R92
spDisplayID: Q52R92_HUMAN
 geneSymbol: OR4F5
     refseq: NM_001005484
    protAcc: NP_001005484
description: olfactory receptor, family 4, subfamily F,
ADD COMMENT
0

I may be missing something here, but is this not just the exon start/stop positions? I know my answer can be calculated from them, but it seems like a question that should have been looked at before.

ADD REPLY
0

The exon start/end positions are in exonStarts and exonEnds, and yes, you would have to calculate the answer :-)
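
One way to do that calculation, sketched in Python: parse the comma-terminated `exonStarts`/`exonEnds` strings and intersect the exon sets of all transcripts that share a gene. Grouping by `geneSymbol` and requiring identical start and end coordinates are simplifying assumptions of mine; the function names are illustrative only.

```python
def parse_exons(exon_starts, exon_ends):
    """Turn UCSC strings such as '1115,2475,' into a set of (start, end) tuples."""
    starts = [int(s) for s in exon_starts.split(",") if s]
    ends = [int(e) for e in exon_ends.split(",") if e]
    return set(zip(starts, ends))

def constitutive_exons(rows):
    """rows: (geneSymbol, chrom, strand, exonStarts, exonEnds), one per transcript.

    Returns {(geneSymbol, chrom, strand): exons shared by ALL its transcripts}.
    ASSUMPTION: transcripts of one gene share a geneSymbol, and an exon only
    counts as shared when both boundaries match exactly.
    """
    shared = {}
    for gene, chrom, strand, starts, ends in rows:
        key = (gene, chrom, strand)
        exons = parse_exons(starts, ends)
        shared[key] = exons if key not in shared else shared[key] & exons
    return shared

# The two OR4F5 transcripts from the query output above.
rows = [
    ("OR4F5", "chr1", "+", "58953,", "59871,"),
    ("OR4F5", "chr1", "+", "55424,55751,58899,", "55436,55834,59692,"),
]
print(constitutive_exons(rows))
```

For these two OR4F5 rows the result is an empty set: the last exons overlap (58953-59871 versus 58899-59692) but their boundaries differ, so exact matching does not treat them as the same exon. An overlap-based comparison would be needed to count them as shared.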

ADD REPLY
