6.7 years ago by
Washington University School of Medicine, St. Louis, USA
In ALEXA-seq, a work that is now arguably deprecated by newer tools, we tried to classify features as 'expressed above background noise levels' as follows (refer to the ALEXA-seq manuscript and supplementary materials for more details):
- We identified thousands of negative control intergenic regions of varying size throughout the genome. These regions were defined by subtracting out known or predicted genes as well as regions with any evidence of expression from mRNAs in genbank or ESTs in dbEST.
- From the set of candidate negative controls, we chose a subset that is most representative of real genes with respect to size and GC content.
- Using these as negative controls, we chose the 95th percentile of their expression values as an estimate of the background noise that you might see from any region, regardless of whether it was really expressed. In other words, a cutoff that has a 'rationale' behind it.
- For splicing analysis, the problem is more complex. Say you have some evidence for expression of an intron or novel exon within a known gene. This region may have the same level of noise as any region in the genome. However, it will also have additional noise from expression actually occurring at that locus. You will have unprocessed RNA in your sample that will increase noise in all introns. You will also have stochastic splicing errors. These sources of noise will be correlated with expression level. The more actively transcribed the region, the higher the noise levels. Thus a single cutoff for all loci is inadvisable. For that reason we again chose negative control features, within genes this time, that again have no prior evidence of being expressed in known databases. We then plotted the expression of these controls against expression of the gene they reside within (see Supplementary Figure 5 for an example). We then fit a linear model to that data and used it to derive a sliding background noise cutoff on a gene-by-gene basis. That way a novel exon within a highly expressed locus has to pass a higher bar to be considered real than one in a lowly expressed locus.
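To make the two cutoffs described above concrete, here is a minimal Python sketch. It is not the ALEXA-seq code: the function names, the toy expression values, and the choice to use the fitted line itself as the per-gene cutoff (floored at the genome-wide intergenic cutoff) are all assumptions made for illustration. The real implementation may work on a different scale or add a margin above the fit, so check the manuscript and the linked scripts for the actual details.

```python
def percentile(values, pct):
    """Percentile using linear interpolation between closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sliding_cutoff(gene_expression, slope, intercept, floor):
    """Per-gene noise cutoff: the fitted value, never below the global floor.

    Flooring at the intergenic cutoff is an assumption of this sketch.
    """
    return max(floor, slope * gene_expression + intercept)


# Toy data (hypothetical): expression of intergenic negative control regions.
intergenic_controls = [i / 10 for i in range(21)]
background_cutoff = percentile(intergenic_controls, 95)

# Toy data (hypothetical): expression of each host gene, paired with the
# expression of a silent intragenic negative control inside that gene.
host_gene_expression = [2.0, 4.0, 8.0, 10.0]
intragenic_control_expression = [1.0, 2.0, 4.0, 5.0]
slope, intercept = fit_line(host_gene_expression, intragenic_control_expression)

# A novel exon in a highly expressed gene has to clear a higher bar.
cutoff_low_gene = sliding_cutoff(2.0, slope, intercept, background_cutoff)
cutoff_high_gene = sliding_cutoff(10.0, slope, intercept, background_cutoff)
```

In this sketch a candidate novel exon or retained intron would be called expressed only if its own expression exceeds the cutoff computed for its host gene, so the same raw signal can pass in a quiet locus and fail in a highly transcribed one.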
If you want to dig into some of the code that implemented these concepts including the code to generate Supplementary Figure 5, you can look here: summarizeExpressionValues, alternativeExpressionDatabase
Related manuscript: Pubmed | Full text | PDF | Supplementary Information | GEO (GSE23776) | News and Views
For a review of tools related to RNA-seq expression and splicing analyses, you might refer to these posts:
Recommended Tools For Alternative Splicing Detection From Rna-Seq Data
Best Approach To Predict Novel And Alternative Splicing Events From Rna-Seq Data
Is There A "Gold Standard Rnaseq Data" To Compare With Various Rnaseq Tools For Differential Expression Analysis