I have several lanes of paired-end Illumina RNA-Seq data from mouse, but in some lanes fewer than 20% of reads map to known exons, which suggests that much of the remaining 80% is contamination by genomic DNA (rather than cDNA derived from RNA). At the other extreme, the "best" sample has almost 80% of reads mapping to known exons.
What is a typical fraction of genomic DNA contamination in an RNA-Seq dataset? Can I do anything useful with a lane of RNA-Seq in which 80% of the reads aren't RNA-derived? What about 50%, 30%, or 20%? I was hoping to use these data to study alternative splicing, but I assume the genomic reads would cause many false-positive calls of intron retention and of alternative 3' and 5' splice sites. Could I still study other types of splicing events, such as exon skipping and cassette exons, since these events would produce read pairs with long apparent insert lengths when mapped to the genome, which would not be confused with genomic DNA?