I am looking at RNA-seq data, with which I have little experience. I notice that for many genes, there are reliable alignments (i.e. with high mapping quality) to introns. I understand that some of them are due to unannotated transcripts, but in many regions this does not seem to be the major cause. The intronic read hits do not seem to be caused purely by alignment artifacts either, because the pattern is tissue-specific (though this is not compelling evidence). Another possible explanation is that this observation is due to noisy transcripts (Pickrell et al., 2010), but the effect seems large: for some long genes, there are far more intronic hits than exonic hits.
I guess those who study RNA-seq data must have noticed the intronic hits for years. What is the cause of the large number of intronic read hits? Is it alignment/library prep artifacts or noisy transcription? Are there papers addressing this? Thanks.
EDIT: my conclusion. I was looking at ERR030882 from Illumina BodyMap (brain). The sample was processed with oligo-dT. I am using the GENCODE exon annotations, including all the pseudogenes, lincRNAs and known processed transcripts, totalling ~112 Mbp. The initial analysis reveals that ~80% of bases map to exons. Nonetheless, if I only look at read pairs with insert size larger than 311 bp (~10% of the original data), 98.2% of these spliced read pairs map to known exons, suggesting that the vast majority of the intronic and intergenic read pairs are unspliced. It is possible that some unspliced pairs come from unknown single-exon transcripts with an intact polyA tail, but contamination seems to be the leading cause overall.
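
For anyone who wants to repeat this kind of check, here is a minimal sketch of the calculation in plain Python. It is not the script I actually ran: the function names (`merge_intervals`, `exonic_bases`, `exonic_fraction`) and the toy coordinates are made up for illustration, everything is on a single chromosome, and "insert size" is taken to mean the genomic span of the pair. Only the 311 bp threshold comes from the analysis above. In practice the aligned blocks and insert sizes would be read from a BAM file and the exons from the annotation.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def exonic_bases(block, exons, starts):
    """Count bases of one aligned block [start, end) covered by merged exons."""
    b_start, b_end = block
    covered = 0
    # Start from the last exon that begins at or before the block start.
    i = max(bisect_right(starts, b_start) - 1, 0)
    while i < len(exons) and exons[i][0] < b_end:
        overlap = min(b_end, exons[i][1]) - max(b_start, exons[i][0])
        covered += max(0, overlap)
        i += 1
    return covered


def exonic_fraction(pairs, exons, min_insert=0):
    """Fraction of aligned bases inside exons, using only read pairs
    whose insert size (genomic span of the pair) exceeds min_insert."""
    exons = merge_intervals(exons)
    starts = [s for s, _ in exons]
    total = in_exon = 0
    for insert_size, blocks in pairs:
        if insert_size <= min_insert:
            continue
        for block in blocks:
            total += block[1] - block[0]
            in_exon += exonic_bases(block, exons, starts)
    return in_exon / total if total else float("nan")


# Toy data (hypothetical coordinates, one chromosome).
toy_exons = [(100, 200), (150, 250), (1000, 1100)]
toy_pairs = [
    (150, [(120, 170), (220, 270)]),    # mostly exonic, runs 20 bp past the exon
    (150, [(400, 450), (500, 550)]),    # unspliced pair inside the intron
    (890, [(180, 230), (1020, 1070)]),  # pair spanning the intron (spliced)
]

all_fraction = exonic_fraction(toy_pairs, toy_exons)
large_insert_fraction = exonic_fraction(toy_pairs, toy_exons, min_insert=311)
```

Overlapping exons are merged first so that no base is counted twice. Comparing `all_fraction` with `large_insert_fraction` mirrors the comparison above between all read pairs and the large-insert (presumably spliced) subset.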