Until recently, we have used a poly(A) selection process to prepare our RNA-Seq libraries. In our last run we had to use a ribo-depletion approach instead, as we want to study some formalin-fixed (FF) material with degraded RNA. The facility uses Illumina's Ribo-Zero kit. We otherwise kept the same sequencing parameters: paired-end 75 bp, reverse-stranded, on an Illumina HiSeq 4000.
Since we don't know how well the FF material represents the original tissue, we also sequenced a few frozen tissue samples, with the intention of comparing the two (though they are _not_ perfectly matched). In total we have 3 FF samples and 2 frozen samples.
I ran the reads through my usual pipeline:
- FastQC: all looked OK. There were some highly duplicated sequences, probably rRNA-associated, but nothing too major.
- STAR alignment: ~90% of reads mapped uniquely in all cases (similar to our poly(A) samples).
I had STAR produce gene counts during alignment. The results differed from what I've typically seen in the poly(A) data in terms of the percentage of reads assigned to a single (unique) gene:
- Poly(A): we usually get 80-85%
- Ribo-depleted FF samples: 24%, 24%, 26%
- Ribo-depleted frozen samples: 58%, 59%
So in both cases the percentage assigned is far lower than for poly(A), and this is especially bad for the FF samples. Most of the reads that were not assigned fell in the 'no feature' category, i.e. they didn't overlap with any exon.
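For anyone wanting to reproduce these percentages, here is a minimal sketch of how they can be derived from STAR's `ReadsPerGene.out.tab` (written with `--quantMode GeneCounts`). It assumes the standard four-column layout: the first four rows are `N_unmapped`, `N_multimapping`, `N_noFeature` and `N_ambiguous`, and the last column holds the counts for reverse-stranded libraries. The function name and the toy table are made up for illustration, and using the sum of all rows as the denominator is my assumption, not necessarily how every pipeline reports it.

```python
import csv
import io

# Column index in ReadsPerGene.out.tab for each library type:
# 1 = unstranded, 2 = read 1 matches the RNA strand,
# 3 = read 2 matches the RNA strand (reverse-stranded libraries).
STRAND_COLUMN = {"unstranded": 1, "forward": 2, "reverse": 3}


def summarise_reads_per_gene(handle, strandedness="reverse"):
    """Return the percentage of reads in each category for one sample.

    Categories are the four N_* summary rows plus 'assigned', which is the
    sum of the per-gene counts. Percentages are relative to the sum of all
    rows in the chosen column (an assumption about the denominator).
    """
    col = STRAND_COLUMN[strandedness]
    counts = {"assigned": 0}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        value = int(row[col])
        if row[0].startswith("N_"):
            counts[row[0]] = value
        else:
            counts["assigned"] += value
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads found in the chosen column")
    return {name: 100.0 * value / total for name, value in counts.items()}


# Toy table with invented numbers, only to show the file layout.
toy = io.StringIO(
    "N_unmapped\t10\t10\t10\n"
    "N_multimapping\t20\t20\t20\n"
    "N_noFeature\t30\t60\t40\n"
    "N_ambiguous\t5\t1\t4\n"
    "geneA\t20\t5\t16\n"
    "geneB\t15\t4\t10\n"
)
toy_summary = summarise_reads_per_gene(toy, strandedness="reverse")
for name, pct in toy_summary.items():
    print(f"{name}: {pct:.1f}%")
```

For a real sample, the same function can be called on an open file handle, e.g. `with open("ReadsPerGene.out.tab") as fh: summarise_reads_per_gene(fh)`.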
It occurs to me that this difference is probably due to the larger variety of RNA species: poly(A) selection should enrich primarily for mRNA, while ribo-depletion leaves in ncRNA and other species. Therefore fewer reads will come from mRNA and fall within an exon for gene counting purposes. I ran ezBAMqc to check the distribution of the aligned reads in the BAM files:
So this seems to agree with my hypothesis: the FF library is dominated by species other than mRNA.
Does this sound like a reasonable explanation? And is it still reasonable to compare the gene counts of the FF and frozen samples? I would normalise for the total number of reads, but is that sufficient?
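To be concrete about what I mean by normalising for the total number of reads, here is a rough sketch of a counts-per-million style scaling. The function name and the toy counts are made up for illustration. The open question is the denominator: if I divide by all sequenced reads, the FF samples (with ~25% of reads assigned) would be scaled down relative to the frozen ones (~58%) across the board, whereas dividing by the reads assigned to genes would not have that effect.

```python
def counts_per_million(gene_counts, library_size=None):
    """Scale raw gene counts to counts per million.

    gene_counts: dict mapping gene ID -> raw read count.
    library_size: denominator to use; defaults to the sum of the gene
    counts (i.e. reads assigned to genes). Pass the total number of
    sequenced reads instead to normalise by sequencing depth.
    """
    if library_size is None:
        library_size = sum(gene_counts.values())
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return {gene: count * 1e6 / library_size for gene, count in gene_counts.items()}


# Invented counts for one hypothetical sample.
toy_counts = {"geneA": 50, "geneB": 150}

# Denominator = reads assigned to genes (200 here).
cpm_assigned = counts_per_million(toy_counts)

# Denominator = a hypothetical total of 1000 sequenced reads.
cpm_total = counts_per_million(toy_counts, library_size=1000)

print(cpm_assigned)
print(cpm_total)
```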
Thanks for any thoughts.
In response to cpad0112's comment: I dug out a similar plot for one of our poly(A) samples (below). The percentage of intronic reads is indeed much lower.