Question: RNA-seq: Adjust library size using total uniquely mappable tags or only reads that map to genes analyzed?
5.9 years ago by
United States
Jason900 wrote:



I'm using edgeR to analyze RNA-seq data and I have a question about library size adjustment. There are essentially two ways to adjust for differences in library size between the RNA-seq samples. One is to use only the counts from genes that are in the R matrix I'm analyzing. The other is to set each library size to the total number of uniquely mappable reads, which would include counts that are not being analyzed in the matrix but are still part of the experiment (e.g., reads that map to intergenic regions).

The difference between the two library sizes is very small (less than 5%), and the results (logFC among significant genes) are almost identical in my case (R^2 > 0.99). I can make a case for either version: on the one hand, I'm only analyzing reads that are in the matrix (so I should use the column sum for each sample in the matrix); on the other hand, it seems intuitive to use the total uniquely mappable counts per library, since even though those reads are not in the matrix, they were identified in the experiment.

What does the community think is most appropriate? Or does it not really matter in my case, since the outcomes are so similar? Would it be best to choose one and briefly mention the other if this were in a paper?



edger rna-seq • 5.0k views
modified 5.9 years ago by Devon Ryan97k • written 5.9 years ago by Jason900
It's not only about intergenic reads but also about intronic ones. I'd imagine that normalizing by total uniquely mappable reads (including intronic) is a bad idea if you are comparing samples that differ with respect to the amount of incompletely spliced mRNA.
written 5.9 years ago by Christian2.9k
5.9 years ago by
Devon Ryan97k
Freiburg, Germany
Devon Ryan97k wrote:

You usually won't get much of a difference either way. The only benefit of actually using the total uniquely mapped reads is that if you have counts inflated due to DNA contamination then perhaps that'd be handled slightly better. Realistically, I doubt it'd make any actual difference, since no one uses pure library-size normalization for anything (the edgeR and DESeq2 methods are superior) and the contamination should be essentially even across genes (i.e., it will be a scaling factor that would already be incorporated into the normalization). I imagine that single-cell sequencing might be an exception to this, since there you're using spike-ins for normalization.
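
To see why the choice rarely matters, here is a minimal sketch of plain library-size scaling (not edgeR's actual model, which also applies TMM factors and fits a GLM). All counts and library sizes are hypothetical, made up for illustration. Under simple per-library scaling, switching from genic to total uniquely mapped library sizes moves every gene's logFC by the same constant, so the two sets of logFCs stay perfectly correlated.

```python
import math

def log2_fold_change(count_a, count_b, lib_a, lib_b):
    """log2 ratio of library-size-scaled counts, sample A over sample B."""
    return math.log2((count_a / lib_a) / (count_b / lib_b))

# Hypothetical raw counts for three genes in samples A and B.
counts = {"geneA": (100, 200), "geneB": (1500, 1400), "geneC": (30, 90)}

# Hypothetical library sizes (sample A, sample B).
genic = (9_500_000, 11_400_000)   # column sums of the count matrix
total = (10_000_000, 12_500_000)  # all uniquely mapped reads

lfc_genic = {g: log2_fold_change(a, b, *genic) for g, (a, b) in counts.items()}
lfc_total = {g: log2_fold_change(a, b, *total) for g, (a, b) in counts.items()}

# Per-gene difference between the two choices of library size.
shifts = [lfc_total[g] - lfc_genic[g] for g in counts]

# The difference depends only on the library sizes, not on the gene.
expected_shift = math.log2((genic[0] / genic[1]) / (total[0] / total[1]))

for g in counts:
    print(g, round(lfc_genic[g], 3), round(lfc_total[g], 3))
print("constant shift:", round(expected_shift, 3))
```

The shift is zero whenever the genic fraction is the same in both samples, and it only becomes noticeable when the samples differ in how many reads fall outside the counted genes.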

written 5.9 years ago by Devon Ryan97k

On second thought, DNA contamination would be an offset rather than a scaling, which the typical normalization mechanisms wouldn't fully compensate for. Having said that, if these are large effects then you have to wonder if any sort of normalization would really help, since there are likely MANY other differences to worry about (sequencing is cheap, just toss a sample or two if they're crap).
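
A rough sketch of the offset-versus-scaling distinction, using made-up numbers and assuming, for simplicity, that contamination adds the same number of reads to every gene (i.e., genes of equal length). Dividing by library size undoes a multiplicative change exactly, but an additive offset distorts the proportions, inflating low-count genes the most.

```python
def proportions(counts):
    """Scale counts by their library size (the sum of the counts)."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical true counts for four genes spanning a range of expression.
true_counts = [10, 100, 1000, 10000]

# Deeper sequencing of the same sample: a pure scaling.
scaled = [3 * c for c in true_counts]

# Assumed DNA contamination: 50 extra reads per gene, an additive offset.
offset = [c + 50 for c in true_counts]

base = proportions(true_counts)

# Ratio of normalized value to the true normalized value, per gene.
ratio_scaled = [s / b for s, b in zip(proportions(scaled), base)]
ratio_offset = [o / b for o, b in zip(proportions(offset), base)]

print("scaling:", [round(r, 3) for r in ratio_scaled])
print("offset: ", [round(r, 3) for r in ratio_offset])
```

With these numbers the scaled sample recovers the true proportions for every gene, while the contaminated sample overstates the lowest-count gene several-fold and slightly understates the highest.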

written 5.9 years ago by Devon Ryan97k