Question

RNA-seq: Adjust library size using total uniquely mappable tags or only reads that map to genes analyzed?

1

9.4 years ago

Jason ▴ 920

Hello,

I'm using edgeR to analyze RNA-seq data and have a question about library size adjustment. There are essentially two ways to adjust for differences in library size between RNA-seq samples. One is to use only the counts assigned to genes, which are the counts in the R matrix I'm analyzing. The other is to set each library size to the total number of uniquely mappable reads, which includes reads that are part of the experiment but are not in the matrix (i.e. reads that map to intergenic regions).

The two library sizes differ by less than 5%, and the results (logFC among significant genes) are almost identical in my case (R^2 > 0.99). I can make a case for either version: on the one hand, I'm only analyzing reads that are in the matrix, so I should use the column sum for each sample; on the other hand, it seems intuitive to use the total uniquely mappable counts per library, since those reads were observed in the experiment even though they are not in the matrix.
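To make the comparison concrete, here is a minimal Python sketch of the arithmetic. It is not edgeR itself: the gene names, counts, and non-gene read totals are made up, and it uses plain counts-per-million rather than edgeR's model. Under those assumptions, switching the library size from the matrix column sum to the total uniquely mapped reads shifts every gene's log2 fold change by the same constant, so the two sets of logFC values stay perfectly correlated.

```python
import math

# Hypothetical gene-level counts for two samples: gene -> (sample 1, sample 2).
counts = {
    "geneA": (500, 1000),
    "geneB": (2000, 2100),
    "geneC": (50, 40),
}
# Hypothetical uniquely mapped reads that fall outside the analyzed genes.
non_gene_reads = (120, 150)


def log2_fold_changes(counts, lib_sizes):
    """Return log2(CPM in sample 2 / CPM in sample 1) for each gene."""
    size1, size2 = lib_sizes
    result = {}
    for gene, (c1, c2) in counts.items():
        cpm1 = c1 / size1 * 1e6
        cpm2 = c2 / size2 * 1e6
        result[gene] = math.log2(cpm2 / cpm1)
    return result


# Option 1: library size = column sum of the count matrix.
matrix_sizes = tuple(sum(c[i] for c in counts.values()) for i in range(2))
# Option 2: library size = all uniquely mapped reads (genes + non-gene reads).
total_sizes = tuple(m + n for m, n in zip(matrix_sizes, non_gene_reads))

lfc_matrix = log2_fold_changes(counts, matrix_sizes)
lfc_total = log2_fold_changes(counts, total_sizes)

# The per-gene difference between the two options is one constant:
# log2 of the ratio of the two library-size ratios.
expected_shift = math.log2(
    (matrix_sizes[1] / matrix_sizes[0]) / (total_sizes[1] / total_sizes[0])
)

for gene in counts:
    shift = lfc_total[gene] - lfc_matrix[gene]
    print(f"{gene}: matrix={lfc_matrix[gene]:.4f} "
          f"total={lfc_total[gene]:.4f} shift={shift:.6f}")
```

The shift is zero when the non-gene fraction is identical in both samples and grows as that fraction diverges between samples, which is when the choice of library size would start to matter.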

What does the community think is most appropriate? Or does it not really matter in my case, since the outcomes are so similar? Would it be best to choose one and briefly mention the other if this were in a paper?

Thanks

edgeR RNA-Seq • 6.3k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by Jason ▴ 920

0

It's not only about intergenic reads but also about intronic ones. I'd imagine that normalizing by total uniquely mappable reads (including intronic) is a bad idea if you are comparing samples that differ with respect to the amount of incompletely spliced mRNA.

ADD REPLY • link 9.4 years ago by Christian ★ 3.0k

score 1 · Answer 1 · 2014-12-26

1

9.4 years ago

Devon Ryan 104k

You usually won't get much of a difference either way. The only benefit of actually using the total uniquely mapped reads is that if you have counts inflated due to DNA contamination then perhaps that'd be handled slightly better. Realistically, I doubt it'd make any actual difference, since no one uses pure library-size normalization for anything (the edgeR and DESeq2 methods are superior) and the contamination should be essentially even across genes (i.e., it will be a scaling factor that would already be incorporated into the normalization). I imagine that single-cell sequencing might be an exception to this, since there you're using spike-ins for normalization.

ADD COMMENT • link 9.4 years ago by Devon Ryan 104k

0

On second thought, DNA contamination would be an additive offset rather than a scaling factor, which the typical normalization methods wouldn't fully compensate for. Having said that, if these are large effects then you have to wonder whether any sort of normalization would really help, since there are likely MANY other differences to worry about (sequencing is cheap, just toss a sample or two if they're crap).
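The distinction between a scaling and an offset can be sketched with a few lines of Python. This is only an illustration under stated assumptions: the counts are made up, the normalization is a plain division by library size (not TMM or median-of-ratios), and the contamination is assumed to add the same number of reads to every gene, ignoring gene length.

```python
# Hypothetical "true" expression counts for four genes in one sample.
true_counts = [100, 400, 1500, 8000]


def proportions(counts):
    """Divide each count by the library size (the sum of all counts)."""
    total = sum(counts)
    return [c / total for c in counts]


# Multiplicative effect: every gene is inflated by the same factor.
scaled = [c * 1.2 for c in true_counts]
# Additive effect: the same number of contaminating reads lands on every gene
# (an assumption made for illustration only).
offset = [c + 50 for c in true_counts]

baseline = proportions(true_counts)
after_scaling = proportions(scaled)
after_offset = proportions(offset)

for name, values in [("baseline", baseline),
                     ("scaled", after_scaling),
                     ("offset", after_offset)]:
    print(name, [round(v, 4) for v in values])
```

Dividing by library size cancels the common factor exactly, so the scaled sample matches the baseline. The additive case does not cancel: low-count genes gain relative abundance and high-count genes lose it, so a single per-sample factor cannot restore the original proportions.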

ADD REPLY • link 9.4 years ago by Devon Ryan 104k