Question: How To Handle Replicates With Huge Differences In Number Of Reads?
Asked 8.4 years ago by Rayna:

Hey everyone,

I have a question regarding the way to handle Illumina single-end RNA-seq data. I have a few samples, each with 2 biological replicates. Nothing extraordinary so far, except for one sample: one of its replicates generated a low number of reads (~13 million), whereas the average for all the other cases is about double that (27-30 million). So GATC reran this particular sample, but in a very weird way -- my guess is, alone in one lane -- which yielded ~150 million reads...

This is of course far too many and thus introduces a discrepancy in the data. I am running out of ideas on how to handle it, so I'd really appreciate it if someone could help.

Thanks a lot in advance :)

next-gen analysis rna-seq • 2.7k views
— written 8.4 years ago by Rayna; modified 8.1 years ago by Rm

Why not work with 20% of your new data? Assuming there is no specific bias in the new run, it doesn't seem like a bad idea.

— Leonor Palmeira, 8.4 years ago

Strong assumption! ;-)

— Manu Prestat, 8.4 years ago

Thanks for your suggestion. Hmm, if you pick them at random, this should be kind of acceptable. I had thought of picking ~15 million at random, but it sounded very arbitrary...

— Rayna, 8.4 years ago
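For what it's worth, picking reads at random (rather than, say, just the first ~30 million) avoids any positional bias in the file. A minimal sketch of reproducible reservoir sampling for single-end FASTQ (the helper names here are made up for illustration; in practice a tool like `seqtk sample` does the same job much faster):

```python
import io
import random

def fastq_records(handle):
    """Yield one read at a time as a 4-line (header, seq, plus, qual) tuple."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return
        yield tuple(lines)

def subsample_fastq(records, k, seed=0):
    """Reservoir-sample k reads uniformly at random from an iterable of records.

    A fixed seed makes the subsample reproducible across runs.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir
```

Reservoir sampling reads the file once and never holds more than k records in memory, which matters when the input is 150 million reads.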
Answer by Stefano Berri (Cambridge, UK), 8.4 years ago:

I am not an expert on RNA-seq, but I guess you should normalise the data so that the signal takes into account the overall number of reads (like RPKM). The difference will be in the internal (technical) variation: the more reads, the more accurate. However, it is likely that most of the variation comes from the biological replicates, independent of the number of reads you have. You could randomly select a percentage of the reads, but that would mean losing precious data. You could do it as a check that the results are consistent, but it should not be a major problem.

What is the "discrepancy" in the data?

— Stefano Berri, 8.4 years ago

Thanks for your suggestions.

By "discrepancy", I meant the fact that I can have up to 5 times more reads for one of the replicates (150 million) than for the other (~30 million).

RPKM gives an estimate of the quantities, but if you want to use DESeq/edgeR, you need the raw counts, not RPKM values. Hence my question, since I use DESeq for the analysis.

— Rayna, 8.4 years ago
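DESeq indeed works on raw counts and absorbs library-size differences through median-of-ratios size factors. A minimal pure-Python sketch of that calculation (illustration only, not the actual DESeq code; in R you would just call `estimateSizeFactors`):

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors, as in DESeq's normalisation.

    counts: dict mapping sample name -> list of raw gene counts,
    with the same gene order in every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Pseudo-reference: log geometric mean of each gene across samples.
    # Genes with a zero count in any sample are skipped.
    log_ref = []
    for g in range(n_genes):
        vals = [counts[s][g] for s in samples]
        if all(v > 0 for v in vals):
            log_ref.append(sum(math.log(v) for v in vals) / len(vals))
        else:
            log_ref.append(None)
    # Each sample's size factor is the median ratio to the reference.
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - log_ref[g]
                      for g in range(n_genes) if log_ref[g] is not None]
        factors[s] = math.exp(median(log_ratios))
    return factors
```

Because the size factor is a median over genes, a library with 5x the sequencing depth simply gets a ~5x larger size factor, and the normalised counts line up; the statistical test then accounts for the remaining count-level uncertainty.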
Answer by HMZ Gheidan, 8.4 years ago:

In principle, negative-binomial based tests such as that of DESeq and edgeR should be able to deal even with vastly different library sizes. Of course, the huge sample will not add much power because you have so many reads only on one side of the comparison.

If you do a binomial test, it does not matter how you deal with it because the result will be wrong anyway. (See the numerous earlier threads on why Poisson-based tests, and that includes the binomial test, are inadmissible because they ignore biological variability.)

— HMZ Gheidan, 8.4 years ago

This is something I hadn't thought about, thanks! So using DESeq should be OK :) (even without down-sampling, as proposed by Rm below)

— Rayna, 8.4 years ago
Answer by Rm (Danville, PA), 8.4 years ago:

DESeq and edgeR should take care of it. If not, try down-sampling your FASTQ files.

— Rm, 8.4 years ago


Powered by Biostar version 2.3.0