Question

RNA-seq: Should duplicates be removed from all samples of the same experiment even though some do not have technical duplicates?

0

5.8 years ago

salamandra ▴ 550

1- The graph of % duplication vs read coverage suggests that RNA-seq sample 1 has technical duplicates, but can we say the same for sample 2 and sample 3?

2- If we remove duplicates in one sample (because the graph of % duplication vs read coverage suggests there are technical duplicates), should we also remove them in all the other samples of the same experiment, even though those samples do not seem to have technical duplicates? Or is it 'unfair' to compare samples with and without duplicates?

3- In the graphs above, what values (high/low) of Int, SI, the vertical green line and the vertical red line would indicate that a sample has technical duplicates?

RNA-Seq duplication dupRadar • 3.6k views

updated 5.7 years ago by h.mon 35k • written 5.8 years ago by salamandra ▴ 550

1

The question of duplicate removal in RNA-seq has been extensively discussed here multiple times. The short answer is: No, do not remove any duplicates. For the long answer, please use the search function and scan the answers.

5.8 years ago by ATpoint 81k

1

Be very careful about messing with "duplicates". It looks like you don't want to heed the notes from the last thread, "mRNA-seq quality report (fastQC): Does it mean samples have adapters and should remove duplicates?".

Unless you have UMIs (unique molecular identifiers that label each original RNA fragment), it is going to be impossible to determine whether you really have PCR duplication.
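To illustrate why UMIs matter here, below is a minimal Python sketch. It is not the method of any particular tool: the `Read` record, the toy reads and both counting functions are hypothetical names made up for this example. It assumes each read carries an error-free UMI, whereas real UMI-aware deduplication also has to tolerate sequencing errors in the UMI itself.

```python
from collections import namedtuple

# Hypothetical minimal read record: mapping coordinates plus the UMI tag.
Read = namedtuple("Read", ["chrom", "pos", "strand", "umi"])

def count_position_only(reads):
    """Count reads after collapsing everything that maps to the same position.

    Without UMIs this is all that can be done, and independent fragments
    that happen to share a start position are collapsed too.
    """
    return len({(r.chrom, r.pos, r.strand) for r in reads})

def count_umi_aware(reads):
    """Count reads after collapsing only those sharing position AND UMI.

    Reads at the same position with different UMIs are treated as distinct
    original fragments (assumes error-free UMIs, which is a simplification).
    """
    return len({(r.chrom, r.pos, r.strand, r.umi) for r in reads})

# Toy data: four reads at one position, coming from three original fragments.
reads = [
    Read("chr1", 1000, "+", "AAAA"),
    Read("chr1", 1000, "+", "AAAA"),  # same UMI as above: likely a PCR copy
    Read("chr1", 1000, "+", "CCGT"),  # different UMI: a separate fragment
    Read("chr1", 1000, "+", "GGTA"),  # different UMI: a separate fragment
]

print(count_position_only(reads))  # 1
print(count_umi_aware(reads))      # 3
```

In this toy case, position-only collapsing keeps one read and throws away two genuine fragments, while the UMI-aware version removes only the one PCR copy.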

5.8 years ago by GenoMax 141k

0

Yes, but then I read the link you had provided and understood that, at least for the samples where the duplication level vs reads per kilobase graph was 'ugly', we could infer that most of the duplicates were technical. As some of my samples had those 'ugly' graphs, my question was whether we could remove duplicates from those samples and still compare them with the 'good' ones, or whether we should just discard those samples.

5.8 years ago by salamandra ▴ 550

score 0 · Answer 1 · 2018-08-06

1) Sample 1 seems really bad, with lots of technical duplicates; sample 2 seems fine; and sample 3 seems to have a moderate to high level of technical duplicates.

2) The "Libraries can contain technical duplication" post is very clear in advising against removing duplicates from RNA-seq data, and especially against removing duplicates from samples with different levels of technical duplication:

[...] to completely remove the duplication. The problem with this approach is that it isn’t able to distinguish biological from technical duplication and both are removed. In samples where an even read coverage is expected and the depth of sequencing hasn’t come close to saturating this then this is a reasonable approach, but in samples with variable read densities this will have the effect of capping the maximum read density able to be obtained, and limiting the dynamic range able to be obtained. If multiple samples with different amounts of technical duplication are deduplicated in this way then you will actually introduce differences where they don’t exist.

3) I have no idea.
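As a rough illustration of the capping effect described in the passage quoted under point 2, here is a small Python simulation. It is only a sketch under simplifying assumptions: single-end reads, start positions drawn uniformly along a gene, one strand, and no technical duplicates at all, so every read removed is a biological one. The function names and the gene length and read counts are made up for the example.

```python
import random

def simulate_gene(n_reads, gene_length, rng):
    """Draw read start positions uniformly along a gene.

    All reads here come from independent fragments, i.e. there is no
    technical duplication in this toy data.
    """
    return [rng.randrange(gene_length) for _ in range(n_reads)]

def dedup_by_position(starts):
    """Number of reads left after keeping one read per start position."""
    return len(set(starts))

rng = random.Random(0)   # fixed seed so the run is repeatable
gene_length = 1000       # hypothetical gene length in bases

low = simulate_gene(50, gene_length, rng)     # lowly expressed gene
high = simulate_gene(5000, gene_length, rng)  # highly expressed gene

raw_ratio = len(high) / len(low)
dedup_ratio = dedup_by_position(high) / dedup_by_position(low)

print("raw counts:   ", len(low), len(high), "ratio", raw_ratio)
print("after dedup:  ", dedup_by_position(low), dedup_by_position(high),
      "ratio", round(dedup_ratio, 1))
```

The highly expressed gene can never keep more reads than it has start positions (1000 here), so its count is capped while the lowly expressed gene is almost untouched, and the apparent fold difference between the two shrinks even though no technical duplicates were present.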

An additional consideration: in my experience, when samples subjected to the same treatment have different levels of technical duplicates, it points to problems with RNA extraction or library preparation, meaning the technical duplicates are just the observable side of a deeper problem. If you follow up with downstream analyses like gene quantification and differential expression, then in a PCA the samples with a high proportion of technical duplicates will not cluster together with the other samples from the same treatment, but will probably be scattered all over.