Big difference in estimated duplicate reads between forward and reverse of paired-end RNA-seq
2.9 years ago
Eric Lim ★ 2.0k

Our routine QC procedures include using fastqc to estimate duplicate reads. Some recently added datasets caught my attention: a subset of these samples have wildly different estimated duplication levels between the two ends (R1 vs. R2). What could be the issue here?
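For anyone who wants to cross-check the per-file numbers outside of fastqc, here is a minimal sketch of an exact-match duplication estimate on a single FASTQ file. It is not FastQC's algorithm: FastQC's own module works on a sample of the file and treats long reads differently, so the values will not match exactly. The function names and the `prefix_len=50` default are illustrative assumptions.

```python
import gzip
from collections import Counter
from itertools import islice


def read_sequences(path):
    """Yield the sequence line of every record in a 4-line-per-record FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of each 4-line record.
        yield from (line.strip() for line in islice(handle, 1, None, 4))


def duplication_percent(sequences, prefix_len=50):
    """Percentage of reads that are exact copies of an earlier read.

    Reads are compared on their first `prefix_len` bases only
    (an assumption made here to keep long reads comparable).
    """
    counts = Counter(seq[:prefix_len] for seq in sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every read beyond the first occurrence of a sequence counts as a duplicate.
    return 100.0 * (total - len(counts)) / total
```

Running `duplication_percent(read_sequences("sample_R1.fastq.gz"))` and the same on the R2 file (file names are hypothetical) gives two numbers that can be compared side by side, independent of any downstream processing.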

A related post: High level of duplicate in one reads of paired-end data

Dup   R1       R2
1     73.30%   38.50%
2     72.50%   42.80%
3     72.40%   40.00%
4     71.90%   40.60%
5     71.90%   39.50%
RNA-Seq duplication fastqc

Is there anything special about these samples on the wet-lab side? Which kit was used, and what species is this?


Are these from patterned flowcells (HiSeq 4000/NovaSeq)?


It's a relatively old dataset from a HiSeq 2000. While patterned flowcells tend to generate more dupes, I'm not sure what would be so special about R2.

Added: We looked into other fastq parameters (GC content, adapters, overrepresented sequences, etc.) and post-alignment metrics (mapping, skewness, insert sizes, etc.); everything else seems normal compared to the rest of the samples in the experiment.


Are read numbers identical for R1/R2? There's an off chance that some reads in R2 were filtered out by some processing step.
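A quick way to check this is to count records in both files. The sketch below assumes standard 4-line FASTQ records (no wrapped sequence lines); the function names are made up for illustration.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly 4 lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        # A remainder suggests a truncated file or wrapped sequence lines.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def mates_match(r1_path, r2_path):
    """Return (r1_count, r2_count, counts_equal) for a pair of FASTQ files."""
    n1, n2 = count_fastq_reads(r1_path), count_fastq_reads(r2_path)
    return n1, n2, n1 == n2
```

Equal counts only rule out reads being dropped from one file; they say nothing about whether the mates are still in the same order.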


Is there a quality drop in R2? I've seen large differences in duplicate estimates when R2 is of much lower quality than R1. The lower duplicate estimate is then just an artifact: sequencing errors make reads that are true duplicates no longer match exactly.
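To illustrate the effect, here is a small toy simulation (not based on the actual data): the same library, with every template present four times, is "sequenced" once without errors and once with random substitutions, and exact-match duplication is computed for each. All parameters (template count, read length, 1% error rate) are arbitrary assumptions.

```python
import random


def duplication_percent(reads):
    """Percentage of reads that are exact copies of another read in the list."""
    return 100.0 * (len(reads) - len(set(reads))) / len(reads)


def add_errors(read, error_rate, rng):
    """Substitute each base with a different base with probability `error_rate`."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate(n_templates=2000, copies=4, read_len=50,
             error_rates=(0.0, 0.01), seed=0):
    """Return {error_rate: duplication percentage} for the same toy library."""
    rng = random.Random(seed)
    templates = ["".join(rng.choice("ACGT") for _ in range(read_len))
                 for _ in range(n_templates)]
    # Every template appears `copies` times, i.e. the true duplication is fixed.
    library = [t for t in templates for _ in range(copies)]
    return {
        rate: duplication_percent([add_errors(r, rate, rng) for r in library])
        for rate in error_rates
    }
```

With four copies per template the error-free run is 75% duplicated by construction; the run with errors comes out lower even though the underlying library is identical, which is the kind of R1/R2 gap a noisier R2 could produce.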

BBMap has a nice feature which can help in your situation, the mhist parameter:

mhist=<file>            Histogram of match, sub, del, and ins rates by


@ATpoint, @genomax, and @h.mom

The data are from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47966 and the discrepancy in estimated duplicates was found in 6 of the 12 human RNA-seq samples. Given all the parameters I've looked at, my best guesstimate is that these 12 samples were sequenced in 2 batches. Initially, they probably had n=1 for each of the 6 developmental time points, but reviewers probably asked for replicates, so they sequenced again and published n=2 as technical replicates. I assume the RNA might have degraded a bit in the meantime. Although HiSeq 2000 was reported as the platform, the second batch might have been sequenced on a different platform.

@h.mom, the 2nd batch produced noticeably lower quality overall, but the quality difference in R1 and R2 seems insignificant. The design for this experiment is 101bp forward and 99bp reverse. I'll try what you suggested to see if BBMap will shed some light.

@genomax, no. While we do some pre-processing internally before alignment, QC was run before that filtering. I also manually checked the read numbers and they match.

Whatever it is, I've decided to drop these 6 samples for now. Other than to satisfy my own curiosity, figuring out what causes the discrepancy is not a priority at the moment.


Hi @ericlim. Were you able to find out the reason behind this discrepancy? I also have data like this (although it's selective whole genome amplification). R1 has more duplicates than R2 in fastqc, both pre- and post-trimming.


No clue. I actually haven't thought much about this since I wrote the summary. We've substantially improved our QC pipeline. Will ask the team to include these samples back and see if we notice anything new.


Thanks. I'll be waiting for your input. No one has explained this behaviour so far.