Question: High level of duplicates in one read of paired-end data
olifds (United States) wrote:


We are doing some transcriptomic analysis on bovine immune blood cells and we seem to have a problem with high levels of duplicates in our data. Our libraries were prepared with the Illumina TruSeq Stranded kit.

First, we sequenced 50 bp single-end to test our libraries and found a high level of duplicated reads (~80%) and around 2% poly-A and poly-T stretches. We thought the problem was our libraries, so we sent the RNA to our sequencing facility so they could do all the work (except RNA extraction).

So the sequencing facility did the library preparation and the sequencing. To be sure that our data would be usable, we sequenced these datasets as 100 bp paired-end (3 samples in total). To our surprise, the high level of duplicates we saw in our first sequencing experiment was back, but only in one read of the pair, and in the same read for all three samples. The other read had around 10% duplicates in each sample.

Since our RNA seems OK when tested on an Agilent Bioanalyzer (RIN > 9) and the libraries were prepared by a sequencing facility we trust, I'm here to ask: what could be wrong with our data?

Thanks a lot!

Tags: rna-seq, paired-end, duplicate
Modified 4.7 years ago by Devon Ryan • written 4.7 years ago by olifds
Devon Ryan (Freiburg, Germany) wrote:

There's quite possibly nothing wrong with your data/libraries. One expects a good bit of duplication in RNA-seq datasets (>50% is pretty common). If you didn't deplete rRNA, your duplication rate could be much higher still. It is, however, somewhat odd that only one read in a pair showed such a high duplication rate while the other did not. Are the duplicated sequences rRNA or tRNA?

Written 4.7 years ago by Devon Ryan

In fact, we did an rRNA depletion with the Ribo-Zero kit and got an insignificant number of rRNA reads in our data. So we sequenced not just tRNA but everything except rRNA.

Maybe I wasn't clear enough: the high level of duplicates we see is in the dataset of all the #2 reads (same Illumina primer) of the experiment. The dataset of all the first reads looks fine, with 10% duplicates and no overrepresented sequences.

When analysed as a whole, the paired-end data show a 30% duplication level.


Reply written 4.7 years ago by olifds
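
For anyone wanting to check the per-read numbers discussed in this thread outside of a QC tool, here is a minimal Python sketch. It assumes two FASTQ files with the mates in the same order, and it defines duplication simply as 1 minus the fraction of distinct sequences. That is a simplification, not necessarily how any particular QC program estimates it. The function names are made up for illustration, and everything is held in memory, so it is better suited to a subsample than to a full lane.

```python
import gzip
from collections import Counter


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record is the sequence
                yield line.strip()


def duplicate_fraction(items):
    """Return 1 - (distinct items / total items); 0.0 for empty input."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - len(counts) / total


def paired_report(r1_path, r2_path):
    """Duplication for read 1, read 2, and for (read 1, read 2) pairs.

    Assumption: both files list the mates in the same order. zip() stops
    silently at the shorter file, so check the read counts beforehand.
    """
    r1 = list(read_sequences(r1_path))
    r2 = list(read_sequences(r2_path))
    return {
        "read1": duplicate_fraction(r1),
        "read2": duplicate_fraction(r2),
        "pairs": duplicate_fraction(zip(r1, r2)),
    }
```

Usage would look like `paired_report("sample_R1.fastq.gz", "sample_R2.fastq.gz")`, where both file names are hypothetical. Under this exact-match definition a pair only counts as a duplicate when both mates match another pair, so the pair-level value cannot exceed the lower of the two per-read values.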