Question

High level of duplicates in one read of paired-end data

9.8 years ago

olifds ▴ 20

Hi,

We are doing some transcriptomic analysis on bovine immune blood cells and we seem to have a problem with high levels of duplicates in our data. Our libraries were prepared with the Illumina TruSeq Stranded kit.

First, we sequenced 50 bp single-end to test our libraries and we found a high level of duplicated reads (~80%) and around 2% of A and T stretches. We thought the problem was our libraries, so we sent the RNA to our sequencing facility so they could do all the work (except RNA extraction).

So the sequencing facility did the library preparation and the sequencing. To be sure that our data would be usable, we sequenced these datasets as 100 bp paired-end (3 samples in total). To our surprise, the high level of duplicates we saw in our first sequencing experiment was back, but only in one read, and the same read for all three samples. The other read had around 10% duplicates in each sample.

Since our RNA seems OK when tested on an Agilent Bioanalyzer (RIN > 9) and the libraries were prepared by a sequencing facility we trust, I'm here to ask you what could be wrong with our data?

Thanks a lot!
Olivier.

duplicate paired-end RNA-Seq • 3.9k views

Updated 2.4 years ago by Ram 43k • written 9.8 years ago by olifds ▴ 20

Ram · Answer 1 · 2014-07-21

9.8 years ago

Devon Ryan 104k

There's quite possibly nothing wrong with your data/libraries. One expects a good bit (>50% is pretty common) of duplication in RNAseq datasets. Particularly if you didn't rRNA deplete, your duplication rate could be much higher. It is, however, somewhat odd that only one read in a pair showed such a high duplication rate where the other did not. Is this an rRNA or tRNA sequence?
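One way to approach that question is to list the most frequent sequences in the read 2 file and look them up. The following is a minimal sketch, assuming a duplicate means an exact sequence repeat (the QC tool that reported the figures in this thread may count differently); the function names and the toy records are made up for illustration.

```python
from collections import Counter
from itertools import islice


def read_sequences(fastq_lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    # Line 2 of every 4-line record holds the bases.
    for line in islice(fastq_lines, 1, None, 4):
        yield line.strip()


def duplication_summary(sequences, top_n=5):
    """Return (duplicate_fraction, most_common) for an iterable of reads."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0, []
    # Every read beyond the first copy of a sequence counts as a duplicate.
    duplicate_fraction = 1.0 - len(counts) / total
    return duplicate_fraction, counts.most_common(top_n)


# Toy records standing in for a read 2 file (hypothetical data).
toy_fastq = [
    "@r1", "ACGTACGT", "+", "IIIIIIII",
    "@r2", "ACGTACGT", "+", "IIIIIIII",
    "@r3", "TTTTTTTT", "+", "IIIIIIII",
    "@r4", "ACGTACGT", "+", "IIIIIIII",
]
fraction, top = duplication_summary(read_sequences(toy_fastq))
print(f"duplicate fraction: {fraction:.2f}")
for seq, n in top:
    print(n, seq)
```

For a real file, the list can be replaced by an open handle, for example `gzip.open(path, "rt")` for a compressed FASTQ. The top sequences can then be searched against rRNA, tRNA, or adapter sequences to see what dominates.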


In fact, we did an rRNA depletion with the Ribo-Zero kit and we got an insignificant number of rRNA reads in our data. So we sequenced not just tRNA but everything except rRNA.

Maybe I wasn't clear enough. The high level of duplicates we see is in the datasets of all the #2 reads (same Illumina primer) of the experiment. The datasets of all the first reads are fine, with 10% duplicates and no overrepresented sequences.

When analysed as a whole, the paired-end data show a 30% level of duplicates.
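How a combined figure relates to the per-read figures depends on how the duplicates are counted. Here is a small sketch on made-up reads, again assuming a duplicate is an exact sequence repeat, which is not necessarily how the numbers above were obtained. It compares three ways of counting: each mate on its own, both mates together as a pair, and all reads pooled as if they were single-end.

```python
def duplicate_fraction(items):
    """Fraction of items that are exact repeats of an earlier item."""
    items = list(items)
    if not items:
        return 0.0
    return 1.0 - len(set(items)) / len(items)


# Hypothetical mates: read 1 is diverse, read 2 is highly repetitive.
read1 = ["AAAC", "AAAG", "AAAT", "AACA", "AAAC"]
read2 = ["GGGG", "GGGG", "GGGG", "GGGG", "GGGG"]

r1_rate = duplicate_fraction(read1)
r2_rate = duplicate_fraction(read2)
# A pair is a duplicate only if both mates repeat together.
pair_rate = duplicate_fraction(zip(read1, read2))
# Pooling treats every mate as an independent single-end read.
pooled_rate = duplicate_fraction(read1 + read2)

print(f"read 1: {r1_rate:.2f}")
print(f"read 2: {r2_rate:.2f}")
print(f"pairs:  {pair_rate:.2f}")
print(f"pooled: {pooled_rate:.2f}")
```

Under this definition the pair-level fraction can never exceed the fraction of either mate, while a pooled count lands somewhere between the two mates. Alignment-based duplicate marking is a different calculation again and can give yet another number.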

Updated 2.4 years ago by Ram 43k • written 9.8 years ago by olifds ▴ 20