We are doing transcriptomic analysis on bovine immune blood cells and we seem to have a problem with high levels of duplicate reads in our data. Our libraries were prepared with the Illumina TruSeq Stranded kit.
First, we sequenced 50 bp single-end to test our libraries and found a high level of duplicated reads (~80%) and around 2% of A and T stretches. We thought the problem was our libraries, so we sent the RNA to our sequencing facility so they could do all the work (except RNA extraction).
So the sequencing facility prepared the libraries and did the sequencing. To be sure that our data would be usable, we sequenced these libraries 100 bp paired-end (3 samples in total). To our surprise, the high level of duplicates we saw in our first sequencing experiment was back, but only in one read of the pair, and it was the same read for all three samples. The other read had around 10% duplicates in each sample.
Since our RNA seems OK when tested on an Agilent Bioanalyzer (RIN > 9) and the libraries were prepared by a sequencing facility we trust, I'm here to ask: what could be wrong with our data?
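For reference, here is a minimal sketch of how a per-file duplicate level and the share of reads with long A or T stretches could be estimated, so that read 1 and read 2 can be compared the same way. It assumes uncompressed FASTQ with exactly four lines per record and counts only exact sequence duplicates; the function names and the `min_run` threshold are illustrative choices, and QC tools may compute their duplication figures differently.

```python
from collections import Counter
from typing import Iterable, Iterator


def read_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd of every 4) from FASTQ text lines."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def duplicate_fraction(seqs: Iterable[str]) -> float:
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(seqs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct sequence contributes one "original"; the rest are duplicates.
    return 1.0 - len(counts) / total


def homopolymer_fraction(seqs: Iterable[str], bases: str = "AT", min_run: int = 20) -> float:
    """Fraction of reads containing a run of at least min_run identical bases from `bases`."""
    total = 0
    hits = 0
    for seq in seqs:
        total += 1
        # min_run is an assumed threshold, not a value from any specific tool.
        if any(base * min_run in seq for base in bases):
            hits += 1
    return hits / total if total else 0.0
```

Running both functions on the read 1 file and on the read 2 file of each sample (for example with `open(path)` passed to `read_sequences`) would show whether the difference between the two reads holds up with a simple exact-match count.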
Thanks a lot!