Question

Illumina Mate Pair Read Duplication Level

12.1 years ago

abc ▴ 40

Hi,

Can anyone please share some known statistics on read duplication levels in Illumina mate-pair libraries? We have a lane of HiSeq 8 kb mate-pair reads (200 million genomic reads, 100 bp). FastQC shows ~95% duplicated reads and CLC shows ~92%, which indicates an extremely high level of duplication! When we pointed this out to our sequence providers, they said 80-90% duplication is common in 8 kb mate-pair libraries. Is that really the case? We understand read duplication can be high in mate-pair libraries. However, if it is in the 80-90% range, are the remaining 10-20% of unique reads of any help for projects like de novo assembly (scaffolding, closing gaps, etc.)? Please shed some light if you've faced or seen issues like this.

Cheers.

illumina qc • 8.1k views

updated 11.7 years ago by Biomonika (Noolean) 3.2k • written 12.1 years ago by abc ▴ 40

score 2 · Answer 1 · 2013-06-11

My lab has only just started doing mate-pair sequencing (although 3 kb, not 8 kb) using the Nextera prep kit. The duplication level you describe is higher than what we've seen, and with 90% duplication it does seem to be a huge waste of money. Whether the unique fraction is any help to you depends on what kind of coverage you're left with. If you're looking to scaffold contigs, the number of reads you need is surprisingly low. Previously in our lab we tried mate-pair sequencing with the IonTorrent PGM on a simple metagenome (~20 bacterial and archaeal species) and found that even with ~1-7x coverage of the genomes (depending on the community member) we were able to order contigs (with a lot of manual inspection of the links). That kind of low-coverage data, though, is probably not going to close gaps or establish the complete genome sequence with certainty.
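As a rough illustration of "what kind of coverage you're left with", here is a minimal sketch of the arithmetic. The read count, read length, duplication rate and insert size come from the question; the genome size is a purely hypothetical placeholder, and the function name `unique_coverage` is made up for this example. It also assumes the 200 million figure counts individual reads (not pairs) and that duplicates are removed as whole pairs at the reported rate, which is a simplification.

```python
def unique_coverage(total_reads, duplication_rate, read_length, insert_size, genome_size):
    """Estimate what is left after removing duplicates.

    Returns (unique_reads, sequence_coverage, span_coverage).
    Sequence coverage counts sequenced bases; span (physical) coverage
    counts the bases spanned by each mate pair's insert.
    """
    unique_reads = total_reads * (1.0 - duplication_rate)
    unique_pairs = unique_reads / 2.0  # two reads per mate pair
    sequence_coverage = unique_reads * read_length / genome_size
    span_coverage = unique_pairs * insert_size / genome_size
    return unique_reads, sequence_coverage, span_coverage


if __name__ == "__main__":
    # 200 million 100 bp reads, 92% duplication, 8 kb inserts (from the question).
    # genome_size = 1e9 is an assumption for illustration only.
    reads, seq_cov, span_cov = unique_coverage(200e6, 0.92, 100, 8000, 1e9)
    print(f"unique reads: {reads:.0f}")
    print(f"sequence coverage: {seq_cov:.2f}x, span coverage: {span_cov:.1f}x")
```

For the hypothetical 1 Gb genome this works out to roughly 1.6x sequence coverage but about 64x span coverage, which is why a small unique fraction can still be usable for scaffolding even if it will not close gaps.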

score 2 · Answer 2 · 2013-06-11

I am sure most labs have faced similar issues. We use SOLiD platforms, but we have faced similar issues, since most platforms have similar protocols for mate-pair libraries. For mate-pair libraries you should start with a fairly large amount of DNA, or you will end up needing extra amplification, which causes high duplication rates.

score 1 · Answer 3 · 2013-06-11

12.1 years ago

Gabriel R. ★ 2.9k

It probably means that you had a low number of initial sequences. Was the circularization successful for a sufficient number of DNA fragments? We ran mate-pair libraries on a MiSeq here and had "high" duplication rates, but nothing like what you mentioned.

score 1 · Answer 4 · 2013-06-11

12.1 years ago

Chris Whelan ▴ 590

We saw similar duplication levels in mate-pair libraries when we tried them, although that was a couple of years ago. Even in what was left that wasn't duplicates, there were many "innie" read pairs with short insert sizes. Supposedly the Nextera prep helps a lot with mate pair libraries.

score 1 · Answer 5 · 2013-07-06

Thanks everyone for your replies. I thought it was timely to give you an update. We were informed by the service providers that: 1) small-insert mate-pair libraries, e.g. 1 kb, 2 kb, 3 kb (compared to 8 kb, 10 kb, 20 kb, etc.), produce fewer duplicates; 2) use of Illumina's Nextera kit would reduce the read duplication level.

So I was excited when two days ago we received new lanes of mate-pair reads using 3 kb, 5 kb and 8 kb inserts and Nextera kits. Ridiculously, all of them have a read duplication level >90%!

Can anyone please help me find some published docs/stats, or share lab experience, on this issue? Also, is there any consensus on what an acceptable level of duplication is?

Ram · Answer 6 · 2013-10-17

I believe you should not rely purely on the FastQC report. I recently saw data where the largest proportion of the reads were unique and FastQC still reported a 95% duplication level, mostly because of a large number of sequences that were present 10 or more times.

From the FastQC manual:

To cut down on the memory requirements for this module only sequences which occur in the first 200,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level. To cut down on the amount of information in the final plot any sequences with more than 10 duplicates are placed into the 10 duplicates category - so it's not unusual to see a small rise in this final category. If you see a big rise in this final category then it means you have a large number of sequences with very high levels of duplication.

Because the duplication detection requires an exact sequence match over the whole length of the sequence any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. Even so, longer reads are more likely to contain sequencing errors which will artificially increase the observed diversity and will tend to underrepresent highly duplicated sequences.
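To make the quoted description concrete, here is a minimal sketch of that kind of estimate. It is not FastQC's actual code: it only follows the steps in the manual excerpt above (register new sequences only among the first 200,000 reads, keep counting those to the end of the file, truncate reads over 75 bp to 50 bp), and it reports a simple "1 minus distinct/total" figure, which is an assumption about how to summarise the counts. The names `estimate_duplication` and `fastq_sequences` are made up for this example, and the reader assumes a plain uncompressed 4-line-per-record FASTQ file.

```python
from collections import Counter


def fastq_sequences(path):
    """Yield the sequence line of each record in a 4-line FASTQ file."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:
                yield line.strip()


def estimate_duplication(reads, track_first=200_000, truncate_over=75, truncate_to=50):
    """Return the estimated fraction of duplicated reads (0.0 to 1.0)."""
    counts = Counter()
    for index, seq in enumerate(reads):
        if len(seq) > truncate_over:
            seq = seq[:truncate_to]  # long reads are compared on a prefix only
        if index < track_first:
            counts[seq] += 1  # new sequences are registered only early in the file
        elif seq in counts:
            counts[seq] += 1  # later reads only update already-tracked sequences
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Fraction of tracked reads that are extra copies of a tracked sequence.
    return 1.0 - len(counts) / total
```

Usage would look like `estimate_duplication(fastq_sequences("reads.fastq"))`, where the file name is a placeholder. Because this compares each read on its own, it says nothing about whether both mates of a pair are duplicated, so a pair-aware duplicate count after mapping can give a noticeably different number.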

You may want to take a look at my post here: A: RNAseq library quantification