Illumina Mate Pair Read Duplication Level
9.3 years ago
abc ▴ 40

Hi,

Can anyone please help with some known statistics on read duplication levels for Illumina mate-pair libraries? We have a lane of HiSeq 8kb mate-pair reads (200 million genomic reads, 100bp). FastQC shows ~95% duplicated reads and CLC shows ~92%, which indicates an extremely high level of duplication! When we pointed this out to our sequence providers, they said 80-90% duplication is common in an 8kb mate-pair library. Is that really the case? We understand read duplication can be high in mate-pair libraries; however, if it is in the 80-90% range, are the remaining 10-20% unique reads of any help for projects like de novo assembly (scaffolding, closing gaps, etc.)? Please shed some light if you've faced or seen issues like this.

Cheers.

illumina qc • 5.7k views
9.3 years ago
cts ★ 1.7k

My lab has only just started doing mate-pair sequencing (although 3kb, not 8kb) using the Nextera prep kit. The duplication level is higher than what we've seen, and it does seem to be a huge waste of money with 90% duplication. As for whether the unique fraction is any help to you, it depends on what kind of coverage you're left with. If you're looking to scaffold contigs, then the number of reads you need is surprisingly low. Previously in our lab we tried out mate-pair sequencing with the IonTorrent PGM on a simple metagenome (~20 bacterial and archaeal species) and found that even with ~1-7x coverage of the genomes (depending on the community member) we were able to order contigs (with a lot of manual inspection of the links). That kind of low-coverage data, though, is probably not going to close gaps or establish the complete genome sequence with certainty.
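To make "what kind of coverage you're left with" concrete, here is a rough back-of-the-envelope sketch. It assumes the 200 million figure from the question counts individual reads (so pairs = reads / 2), that the FastQC per-read duplication rate also applies to pairs, and a purely hypothetical 1 Gbp genome, since the genome size is not given in the thread. The function name and parameters are my own.

```python
def remaining_coverage(total_reads, duplication_rate, read_length,
                       insert_size, genome_size):
    """Estimate what is left after removing duplicates.

    Returns (unique_reads, sequence_coverage, physical_coverage).
    physical_coverage is the span coverage of the inserts, which is
    what matters for scaffolding with mate pairs.
    """
    unique_reads = total_reads * (1.0 - duplication_rate)
    unique_pairs = unique_reads / 2  # assumption: reads come in pairs
    sequence_coverage = unique_reads * read_length / genome_size
    physical_coverage = unique_pairs * insert_size / genome_size
    return unique_reads, sequence_coverage, physical_coverage


if __name__ == "__main__":
    # Numbers from the question, plus a hypothetical 1 Gbp genome.
    reads, seq_cov, phys_cov = remaining_coverage(
        total_reads=200_000_000,
        duplication_rate=0.95,
        read_length=100,
        insert_size=8_000,
        genome_size=1_000_000_000,
    )
    print(f"unique reads: {reads:,.0f}")
    print(f"sequence coverage: {seq_cov:.1f}x")
    print(f"physical (insert) coverage: {phys_cov:.1f}x")
```

Under those assumptions the sequence coverage is low, but the physical coverage from 8kb inserts is much higher, which is why even a small unique fraction can still be useful for ordering contigs.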

9.3 years ago

I am sure most labs have faced similar issues. We use SOLiD platforms, but we have seen the same problem, since most platforms use similar protocols for mate-pair libraries. For mate-pair libraries you should start with a fairly large amount of DNA, or you will end up needing extra amplification, which causes high duplication rates.

9.3 years ago
Gabriel R. ★ 2.8k

It probably means that you started with a low number of initial sequences. Was the circularization successful for a sufficient number of DNA fragments? We ran mate-pair libraries on a MiSeq here and saw "high" duplication rates, but nothing like what you mentioned.

9.3 years ago
Chris Whelan ▴ 550

We saw similar duplication levels in mate-pair libraries when we tried them, although that was a couple of years ago. Even in what was left that wasn't duplicates, there were many "innie" read pairs with short insert sizes. Supposedly the Nextera prep helps a lot with mate pair libraries.

9.3 years ago
abc ▴ 40

Thanks everyone for your replies. I thought it was timely to give you an update. We were informed by the service providers that: 1) mate-pair libraries with small insert sizes, e.g. 1kb, 2kb, 3kb (compared to 8kb, 10kb, 20kb, etc.), produce fewer duplicates; 2) use of Illumina's 'Nextera' kit would reduce the read duplication level.

So I was excited when, two days ago, we received new lanes of mate-pair reads using 3kb, 5kb and 8kb inserts and 'Nextera' kits. Ridiculously, all of them have a read duplication level >90%!!!

Can anyone please help me find some published docs/stats, or share lab experience, on this issue? Also, is there any agreed consensus on what the accepted level of duplication should be?


Biostar is a Q&A site, not a forum. I would suggest creating a new question rather than adding an answer that contains a new question.

9.0 years ago

I believe that you should not rely purely on the FastQC report. I recently saw data where most of the reads were unique and FastQC still reported a 95% duplication level, mostly because of a large number of sequences that were each present 10 or more times.

From the FastQC manual:

To cut down on the memory requirements for this module only sequences which occur in the first 200,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level. To cut down on the amount of information in the final plot any sequences with more than 10 duplicates are placed into the 10 duplicates category - so it's not unusual to see a small rise in this final category. If you see a big rise in this final category then it means you have a large number of sequences with very high levels of duplication.

Because the duplication detection requires an exact sequence match over the whole length of the sequence any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. Even so, longer reads are more likely to contain sequencing errors which will artificially increase the observed diversity and will tend to underrepresent highly duplicated sequences.

You may want to take a look at my post here: A: RNAseq library quantification
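To see how the procedure in the manual excerpt above behaves, here is a minimal sketch of that style of counting. It follows only the quoted description (track sequences first seen within the first 200,000 reads, count them to the end of the file, truncate reads over 75bp to 50bp, and pool anything seen 10 or more times into the last bin). It is not FastQC's actual code; the function name, parameters, and the summary fraction it returns are my own assumptions.

```python
from collections import Counter


def duplication_profile(sequences, track_limit=200_000,
                        long_read=75, truncate_to=50):
    """Rough imitation of the duplication counting described in the manual.

    sequences: iterable of read sequences (strings).
    Returns (bins, duplicated_fraction) where bins maps a duplication
    level (1..10, with 10 meaning "10 or more") to the number of distinct
    tracked sequences at that level, and duplicated_fraction is the share
    of tracked reads that deduplication would remove.
    """
    counts = Counter()
    for index, seq in enumerate(sequences):
        # Long reads are truncated so sequencing errors near the 3' end
        # do not hide true duplicates.
        key = seq[:truncate_to] if len(seq) > long_read else seq
        if key in counts:
            counts[key] += 1      # tracked sequence: count to end of file
        elif index < track_limit:
            counts[key] = 1       # only start tracking early sequences

    bins = Counter()
    for copies in counts.values():
        bins[min(copies, 10)] += 1  # pool 10+ copies into the final bin

    tracked_reads = sum(counts.values())
    distinct = len(counts)
    duplicated_fraction = 1 - distinct / tracked_reads if tracked_reads else 0.0
    return dict(bins), duplicated_fraction


if __name__ == "__main__":
    toy = ["AAAA"] * 12 + ["CCCC"] * 2 + ["GGGG", "TTTT"]
    print(duplication_profile(toy))
```

In the toy input, half of the distinct sequences occur only once, yet three quarters of the reads would be removed as duplicates, because one sequence is present 12 times. That is the kind of gap between "most sequences are unique" and "the duplication level is high" worth checking before trusting a single summary number.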


I realized that I had never thought this through, so I spent some time trying to understand it: So what does the sequence duplication rate really mean in a FastQC report
