Our group switched to NovaSeq, and since then we have seen very strange data: extremely high read counts (over 200 million reads in the BAM files for paired-end data), while the number of reads remaining after Picard MarkDuplicates drops drastically. The most extreme case went from 270 million down to 9 million reads. We are talking about ChIP-seq data (transcription factors).
I am not on the experimental side, but as far as I know they did not change the library prep protocol. Is there something we should know, or a way to solve this issue? Can this data be used as it is now?
We did not have this kind of problem before NovaSeq (with the HiSeq).
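For context, the dedup step is a standard Picard MarkDuplicates run along the lines below (file names are placeholders, and REMOVE_DUPLICATES=true is an assumption based on reads being removed rather than just flagged):

    # Mark and remove duplicate read pairs; the metrics file reports
    # the duplicate fraction. Assumes a conda-style 'picard' wrapper.
    picard MarkDuplicates \
        INPUT=sample.bam \
        OUTPUT=sample.dedup.bam \
        METRICS_FILE=sample.dup_metrics.txt \
        REMOVE_DUPLICATES=true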
It could be that, because of the much higher throughput of the NovaSeq machines, your libraries are not complex enough, and you are thus sequencing the same molecules over and over, which does indeed show up as duplicates.
200-300 million reads is much higher than what is normally done for ChIP-seq samples. Since for ChIP-seq you can start with a fairly low number of molecules going into PCR, your initial library complexity is probably low, meaning a low number of unique molecules to sequence per sample. Since PCR duplicates tend to be fairly uniformly distributed, you have probably sequenced most unique molecules by 10 million or so reads, and for the remainder you are just sequencing PCR duplicates of those molecules.
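To put rough numbers on that (a back-of-the-envelope model, assuming uniform sampling of the amplified molecules): if the library contains C unique fragments and you sequence N reads, the expected number of distinct fragments observed is about

    C * (1 - exp(-N / C))

With C = 10 million unique fragments and N = 270 million reads, that gives 10M * (1 - e^-27) ≈ 10 million unique reads, i.e. roughly 96% duplicates, which is in the same ballpark as the 270M -> 9M drop you observed.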
In addition to what others have said, my recommendation is to run clumpify.sh (see: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files) on this dataset. This will allow you to identify duplicates without doing any alignment, and to find out how many optical duplicates you have as opposed to other types of duplicates. It is possible that your facility is overloading these flow cells, in addition to the fact that these are low-complexity libraries to begin with.
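A minimal sketch of that check, assuming BBTools is installed and paired-end FASTQ input (file names are placeholders; verify the dupedist value against the Clumpify documentation, 12000 is the value usually quoted for NovaSeq's patterned flow cells):

    # Pass 1: remove all duplicates (PCR + optical); the drop in read
    # count gives the overall duplicate fraction.
    clumpify.sh in=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
        out=dedup_R1.fastq.gz out2=dedup_R2.fastq.gz dedupe

    # Pass 2: remove optical/clustering duplicates only, i.e. duplicates
    # sitting within dupedist pixels of each other on the flow cell.
    clumpify.sh in=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
        out=optdedup_R1.fastq.gz out2=optdedup_R2.fastq.gz \
        dedupe optical dupedist=12000

Comparing the read counts after each pass tells you how much of the duplication is optical (a sequencing/loading issue) versus PCR (a library complexity issue).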
Thank you all for the very fast replies! I will definitely try out Clumpify.
I would still like to hear your opinions on whether you would redo the sequencing or not.
I would personally not redo the sequencing; in most cases that will not give a "better" result. If you redo the whole experiment (e.g. starting from the sample and library prep), you might be able to run the sequencing more efficiently, but in the end the result you have now is not wrong, it is merely an artifact of the improved sequencing technology.