Hi, have some questions about sequence duplicates in ChIP-seq analysis:
a) When running FastQC on my samples, they give the error on 'Sequence duplication levels'. I'm going to use MACS2 as the peak caller, and I have read in several places that MACS2 removes duplicates automatically. Does this mean that I don't need to worry about removing duplicates with 'samtools rmdup' before peak calling?
b) Some say we should remove duplicates because they are the result of PCR bias, introduced by the amplification step prior to sequencing. Others argue that these duplicates could be biological and so we shouldn't remove them. What do they mean by biological?
c) If we use a control sample (Input DNA), do we still need to remove duplicates?
That's interesting! I removed duplicates with samtools rmdup, and after peak calling with MACS2 I got the same results as without removing duplicates.
So "since MACS2 will ignore them anyway you don't need to remove them."
So, I removed duplicates prior to macs2 (with bbmap clumpify.sh, option 'dedupe') because I was curious how many reads were left after removing duplicates in the fastq files, and by mistake I ran macs2 on both the files without duplicates and the files with duplicates. It gave a different number of peaks! If macs2 removes duplicates, shouldn't the number of peaks be the same?
Then, which is more correct to use: the files with or without duplicates?
Code follows below:
and results were:
Clumpify is removing duplicate reads, macs2 is removing duplicate alignments. You want the latter, not the former. Also any slight change in background noise will change a couple peak calls, which is all the change you're seeing.
Removing duplicate reads does not prevent duplicate alignments?
A unique read can align in multiple places. So read duplication and duplicate/multiple alignments are distinct.
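To make the distinction concrete, here is a minimal toy sketch in Python. It is not how Clumpify or MACS2 are implemented: the `Read` record, the coordinates, and the exact-string matching rule are all assumptions chosen for illustration, and real tools may tolerate mismatches or use other criteria. It only shows that deduplicating by read sequence and deduplicating by alignment position can keep different sets of reads.

```python
from collections import namedtuple

# Hypothetical toy record: read name, sequence, and a made-up alignment
# (chromosome, 5' position, strand). None of this comes from a real file.
Read = namedtuple("Read", "name seq chrom pos strand")

reads = [
    Read("r1", "ACGTACGTAC", "chr1", 100, "+"),
    Read("r2", "ACGTACGTAC", "chr1", 100, "+"),  # exact copy of r1
    Read("r3", "ACGTACGTAT", "chr1", 100, "+"),  # one base differs, same position
    Read("r4", "GGGTTTCCCA", "chr2", 500, "-"),
]

def dedupe_by_sequence(reads):
    """Keep the first read for each distinct sequence (read-level dedup)."""
    seen, kept = set(), []
    for r in reads:
        if r.seq not in seen:
            seen.add(r.seq)
            kept.append(r)
    return kept

def dedupe_by_position(reads):
    """Keep the first read for each (chrom, 5' position, strand) (alignment-level dedup)."""
    seen, kept = set(), []
    for r in reads:
        key = (r.chrom, r.pos, r.strand)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

by_seq = dedupe_by_sequence(reads)
by_pos = dedupe_by_position(reads)
print([r.name for r in by_seq])  # ['r1', 'r3', 'r4']
print([r.name for r in by_pos])  # ['r1', 'r4']
```

In this toy input, r3 survives the sequence-based pass but is still a positional duplicate of r1, so the two approaches hand different read sets to a downstream peak caller.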
Thanks. So, should I remove duplicate reads in the case of ChIP-seq, or is it, as with RNA-seq, still under discussion?
A: Did you remove ChIP-seq duplicates
duplicated read in ChIP seq
I will also tag: ATpoint for an expert opinion.
Difficult to sort out what exactly caused the different peak numbers. By default, if you do not set --keep-dup=all, MACS will remove any duplicates, defined as reads with the same 5' end, prior to fragment pileup. I therefore assume that the alignments differ slightly between the clumpified and non-clumpified files. Maybe check what these different peaks are; perhaps they overlap with ENCODE blacklists or known problematic regions like centromeres or the edges of chromosomes, and therefore should be excluded anyway or at least be de-emphasized. I always mark duplicates with samblaster and then remove them with samtools, together with alignments of MAPQ < 20.
"macs2 is removing duplicate alignments" -> does this mean that it's analogous to doing MAPQ filtering with samtools? I do this before calling peaks with MACS2, but after reading this thread I am wondering if it's redundant. Thank you!
I think you are mixing things up. Read duplicates are those reads with the same start coordinate (single-end) or the same start and end (paired-end). In most cases that means the sequences of the reads are identical too. MAPQ refers to mapping quality, and a low MAPQ indicates that the read likely maps equally well to more than one location in the genome. It is reasonable to filter for a certain MAPQ, as multimappers cannot be reliably assigned to one single location. Read deduplication makes sense, as duplicates can be due to PCR amplification. As one cannot reliably distinguish PCR duplicates from biological ones without the use of Unique Molecular Identifiers (UMI, as in single-cell RNA-seq for example), one typically removes all duplicates. macs does that by default, but it can be turned off.
macs does not filter for MAPQ though, afaik. I always remove alignments with MAPQ < 20 before feeding them into macs. Why 20? Well, because I saw 20 somewhere in a script when I was a newbie; it could also be 10, 30 or 28. Different aligners assign different MAPQ scores, and different aligners also have different MAPQ maxima, so it is somewhat arbitrary. 20 is reasonable though, I think.
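As a rough illustration of how the two filters differ, the sketch below applies a MAPQ cutoff and then a 5'-end deduplication to a handful of invented single-end alignments. Everything here is an assumption for the sake of the example: the `Aln` record, the 0-based half-open coordinates, and the `filter_and_dedupe` helper are hypothetical, and this is not the actual implementation of macs, samblaster or samtools, which work on SAM/BAM records.

```python
from collections import namedtuple

# Hypothetical alignment record; coordinates are 0-based, half-open (an assumption).
Aln = namedtuple("Aln", "name chrom start end strand mapq")

def five_prime(aln):
    """5' end of a single-end alignment: leftmost base on '+', rightmost base on '-'."""
    return aln.start if aln.strand == "+" else aln.end - 1

def filter_and_dedupe(alns, min_mapq=20):
    """Drop alignments below min_mapq, then keep one alignment per (chrom, 5' end, strand)."""
    seen, kept = set(), []
    for a in alns:
        if a.mapq < min_mapq:      # MAPQ filter: independent of duplication
            continue
        key = (a.chrom, five_prime(a), a.strand)
        if key in seen:            # positional duplicate of an earlier alignment
            continue
        seen.add(key)
        kept.append(a)
    return kept

alns = [
    Aln("a1", "chr1", 100, 150, "+", 42),
    Aln("a2", "chr1", 100, 150, "+", 42),  # same 5' end and strand as a1
    Aln("a3", "chr1", 100, 150, "-", 42),  # same span, opposite strand
    Aln("a4", "chr1", 120, 150, "-", 42),  # shorter, but same 5' end as a3
    Aln("a5", "chr2", 10, 60, "+", 3),     # unique position, low MAPQ
]

kept = filter_and_dedupe(alns)
print([a.name for a in kept])  # ['a1', 'a3']
```

Here a5 is removed only by the MAPQ cutoff and a2 and a4 only by deduplication, which is why neither step makes the other redundant. The default of 20 simply mirrors the threshold mentioned above and would need adjusting to the aligner in use.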