If we use MACS2 do we need to remove duplicate sequences with samtools rmdup ?
1
3
Entering edit mode
3.9 years ago
salamandra ▴ 420

Hi, have some questions about sequence duplicates in ChIP-seq analysis:

a) When running fastQC on my samples, they give the error on 'Sequence duplication levels'. I'm going to use MACS2 as peak caller, but read in several places that MACS2 removes duplicates automatically. Does this means that I don't need to worry on removing duplicates with 'samtools rmdup' prior to mapping?

b) Some say we should remove duplicates because they are the result of PCR bias. This is the result of amplification step prior to sequencing. Others defend that these duplicates could be biological and so we shouldn't remove them. What do they mean by biological?

c) If we use a control sample (Input DNA) do we still need to remove duplicates?

ChIP-Seq Duplicates MACS2 • 5.5k views
9
Entering edit mode
3.9 years ago
1. Correct, since MACS2 will ignore them anyway you don't need to remove them.
2. The question is mostly "Are they really PCR duplicates or not?" If you sequence a given area deeply enough you will always have reads/fragments with the same coordinates. It's impossible to know if these are due to being PCR duplicates, or there simply being a lot of signal at a given spot.
3. Having an input sample is irrelevant to this.
1
Entering edit mode

That's interesting! I removed duplications by samtools rmdup while after peak calling using MACS2 I got the same results as that without remove duplications

wc -l ko_*.narrowPeak
125573 ko_nodup_peaks.narrowPeak
125573 ko_peaks.narrowPeak


So "since MACS2 will ignore them anyway you don't need to remove them."

0
Entering edit mode

So, I removed duplicates prior to macs2 ( with bbmap clumpify.sh option 'dedupe') cause I was curious of how many reads were left after removing duplicates in fastq files and by mistake I runned macs2 on both the files without duplicates and files with duplicates. It gave different number of peaks! If macs2 removes duplicates shouldn't the number of peaks be the same?

Then, what is most correct to use the files with or without duplicates?

Code follows below:

cd /home/tbarata/
INPUT=ChIPseq_raw_data
OUTPUT=ChIPseq_analysis_20181026/Results
for i in find $INPUT | grep -i '.*[.]fastq$'
do
FILENAME=$(echo$i | rev | cut -f 1 -d '/' | rev )
echo $FILENAME FILEDIRECTORY=$(echo $i | cut -d'/' -f2- | rev | cut -d'/' -f2- | rev) echo$FILEDIRECTORY
DUPOUTFILE=$FILEDIRECTORY/treated_fastq/withduplicates_$FILENAME
OUTFILE=$FILEDIRECTORY/treated_fastq/$FILENAME
docker pull quay.io/biocontainers/trimmomatic:0.36--5
docker_id=$(docker run -d -t -v /home/tbarata/:/data/ \ quay.io/biocontainers/trimmomatic:0.36--5 bash -c 'touch /data/$OUTPUT/$OUTFILE\_summary; trimmomatic SE -threads 15 -phred33 /data/$i /data/$OUTPUT/$DUPOUTFILE CROP:79 ILLUMINACLIP:/data/allTruSeqAdapSE.fa:2:30:10 LEADING:0 TRAILING:0 SLIDINGWINDOW:0:0 MINLEN:36 AVGQUAL:20 > /data/$OUTPUT/$OUTFILE\_summary')
echo $docker_id docker wait$docker_id
docker run -t -v /home/tbarata/:/data/ quay.io/biocontainers/bbmap:38.16--0 clumpify.sh in=/data/$OUTPUT/$DUPOUTFILE out=/data/$OUTPUT/$OUTFILE dedupe
done


and results were:

wc -l *
505 GFI1B_3TF_day2_macsFDR_0.05_rep1_peaks.narrowPeak
506 withduplicates_GFI1B_3TF_day2_macsFDR_0.05_rep1_peaks.narrowPeak
43 GFI1B_3TF_day2_macsFDR_0.05_rep2_peaks.narrowPeak
47 withduplicates_GFI1B_3TF_day2_macsFDR_0.05_rep2_peaks.narrowPeak

1
Entering edit mode

Clumpify is removing duplicate reads, macs2 is removing duplicate alignments. You want the latter, not the former. Also any slight change in background noise will change a couple peak calls, which is all the change you're seeing.

0
Entering edit mode

Removing duplicate reads does not prevent duplicate alignements?

1
Entering edit mode

A unique read can align in multiple places. So read duplication and duplicate/multiple alignments are distinct.

0
Entering edit mode

thanks So, should I remove duplicate reads in case of chip-seq or is it like as rna-seq that is still on discussion?

0
Entering edit mode

I will also tag: ATpoint for an expert opinion.

1
Entering edit mode

Difficult to sort out what exactly caused the different peak numbers. By default if you do not choose --keep-dup=all, MACS will remove any duplicates as defined by same 5' ends prior to fragment pileup. I therefore assume that the alignment between the clumpified and non-clumpified files is slightly different. Maybe check what these different peaks are, perhaps they overlap with ENCODE blacklists or known problematic regions like centromers or the edges of chromosomes, and therefore should be excluded anyways or at least be de-emphasized. I always mark duplicates with samblaster and then remove them with samtools together with alignments of MAPQ < 20.

0
Entering edit mode

"macs2 is removing duplicate alignments" -> does this mean that it's analogous to do MAPQ filtering with samtools? Because I do this before calling peaks with MACS2 , but after reading this thread I am wondering if it's redundant. Thank you !

0
Entering edit mode

I think you are mixing up things. Read duplicates are those read with the same start coordinate (single-end) or same start and end (paired-end). That means the sequence of the reads is identical. MAPQ refers to mapping quality, and a low MAPQ indicates that it is likely that one read maps equally to more than one location in the genome. It is reasonable to filter for a certain MAPQ as multimappers cannot reliably assigned to one single location. Read deduplication makes sense as duplicates can be due to PCR amplification. As one cannot reliably distinguish without the use of Unique Molecular Identifiers (UMI, as in single-cell RNA-seq for example) one typically removes all duplicates. macs does that by default but it can be turned off. macs does not filter for MAPQ though afaik. I always remove alignments with MAPQ < 20 before feeding into macs. Why 20, well because I saw 20 somewhere in a script when I was a newbie, can also be 10, 30 or 28. Different aligners assign different MAPQ scores, and different aligners also have different MAPQ maxima, so it is somewhat arbitrary. 20 is reasonable though I think.