Question: If we use MACS2, do we need to remove duplicate sequences with samtools rmdup?
11 months ago, salamandra wrote:

Hi, I have some questions about sequence duplicates in ChIP-seq analysis:

a) When I run FastQC on my samples, they fail the 'Sequence Duplication Levels' check. I'm going to use MACS2 as the peak caller, and I have read in several places that MACS2 removes duplicates automatically. Does this mean that I don't need to worry about removing duplicates with 'samtools rmdup' prior to peak calling?

b) Some say we should remove duplicates because they are the result of PCR bias, which arises in the amplification step prior to sequencing. Others argue that these duplicates could be biological, so we shouldn't remove them. What do they mean by biological?

c) If we use a control sample (input DNA), do we still need to remove duplicates?

Tags: chip-seq, macs2, duplicates
11 months ago, Devon Ryan (Freiburg, Germany) wrote:
  1. Correct. Since MACS2 will ignore them anyway, you don't need to remove them.
  2. The question is mostly "Are they really PCR duplicates or not?" If you sequence a given area deeply enough, you will always have reads/fragments with the same coordinates. It's impossible to know whether these are PCR duplicates or whether there is simply a lot of signal at a given spot.
  3. Having an input sample is irrelevant to this.
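
As a rough illustration of point 2, here is a minimal shell sketch on made-up alignment coordinates (not real data). It counts alignments before and after keeping at most one per chromosome, 5' position and strand, which is roughly what MACS2 does with its default --keep-dup 1. It cannot tell PCR copies from genuine signal; it only sees identical coordinates.

```shell
# Toy alignments: chrom, 5' position, strand (invented values for illustration).
alignments='chr1 1000 +
chr1 1000 +
chr1 1000 +
chr1 1000 -
chr1 1500 +
chr2 300 -
chr2 300 -'

# Number of alignments before any duplicate handling.
total=$(printf '%s\n' "$alignments" | wc -l | tr -d ' ')

# Keep at most one alignment per (chrom, 5' position, strand).
kept=$(printf '%s\n' "$alignments" | sort -u | wc -l | tr -d ' ')

echo "total=$total kept=$kept"
```

The three '+' alignments at chr1:1000 collapse to one regardless of whether they came from PCR or from a strongly enriched site.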

So, I removed duplicates prior to MACS2 (with the bbmap option 'dedupe') because I was curious how many reads would be left after removing duplicates from the fastq files, and by mistake I ran MACS2 on both the files without duplicates and the files with duplicates. It gave different numbers of peaks! If MACS2 removes duplicates, shouldn't the number of peaks be the same?

So which is more correct to use: the files with or without duplicates?

Code follows below:

cd /home/tbarata/
for i in $(find "$INPUT" -iname '*.fastq')
do
    FILENAME=$(echo "$i" | rev | cut -f 1 -d '/' | rev)
    FILEDIRECTORY=$(echo "$i" | cut -d'/' -f2- | rev | cut -d'/' -f2- | rev)
    docker pull
    # Trim reads with trimmomatic in a container; double quotes let the host shell expand the variables.
    docker_id=$(docker run -d -t -v /home/tbarata/:/data/ \ bash -c "touch /data/$OUTPUT/${OUTFILE}_summary; trimmomatic SE -threads 15 -phred33 /data/$i /data/$OUTPUT/$DUPOUTFILE CROP:79 ILLUMINACLIP:/data/allTruSeqAdapSE.fa:2:30:10 LEADING:0 TRAILING:0 SLIDINGWINDOW:0:0 MINLEN:36 AVGQUAL:20 > /data/$OUTPUT/${OUTFILE}_summary")
    echo "$docker_id"
    docker wait "$docker_id"
    # Remove duplicate reads from the trimmed fastq.
    docker run -t -v /home/tbarata/:/data/ in=/data/$OUTPUT/$DUPOUTFILE out=/data/$OUTPUT/$OUTFILE dedupe
done

and results were:

wc -l *
505 GFI1B_3TF_day2_macsFDR_0.05_rep1_peaks.narrowPeak
506 withduplicates_GFI1B_3TF_day2_macsFDR_0.05_rep1_peaks.narrowPeak
43 GFI1B_3TF_day2_macsFDR_0.05_rep2_peaks.narrowPeak
47 withduplicates_GFI1B_3TF_day2_macsFDR_0.05_rep2_peaks.narrowPeak
written 6 weeks ago by salamandra

Clumpify removes duplicate reads; MACS2 removes duplicate alignments. You want the latter, not the former. Also, any slight change in background noise will change a couple of peak calls, which is all the change you're seeing.
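
One way to see which peak calls actually changed is to list the peaks in one set that overlap nothing in the other. The sketch below does this in plain awk on invented coordinates; the peak lists and variable names are hypothetical, and on real narrowPeak files a dedicated interval tool such as bedtools would be the usual choice.

```shell
# Toy peak lists (chrom, start, end); coordinates are invented for illustration.
dedup_peaks='chr1 100 200
chr1 500 600
chr2 50 150'
withdup_peaks='chr1 110 210
chr1 500 600
chr2 50 150
chr3 900 1000'

# Print peaks from the second list that overlap no peak in the first list.
only_withdup=$(
  { printf '%s\n' "$dedup_peaks" | sed 's/^/A /'
    printf '%s\n' "$withdup_peaks" | sed 's/^/B /'; } |
  awk '
    # Store every peak of list A.
    $1 == "A" { n++; chr[n] = $2; s[n] = $3 + 0; e[n] = $4 + 0; next }
    # For each peak of list B, look for any overlapping peak of list A.
    $1 == "B" {
      hit = 0
      for (i = 1; i <= n; i++)
        if (chr[i] == $2 && s[i] < $4 + 0 && $3 + 0 < e[i]) { hit = 1; break }
      if (!hit) print $2, $3, $4
    }'
)
echo "$only_withdup"
```

Peaks that merely shift by a few bases (like chr1:110-210 here) still count as shared, so only genuinely new calls are reported.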

written 6 weeks ago by Devon Ryan

Doesn't removing duplicate reads prevent duplicate alignments?

written 6 weeks ago by salamandra

A unique read can align in multiple places. So read duplication and duplicate/multiple alignments are distinct.
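
A small sketch of why sequence-level and alignment-level deduplication give different answers, using invented reads. It assumes exact-sequence matching for the read-level step (real tools may tolerate mismatches) and defines a duplicate alignment as one sharing chromosome, 5' position and strand. Here r2 is a shorter, trimmed copy that starts at the same position as r1.

```shell
# Toy reads after alignment: name, sequence, chrom, 5' position, strand.
reads='r1 ACGTACGTAC chr1 1000 +
r2 ACGTACGT chr1 1000 +
r3 ACGTACGTAC chr1 1000 +
r4 GGGTTTCCCA chr2 300 -'

# Sequence-level deduplication: one read per distinct sequence (column 2).
seq_unique=$(printf '%s\n' "$reads" | awk '!seen[$2]++' | wc -l | tr -d ' ')

# Alignment-level deduplication: one read per (chrom, 5' position, strand).
aln_unique=$(printf '%s\n' "$reads" | awk '!seen[$3 FS $4 FS $5]++' | wc -l | tr -d ' ')

echo "sequence-unique=$seq_unique alignment-unique=$aln_unique"
```

After sequence-level deduplication r1 and r2 both survive, yet they still pile up on the same 5' position, so a duplicate alignment remains.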

written 6 weeks ago by genomax

Thanks. So, should I remove duplicate reads in the case of ChIP-seq, or is it, as with RNA-seq, still under discussion?

written 6 weeks ago by salamandra

Related threads:
A: Did you remove ChIP-seq duplicates
duplicated read in ChIP seq

I will also tag: ATpoint for an expert opinion.

written 6 weeks ago by genomax

It is difficult to sort out what exactly caused the different peak numbers. By default, unless you set --keep-dup=all, MACS will remove duplicates (defined as reads with the same 5' end) prior to fragment pileup. I therefore assume that the alignments of the clumpified and non-clumpified files are slightly different. Maybe check what these differing peaks are; perhaps they overlap with ENCODE blacklists or known problematic regions such as centromeres or the edges of chromosomes, and should therefore be excluded anyway or at least be de-emphasized. I always mark duplicates with samblaster and then remove them with samtools, together with alignments of MAPQ < 20.
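
To show what the last filtering step amounts to, here is a minimal awk sketch on invented SAM-like records truncated to their first five columns. It drops records with the duplicate bit (FLAG 1024) set and records with MAPQ below 20. On a real BAM file the same filter is usually written with samtools view using -F 1024 and -q 20; the duplicate-marking step itself is assumed to have been done already and is not reproduced here.

```shell
# Toy SAM-like records, first five columns only: QNAME FLAG RNAME POS MAPQ.
# FLAG 1024 (0x400) marks a duplicate; 16 marks a reverse-strand alignment.
records=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  r1 0    chr1 1000 42 \
  r2 1024 chr1 1000 42 \
  r3 16   chr1 2000 5 \
  r4 1040 chr2 300  37 \
  r5 16   chr2 800  20)

# Keep records whose duplicate bit is unset and whose MAPQ is at least 20.
kept=$(printf '%s\n' "$records" |
  awk -F '\t' 'int($2 / 1024) % 2 == 0 && $5 >= 20 { print $1 }' |
  tr '\n' ' ' | sed 's/ $//')

echo "kept: $kept"
```

Here r2 and r4 are dropped as marked duplicates and r3 for low mapping quality, leaving r1 and r5.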

written 6 weeks ago by ATpoint