Question: If we use MACS2 do we need to remove duplicate sequences with samtools rmdup ?
salamandra wrote (15 months ago):

Hi, I have some questions about sequence duplicates in ChIP-seq analysis:

a) When I run FastQC on my samples, they are flagged in the 'Sequence duplication levels' module. I'm going to use MACS2 as the peak caller, and I have read in several places that MACS2 removes duplicates automatically. Does this mean that I don't need to worry about removing duplicates with 'samtools rmdup' prior to peak calling?

b) Some say we should remove duplicates because they result from PCR bias introduced during the amplification step prior to sequencing. Others argue that these duplicates could be biological, so we shouldn't remove them. What do they mean by biological?

c) If we use a control sample (input DNA), do we still need to remove duplicates?

Tags: chip-seq, macs2, duplicates
Devon Ryan (Freiburg, Germany) wrote (15 months ago):
  1. Correct, since MACS2 will ignore them anyway you don't need to remove them.
  2. The question is mostly "Are they really PCR duplicates or not?" If you sequence a given area deeply enough you will always have reads/fragments with the same coordinates. It's impossible to know if these are due to being PCR duplicates, or there simply being a lot of signal at a given spot.
  3. Having an input sample is irrelevant to this.
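To make "reads/fragments with the same coordinates" concrete, here is a toy sketch. It is not MACS2's code, only the idea of position-based duplicate filtering: alignments that share chromosome, strand and 5' end are collapsed to one. As far as I know, this corresponds to MACS2's default of `--keep-dup 1` (one read kept per location and strand). The function names and the alignments are made up for illustration.

```shell
# Toy illustration only: keep the first alignment seen for each
# (chromosome, five-prime end, strand) combination.
# Input: tab-separated BED6-like lines: chrom, start, end, name, score, strand.
dedup_by_position() {
  awk -F'\t' '
    {
      # five-prime end: start for + strand reads, end for - strand reads
      five = ($6 == "-") ? $3 : $2
      key = $1 FS five FS $6
      if (!(key in seen)) { seen[key] = 1; print }
    }'
}

# Hypothetical alignments: r1, r2 and r3 stack on the same position and strand.
demo_alignments() {
  printf 'chr1\t100\t150\tr1\t60\t+\n'
  printf 'chr1\t100\t150\tr2\t60\t+\n'
  printf 'chr1\t100\t150\tr3\t60\t+\n'
  printf 'chr1\t100\t150\tr4\t60\t-\n'
  printf 'chr1\t220\t270\tr5\t60\t+\n'
}

# Prints the alignments that survive the position-based filter.
demo_alignments | dedup_by_position
```

Note that the filter cannot tell whether r2 and r3 were PCR copies of r1 or independent fragments from a strongly enriched site, which is exactly the ambiguity described in point 2.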

So, I removed duplicates prior to MACS2 (with BBMap's clumpify.sh, option 'dedupe') because I was curious how many reads would be left after removing duplicates from the FASTQ files, and by mistake I ran MACS2 on both the files without duplicates and the files with duplicates. It gave a different number of peaks! If MACS2 removes duplicates, shouldn't the number of peaks be the same?

Then, which is more correct to use: the files with or without duplicates?

Code follows below:

cd /home/tbarata/ || exit 1
INPUT=ChIPseq_raw_data
OUTPUT=ChIPseq_analysis_20181026/Results
# Pull each image once, outside the loop.
docker pull quay.io/biocontainers/trimmomatic:0.36--5
docker pull quay.io/biocontainers/bbmap:38.16--0
# Note: this loop assumes that no path contains whitespace.
for i in $(find "$INPUT" -iname '*.fastq')
do
  FILENAME=$(basename "$i")
  echo "$FILENAME"
  # Directory of the file, relative to $INPUT.
  REL=${i#"$INPUT"/}
  FILEDIRECTORY=$(dirname "$REL")
  echo "$FILEDIRECTORY"
  DUPOUTFILE=$FILEDIRECTORY/treated_fastq/withduplicates_$FILENAME
  OUTFILE=$FILEDIRECTORY/treated_fastq/$FILENAME
  mkdir -p "$OUTPUT/$FILEDIRECTORY/treated_fastq"
  # Trim reads. The command is in double quotes so that the host shell expands
  # the variables; inside single quotes they would be empty in the container.
  docker_id=$(docker run -d -t -v /home/tbarata/:/data/ \
    quay.io/biocontainers/trimmomatic:0.36--5 bash -c "trimmomatic SE -threads 15 -phred33 '/data/$i' '/data/$OUTPUT/$DUPOUTFILE' CROP:79 ILLUMINACLIP:/data/allTruSeqAdapSE.fa:2:30:10 LEADING:0 TRAILING:0 SLIDINGWINDOW:0:0 MINLEN:36 AVGQUAL:20 > '/data/$OUTPUT/${OUTFILE}_summary'")
  echo "$docker_id"
  docker wait "$docker_id"
  # Remove duplicate reads from the trimmed FASTQ with Clumpify.
  docker run -t -v /home/tbarata/:/data/ quay.io/biocontainers/bbmap:38.16--0 clumpify.sh in="/data/$OUTPUT/$DUPOUTFILE" out="/data/$OUTPUT/$OUTFILE" dedupe
done

and results were:

wc -l *
505 GFI1B_3TF_day2_macsFDR_0.05_rep1_peaks.narrowPeak
506 withduplicates_GFI1B_3TF_day2_macsFDR_0.05_rep1_peaks.narrowPeak
43 GFI1B_3TF_day2_macsFDR_0.05_rep2_peaks.narrowPeak
47 withduplicates_GFI1B_3TF_day2_macsFDR_0.05_rep2_peaks.narrowPeak
written 5 months ago by salamandra
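For the original goal of seeing how many reads are left after Clumpify, a small sketch along these lines may help. It assumes uncompressed FASTQ files with exactly four lines per read; the function name and the demo files are hypothetical stand-ins for the withduplicates_* and deduplicated files written by the loop above.

```shell
# Count records in an uncompressed FASTQ file, assuming 4 lines per read.
count_fastq_reads() {
  awk 'END { print int(NR / 4) }' "$1"
}

# Hypothetical demo files standing in for the FASTQs with and without duplicates.
with_dups=$(mktemp)
deduped=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTGA\n+\nIIII\n' > "$with_dups"
printf '@r1\nACGT\n+\nIIII\n@r3\nTTGA\n+\nIIII\n' > "$deduped"

before=$(count_fastq_reads "$with_dups")
after=$(count_fastq_reads "$deduped")
echo "reads before: $before, after: $after, removed: $((before - after))"
rm -f "$with_dups" "$deduped"
```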

Clumpify removes duplicate reads (judged by sequence, before alignment), while MACS2 removes duplicate alignments (judged by mapped position). You want the latter, not the former. Also, any slight change in background noise will change a couple of peak calls, which is all the change you're seeing.

written 5 months ago by Devon Ryan
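A toy sketch of why the two kinds of deduplication need not agree. The table is made up, and exact string matching is a simplification (real tools such as Clumpify may tolerate mismatches). Here r3 was trimmed shorter than r1 and r2, so it differs as a sequence but still maps to the same 5' position and strand.

```shell
# Hypothetical reads: name, sequence, chrom, five-prime position, strand.
toy_reads() {
  printf 'r1\tACGTACGT\tchr1\t100\t+\n'
  printf 'r2\tACGTACGT\tchr1\t100\t+\n'   # same sequence, same position
  printf 'r3\tACGTAC\tchr1\t100\t+\n'     # trimmed shorter, same position
  printf 'r4\tTTGGCCAA\tchr2\t500\t-\n'
}

# Sequence-level view: how many distinct read sequences are there?
count_unique_sequences() { cut -f2 | sort -u | wc -l | tr -d ' '; }

# Alignment-level view: how many distinct (chrom, position, strand) are there?
count_unique_positions() { cut -f3-5 | sort -u | wc -l | tr -d ' '; }

echo "unique by sequence: $(toy_reads | count_unique_sequences)"
echo "unique by position: $(toy_reads | count_unique_positions)"
```

In this toy case, sequence-based deduplication keeps three reads, while position-based deduplication keeps two, so a peak caller that filters by position still has work to do after read-level deduplication.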

Doesn't removing duplicate reads prevent duplicate alignments?

written 5 months ago by salamandra

A unique read can align in multiple places. So read duplication and duplicate/multiple alignments are distinct.

written 5 months ago by genomax

Thanks. So, should I remove duplicate reads for ChIP-seq, or is it, as with RNA-seq, still under discussion?

written 5 months ago by salamandra

Related threads: "A: Did you remove ChIP-seq duplicates" and "duplicated read in ChIP seq".

I will also tag: ATpoint for an expert opinion.

written 5 months ago by genomax

It is difficult to sort out what exactly caused the different peak numbers. By default, unless you set --keep-dup=all, MACS2 removes duplicates, defined as reads with the same 5' end, prior to fragment pileup. I therefore assume that the alignments of the clumpified and non-clumpified files are slightly different. Maybe check what these different peaks are; perhaps they overlap with ENCODE blacklists or known problematic regions like centromeres or the edges of chromosomes, and therefore should be excluded anyway, or at least be de-emphasized. I always mark duplicates with samblaster and then remove them with samtools, together with alignments of MAPQ < 20.

written 5 months ago by ATpoint
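One way to "check what these different peaks are" is to list the peaks in one narrowPeak file that overlap nothing in the other. In practice `bedtools intersect -v -a A -b B` is the usual tool for this; the awk sketch below only illustrates the same idea without extra dependencies, and is fine for a few hundred peaks. The function name and the demo coordinates are made up. The same overlap logic would apply when comparing peaks against a blacklist BED file.

```shell
# Print peaks from file A ($1) that overlap no peak in file B ($2).
# Both files are tab-separated with chrom, start, end in columns 1-3.
peaks_unique_to() {
  awk -F'\t' '
    # First file given to awk is B: remember its intervals.
    FILENAME == ARGV[1] { n++; chr[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; next }
    # Second file is A: print a peak only if no B interval overlaps it.
    {
      hit = 0
      for (i = 1; i <= n; i++)
        if (chr[i] == $1 && s[i] < $3 + 0 && e[i] > $2 + 0) { hit = 1; break }
      if (!hit) print
    }' "$2" "$1"
}

# Hypothetical peak sets standing in for the with/without-duplicates results.
a=$(mktemp)
b=$(mktemp)
printf 'chr1\t100\t200\tpeak_1\nchr1\t500\t600\tpeak_2\nchr2\t50\t90\tpeak_3\n' > "$a"
printf 'chr1\t110\t190\tpeak_1\nchr2\t40\t95\tpeak_2\n' > "$b"

# Lists the peaks found only in the first file.
peaks_unique_to "$a" "$b"
rm -f "$a" "$b"
```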

That's interesting! I removed duplicates with samtools rmdup, and peak calling with MACS2 gave the same results as without removing duplicates:

wc -l ko_*.narrowPeak
  125573 ko_nodup_peaks.narrowPeak
  125573 ko_peaks.narrowPeak

So "since MACS2 will ignore them anyway you don't need to remove them."

written 15 days ago by Wang (modified 15 days ago by genomax)
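A caveat on comparing with `wc -l`: equal line counts show that both runs called the same number of peaks, not necessarily the same peaks. A minimal sketch for comparing the coordinates themselves, assuming tab-separated narrowPeak files with chrom, start and end in the first three columns (the function name and the demo files are hypothetical):

```shell
# Return 0 if both peak files contain the same set of (chrom, start, end).
same_peak_coordinates() {
  sorted_a=$(mktemp)
  sorted_b=$(mktemp)
  cut -f1-3 "$1" | sort > "$sorted_a"
  cut -f1-3 "$2" | sort > "$sorted_b"
  cmp -s "$sorted_a" "$sorted_b"
  status=$?
  rm -f "$sorted_a" "$sorted_b"
  return "$status"
}

# Hypothetical demo: same number of lines, but one peak has shifted.
p1=$(mktemp)
p2=$(mktemp)
printf 'chr1\t100\t200\tpeak_1\nchr1\t500\t600\tpeak_2\n' > "$p1"
printf 'chr1\t100\t200\tpeak_1\nchr1\t520\t610\tpeak_2\n' > "$p2"

if same_peak_coordinates "$p1" "$p2"; then
  echo "identical peak coordinates"
else
  echo "same line count, different coordinates"
fi
rm -f "$p1" "$p2"
```

With real data this would be called as `same_peak_coordinates ko_nodup_peaks.narrowPeak ko_peaks.narrowPeak`.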