ChipSeq using MACS2 - option to not call peaks with only few reads ?
1
3
Entering edit mode
3.0 years ago

I'm a bit lost in my ChipSeq experiment (first time)

I'm studying multiples histone marks in mouse but i'll focus on H2AZ which is a sharp mark. I've got 2 conditions (aB which is stimulating, rB which is resting) and I want to know which peaks are only in rB and which peaks are only in rB in this histone mark.

I've aligned my fastq using BWA (aB_H2AZ, rB_H2AZ, aB_input and rB_input)

I called my peaks using MACS2

macs2 callpeak -t aligned/aB_H2AZ.bam -c aligned/aB_input.bam -g mm --outdir peakcall -B -n aB_H2AZ
macs2 callpeak -t aligned/rB_H2AZ.bam -c aligned/rB_input.bam -g mm --outdir peakcall -B -n aB_H2AZ


Then, I tried to call the differential peaks using :

macs2 bdgdiff --t1 peakcall/aB_H2AZ_treat_pileup.bdg --t2 peakcall/rB_H2AZ_treat_pileup.bdg --c1 peakcall/aB_H2AZ_control_lambda.bdg --c2 peakcall/rB_H2AZ_control_lambda.bdg --d1 42559264 --d2 27689824 --min-len 150 --max-gap 100 --cutoff 0 --outdir bdgdiff -o aB_H2AZ.diff.bdg rB_H2AZ.diff.bdg common_H2AZ.diff.bdg


--d1 : Number of mapped reads in aB_H2AZ

--d2 : Number of mapped reads in rB_H2AZ

When I'm scrolling on IGV to look at peaks discovery I've got some good peaks with a correct likelihood ratio (15) for a correct amount of mapped reads ratio aB/rB (200/30). But I also get a correct likelihood ratio (8) for a dumb amount of mapped reads ratio aB/rB (5/0). I've looked at the callpeak and bdgdiff parameters but I can't find any option to not call peaks with very few reads...

macs2 ChIP-Seq • 2.0k views
1
Entering edit mode

Assuming you do not have replicates to run a bayesian framework such as csaw, I would think about some prefiltering prior to bdgdiff. Maybe use either a window-based approach (maybe half the average peak width) or the peak coordinates themselves to remove from the bedgraphs all entries when having enrichments or depletions larger than a certain threshold & a count in either of the conditions lower than a certain threshold. For the count threshold, I would probably inspect an MA plot to get an idea at which A (x axis) the M's become unreasonably large.

3
Entering edit mode
3.0 years ago

As there is no perfect answer but only homemade solutions, here is mine according to what I'm interested in. So, as suggested by ATpoint, I need to find a "number of mapped reads on a given area" threshold to limit the call of false positive peaks (bin with high fold change but low mapped reads). With this threshold I should be able to detect which bin may disturb the analysis and remove from bam files the reads involved in those bins.

I started with the raw bam files, I will take one mark for this example, let's say H2AZ like in my post, I have 4 files :

• aB_H2AZ.bam
• rB_H2AZ.bam
• aB_input.bam
• rB_input.bam

## Bam to bedgraph

First step get a bedgraph file with the reads count per bin. For this I used bamCoverage.

--binSize 100 : is a good compromise between file size (which could be very large for bin at 10, for example) and useful count information.

--region chr12 : I only focus on chromosome 12

--normalizeUsing RPGC : Normalization method implement in bamCoverage which is number of reads per bin / (total number of mapped reads * fragment length) / effective genome size

--effectiveGenomeSize 2652783500 : In the RPGC formula we need the effective genome size which is 2652783500 for mm10

Note : By defaut bamCoverage generate bedgraph merging bins with the same read count values, in order to save space. As I wanted to compared read counts per bin I wanted to get the exact same number of bin with the same bin length in each file. Discussion with one of the author leads to uncomment some lines in writeBedGraph.py

https://github.com/deeptools/deepTools/blob/master/deeptools/writeBedGraph.py#L248-L256

## Mean standard deviation plot

In order to catch potential false positive peaks comming from bins analysis, I created a meanSDplot under R, using DESeq2 functions, to visualize the log2(SD) over thelog2(mean)

Create a merge file of the bedgraph from last step to create a single matrix in R

Note : Inspecting my meanSDplot, I chose a threshold of log(mean) = 3, to remove high log(SD) with low log(mean)? Means that every bins which less than 2^3 = 8 reads as mean (aB/rB) will be discarded. This is why there is a threshold attribute in run_meanSDplot function. First run the code without threshold, then when you think you've got one, fill it to get the best_bins

R code :

The script output 2 things :

• MeanSDplot to chose a threshold
• List of the "best" bins passing the threshold

## High counts bam files

Create new bam file from the raw one with only reads falling in the "best" bin, using bedtools intersect

## Peak calling

From here, the filtering part is over, just call your peaks using macs2 callpeak

macs2 callpeak -t aB_H2AZ.best.bam -c aB_input.bam -g mm --outdir peakcall --nomodel --extsize 147 --mfold 5 50 -B -n aB_H2AZ
macs2 callpeak -t rB_H2AZ.best.bam -c rB_input.bam -g mm --outdir peakcall --nomodel --extsize 147 --mfold 5 50 -B -n rB_H2AZ


Note : I used --nomodel and --extsize 147 because macs2 struggle to build a model with few number of peaks (due to our filtering). The result will go to peakcall directory

## Differential peaks

Last but not least, find peaks that are only present in active condition, resting condition or present in both.

We will use macs2 bdgdiff which needs the number of mapped reads to apply a library size normalization, using samtools

#Number of mapped reads for raw bam file
for file in *.bam; do echo $file; samtools view -F 0x904 -c$file; done;
#aB_H2AZ.bam
#42559264
#rB_H2AZ.bam
#27689824


Then, run the command line :

macs2 bdgdiff --t1 peakcall/aB_H2AZ_treat_pileup.bdg --t2 peakcall/rB_H2AZ_treat_pileup.bdg --c1 peakcall/aB_H2AZ_control_lambda.bdg --c2 peakcall/rB_H2AZ_control_lambda.bdg --d1 42559264 --d2 27689824 --min-len 150 --max-gap 100 --outdir bdgdiff -o aB_H2AZ.diff.bdg rB_H2AZ.diff.bdg common_H2AZ.diff.bdg


Note : H2AZ is a sharped mark so no need to call the --broad parameter here.

And, that is it, method looks not perfect as there is some a priori with the threshold value + there is no duplicates. But we managed to limit the effect of false positive peak calling due to some low counts, which is nice.