As there is no perfect answer but only homemade solutions, here is mine according to what I'm interested in.
So, as suggested by ATpoint, I need to find a "number of mapped reads on a given area" threshold to limit the call of false positive peaks (bin with high fold change but low mapped reads). With this threshold I should be able to detect which bin may disturb the analysis and remove from bam files the reads involved in those bins.
I started with the raw bam files, I will take one mark for this example, let's say H2AZ like in my post, I have 4 files :
Bam to bedgraph
First step get a bedgraph file with the reads count per bin. For this I used bamCoverage.
--binSize 100 : is a good compromise between file size (which could be very large for bin at 10, for example) and useful count information.
--region chr12 : I only focus on chromosome 12
--normalizeUsing RPGC : Normalization method implement in bamCoverage which is number of reads per bin / (total number of mapped reads * fragment length) / effective genome size
--effectiveGenomeSize 2652783500 : In the RPGC formula we need the effective genome size which is
2652783500 for mm10
Note : By defaut bamCoverage generate bedgraph merging bins with the same read count values, in order to save space. As I wanted to compared read counts per bin I wanted to get the exact same number of bin with the same bin length in each file. Discussion with one of the author leads to uncomment some lines in
Mean standard deviation plot
In order to catch potential false positive peaks comming from bins analysis, I created a meanSDplot under R, using DESeq2 functions, to visualize the
log2(SD) over the
Create a merge file of the bedgraph from last step to create a single matrix in R
Note : Inspecting my meanSDplot, I chose a threshold of log(mean) = 3, to remove high
log(SD) with low
log(mean)? Means that every bins which less than 2^3 = 8 reads as mean (aB/rB) will be discarded. This is why there is a
threshold attribute in
run_meanSDplot function. First run the code without threshold, then when you think you've got one, fill it to get the
R code :
The script output 2 things :
- MeanSDplot to chose a threshold
- List of the "best" bins passing the threshold
High counts bam files
Create new bam file from the raw one with only reads falling in the "best" bin, using bedtools intersect
From here, the filtering part is over, just call your peaks using macs2
macs2 callpeak -t aB_H2AZ.best.bam -c aB_input.bam -g mm --outdir peakcall --nomodel --extsize 147 --mfold 5 50 -B -n aB_H2AZ
macs2 callpeak -t rB_H2AZ.best.bam -c rB_input.bam -g mm --outdir peakcall --nomodel --extsize 147 --mfold 5 50 -B -n rB_H2AZ
Note : I used
--extsize 147 because macs2 struggle to build a model with few number of peaks (due to our filtering). The result will go to peakcall directory
Last but not least, find peaks that are only present in active condition, resting condition or present in both.
We will use macs2
bdgdiff which needs the number of mapped reads to apply a library size normalization, using samtools
#Number of mapped reads for raw bam file
for file in *.bam; do echo $file; samtools view -F 0x904 -c $file; done;
Then, run the command line :
macs2 bdgdiff --t1 peakcall/aB_H2AZ_treat_pileup.bdg --t2 peakcall/rB_H2AZ_treat_pileup.bdg --c1 peakcall/aB_H2AZ_control_lambda.bdg --c2 peakcall/rB_H2AZ_control_lambda.bdg --d1 42559264 --d2 27689824 --min-len 150 --max-gap 100 --outdir bdgdiff -o aB_H2AZ.diff.bdg rB_H2AZ.diff.bdg common_H2AZ.diff.bdg
Note : H2AZ is a sharped mark so no need to call the
--broad parameter here.
And, that is it, method looks not perfect as there is some a priori with the threshold value + there is no duplicates. But we managed to limit the effect of false positive peak calling due to some low counts, which is nice.