Entering edit mode
3.5 years ago
snehu.sambare
▴
10
I ran MACS2 on a paired-end CTCF bam file with the following options:
-t treatment -c control -f BAMPE -g 2.9e+9 -p 1e-5 -B.
Upon visualizing the narrowPeak file in IGV, all the peaks are called at the centromeres of the chromosomes.
What has gone wrong with my peak-calling? How can I call-peaks on paired-end data (150bp each read)?
If I remember correctly then you get a summary file from MACS2 with all the peaks called. So one obvious question would be, how many peaks did you call? That there is noise at the centromeres and telomeres is normal (repetitive nature of the sequences there and potential mapping issues I guess if not filtered out).
Are you sure it's the peak calling and not the experiment itself? What organism? (human I'm guessing). How many million reads were generated? What mapping parameters were used? Does a high percentage map to your genome? How many peaks did MACS return? What does the distribution of p-values look like? Can you compare your data set to a positive control? Hundreds of CTCF data sets exist at GEO (e.g. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE140252), you could pull down one of those and prove to yourself that you can call peaks with a known data set.
Make a bigwig file with deeptools
bamCoverage
and inspect the data on the IGV, then you see whether this is a peak calling artifact or just the nature of the dataset you have. Or check this pipeup macs created with-B
on a browser, should be rather similar. Also please report total number of peakswc -l your_file.narrowPeak
. Command looks ok-ish, i'd remove the-p
as you should go with the default FDR cutoffs (-q 0.05) unless you have a good reason not to.