Hello everyone,
I'm learning how to use MACS2 for peak calling and I'm having an issue where 2/6 replicates are only calling 5-6 peaks total (all on ChrM) while the others are successfully calling several thousand peaks across the genome. I working with a published dataset, and based on the published bed files there should be several thousand peaks called for those files that are only calling a handful of erroneous peaks. Does anyone have suggestions of which part of the process I should be looking into for a possible error?
Here's my process:
Download fastq files from SRA fastq-dump --gzip --skip-technical --readids --read-filter pass --dumpbase --split-e --clip SRR<file number="">
Align using bowtie2 (with hg38 index files as a reference) bowtie2 -p 2 -q --local -x <genome files=""> -U <fastq file=""> -S <.sam output file>
Convert from SAM to BAM samtools view -h -S -b -@ 3 -o <.bam output file> <.sam file>
Sort BAM file y genomic coordinates sambamba sort -t 3 -o <sorted .bam="" file=""> <.bam input file>
Filter out multi-mappers and duplicates sambamba view -h -t 3 -f bam -F "[XS] == null and not unmapped and not duplicate" <sorted .bam="" file="" input=""> > <filtered .bam="" output="">
Run MACS2 callpeak macs2 callpeak -t <experimental filtered.bam=""> -c < input filtered.bam> -f BAM -g hs -n <name> --bdg
The alignment for BAM comes out well for both the files that call several thousand peaks and the ones that only call ChrM peaks (99% alignment). MACS2 runs to completion and outputs all of the correct files, but calls very few peaks. The only difference I have noticed between the files that work and those that don't is that for the ones that call a lot of peaks of a d=~450, while for the ones that don't work the d=~125. Any help on where I should look for a possible source of error would be greatly appreciated!