I have many ChIP-Seq datasets that include replicates. First, I aligned the FASTQ files to the reference genome separately, then merged the resulting BAM files into one larger BAM file and used MACS for peak calling. However, many papers do not merge the BAM files; instead, they call peaks on each replicate separately and then merge the peaks produced by MACS. Does anyone know which method is better? And how should I merge the peaks generated by MACS?
I recommend running the phantompeakqualtools cross-correlation analysis.
Check the column 11 values in its output. If the replicates have values close to each other, you can merge those samples and do a single round of peak calling. Otherwise, call peaks separately and then either merge the peaks or take the peaks common to both calls.
COL11: QualityTag: Quality tag based on thresholded RSC (codes: -2:veryLow,-1:Low,0:Medium,1:High,2:veryHigh)
Also recheck/verify the samples using deepTools 'plotFingerprint'.
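The column 11 check can be scripted. The following is only a minimal sketch: it assumes the tab-separated, headerless result line that phantompeakqualtools writes with its -out option (11 columns ending in NSC, RSC and QualityTag). The example lines contain made-up numbers, and the "close to each other" cutoff (`max_tag_diff`) is an arbitrary assumption, not a published threshold.

```python
# Column layout assumed for one phantompeakqualtools "-out" result line.
COLUMNS = [
    "filename", "num_reads", "est_frag_len", "corr_est_frag_len",
    "phantom_peak", "corr_phantom_peak", "argmin_corr", "min_corr",
    "nsc", "rsc", "quality_tag",
]


def parse_spp_line(line):
    """Parse one tab-separated result line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    record["nsc"] = float(record["nsc"])
    record["rsc"] = float(record["rsc"])
    record["quality_tag"] = int(record["quality_tag"])  # column 11: -2 .. 2
    return record


def replicates_agree(rec_a, rec_b, max_tag_diff=1):
    """Hypothetical rule: tags are 'close' if they differ by <= max_tag_diff."""
    return abs(rec_a["quality_tag"] - rec_b["quality_tag"]) <= max_tag_diff


# Made-up example lines, for illustration only.
rep1 = parse_spp_line("rep1.bam\t12000000\t180\t0.25\t40\t0.22\t1500\t0.18\t1.389\t1.75\t2")
rep2 = parse_spp_line("rep2.bam\t11000000\t175\t0.206\t40\t0.20\t1500\t0.18\t1.144\t1.3\t1")
rep3 = parse_spp_line("rep3.bam\t9000000\t160\t0.187\t40\t0.20\t1500\t0.18\t1.039\t0.35\t-1")

print(replicates_agree(rep1, rep2))  # tags 2 and 1
print(replicates_agree(rep1, rep3))  # tags 2 and -1
```

With these invented values, rep1 and rep2 would be candidates for pooling, while rep3 would be handled separately.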
The best approach is to do peak calling separately on each replicate (make sure to use input) and then assess the quality of each replicate with either phantompeakqualtools, if you have single-end read data (Reference: https://sites.google.com/site/anshulkundaje/projects/idr),
or ChiLin (https://www.ncbi.nlm.nih.gov/pubmed/27716038), if you have paired-end data. Please remember that SPP can only be used for single-end read data, so you are better off using the macs2 peak caller.
In more recent papers, calculating Pearson's correlation of read density between replicates is often regarded as a better check than IDR, so you should give that a try as well.
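To make the correlation idea concrete, here is a minimal sketch of Pearson's r computed on per-region read counts from two replicates. The counts are invented for illustration, and how you obtain them (for example, counting reads in fixed bins or in peak regions) is left open; on real data a tool such as deepTools multiBamSummary followed by plotCorrelation would normally do this for you.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical read counts over the same six regions in two replicates.
rep1_counts = [120, 35, 410, 12, 88, 260]
rep2_counts = [110, 40, 395, 20, 95, 240]

r = pearson(rep1_counts, rep2_counts)
print(f"Pearson r between replicates: {r:.3f}")
```

A value close to 1 suggests the replicates have similar read density profiles; what counts as "close enough" is a judgment call that depends on the experiment.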
Then select only those replicates that overlap significantly. After that, you can merge the peaks from each replicate. It is best to perform downstream analysis only on the peaks that overlap between replicates. Use bedtools to merge the peaks.
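In practice this is usually done with bedtools itself (bedtools intersect for the overlapping peaks, bedtools merge on a sorted file for merging). To show what those two operations compute, here is a small pure-Python sketch on made-up peaks in BED-style half-open coordinates. The function names are hypothetical, and the overlap check is a simple quadratic loop, so this is for illustration only, not for genome-scale peak lists.

```python
def overlapping_peaks(peaks_a, peaks_b):
    """Return peaks from peaks_a that overlap (>= 1 bp) any peak in peaks_b."""
    kept = []
    for chrom, start, end in peaks_a:
        if any(c == chrom and s < end and start < e for c, s, e in peaks_b):
            kept.append((chrom, start, end))
    return kept


def merge_intervals(peaks):
    """Merge overlapping or directly adjacent peaks into single intervals."""
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Made-up peaks as (chrom, start, end) tuples.
rep1_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
rep2_peaks = [("chr1", 150, 250), ("chr2", 300, 400)]

common = overlapping_peaks(rep1_peaks, rep2_peaks)
union = merge_intervals(rep1_peaks + rep2_peaks)

print(common)  # replicate 1 peaks supported by replicate 2
print(union)   # all peaks from both replicates, merged
```

Here `common` corresponds to keeping only reproducible peaks, while `union` corresponds to merging the peak sets; which one to carry forward depends on how conservative the downstream analysis should be.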