I’m diving into the use of MACS3 for analyzing paired-end sequencing data, especially focusing on telomeric regions. I’m exploring how variations in the keep-dup parameter impact peak detection and coverage assessment.
In my experiments, setting keep-dup=all retains all tags, while switching to keep-dup=1 drastically reduces the tag count from approximately 46 million to just over 10 million. This raises important questions about how to accurately evaluate coverage given the substantial drop in retained tags.
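For scale, the drop described above can be turned into an approximate duplication fraction. This is only a back-of-the-envelope sketch using the rounded counts from the post (about 46 million and about 10 million tags); the function name is made up for illustration.

```python
def duplicate_fraction(total_tags: int, retained_tags: int) -> float:
    """Fraction of tags discarded when duplicates are collapsed."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    if not 0 <= retained_tags <= total_tags:
        raise ValueError("retained_tags must be between 0 and total_tags")
    return 1 - retained_tags / total_tags


# Rounded counts from the post (assumed, not exact):
# ~46 million tags with keep-dup=all, ~10 million with keep-dup=1.
fraction = duplicate_fraction(46_000_000, 10_000_000)
print(f"{fraction:.1%} of tags were removed as duplicates")
```

With these rounded numbers, roughly 78% of the tags are treated as duplicates.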
I’m also considering additional metrics beyond peak length and chromosome distribution, using the -B and --SPMR options for comparing BigWig files. Are these sufficient for this analysis, and for determining which modifications are better?
Additionally, I’m contemplating the -f BAMPE option for more precise insert size estimation, given the paired-end nature of my data. However, the enrichment in telomeric regions, where reads may have low mapping quality (MAPQ = 0), makes me wonder about its effect on insert size accuracy. Should I continue with the default callpeak options to keep the left mate (5’ end tag)?
Thank you!
What type of data is this, e.g. ChIP-seq?
That seems like a high duplication rate. Is this data enriched for telomere sequences?
-B and --SPMR are sufficient for generating signal files.
BAMPE format may make sense here, as it relies on where the reads are already mapped, so you won't be incorporating additional information anyway, and it should be at least as accurate as MACS3's method of estimating insert sizes from single-end data (which also relies on read alignments, but lacks the benefit of pairing).
BAMPE makes less sense if you're more interested in read ends, e.g. ATAC-seq, rather than the middle of fragments.
Overall, I'm thinking it would be good to perform parallel analysis with and without duplicates/multimappers to see if the downstream conclusions are fundamentally different.
Thanks @rfran010. I'm working with DNA that has been probe-selected using biotinylated telomeric repeats, which are then captured with streptavidin-coated magnetic beads. Would it be beneficial to calculate the Fraction of Reads in Peaks (FRiP) in this context? If I use --keep-dup=1, would you recommend using Samtools markdup or Picard's MarkDuplicates to filter the original BAM file so that FRiP is calculated accurately?
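As a rough illustration of what FRiP measures, here is a minimal, self-contained sketch on toy data. It assumes each read has been reduced to a single position (for example a fragment midpoint or 5' end) and that peaks are half-open intervals; the function names and the toy coordinates are invented for this example. In practice the reads would come from the BAM file and the peaks from the MACS3 output, and the denominator should be the same read set (with or without duplicates) that was used for peak calling.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(peaks):
    """Merge overlapping peaks per chromosome; return sorted starts and ends."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def frip(read_positions, peaks):
    """Fraction of reads whose position lies inside a peak [start, end)."""
    index = build_index(peaks)
    total = in_peaks = 0
    for chrom, pos in read_positions:
        total += 1
        if chrom not in index:
            continue
        starts, ends = index[chrom]
        i = bisect_right(starts, pos) - 1  # last peak starting at or before pos
        if i >= 0 and pos < ends[i]:
            in_peaks += 1
    return in_peaks / total if total else 0.0


# Toy data (hypothetical): three peaks, five reads.
peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
reads = [("chr1", 120), ("chr1", 299), ("chr1", 300), ("chr2", 10), ("chr3", 5)]
print(frip(reads, peaks))  # 3 of 5 reads fall in peaks
```

Computing this once on the full read set and once on the deduplicated set gives two FRiP values that can be compared directly.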
If these reads have a MAPQ of 0, then you are probably forced to convert to BEDPE format and work from that file, since (I guess) MACS will ignore reads with MAPQ = 0. Does it? I'm not sure, but many tools do.
I'm thinking ideally there would be UMIs. Since it's targeting an isolated, repetitive region, it's hard to determine if all the duplicates are from the repetitive, targeted nature of the assay, or, since general library complexity is lower, are they mostly introduced by PCR?
If you use samtools or MarkDuplicates, then you would choose --keep-dup all to avoid MACS3's duplicate finding mode. I'm not sure if one would perform better than the other. I tend to use MarkDuplicates.
You may also consider the --keep-dup auto option.
I still think running parallel analyses with and without duplicates would be good. It's hard to say if FRiP will be accurate or not, since that would depend on whether they are true duplicates or just fragments from the same region/with the same sequence.
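One way to organize the parallel runs is sketched below. It only builds and prints the command lines, using the flags discussed in this thread (-f BAMPE, -B, --SPMR, --keep-dup). The BAM file names, run names, and output directory are placeholders, no control sample is given, and the genome size is left at MACS3's default, so adjust those for the real data.

```python
import shlex


def callpeak_command(bam, name, keep_dup, outdir="macs3_out"):
    """Build a `macs3 callpeak` argument list for one duplicate-handling mode."""
    return [
        "macs3", "callpeak",
        "-t", bam,
        "-f", "BAMPE",              # use the mapped pairs to define fragments
        "-n", name,
        "--outdir", outdir,
        "-B", "--SPMR",             # bedGraph signal, scaled per million reads
        "--keep-dup", str(keep_dup),
    ]


# Hypothetical inputs: the original BAM, and a copy from which duplicates
# were already removed externally (e.g. with samtools or Picard).
runs = {
    "dups_kept": ("sample.bam", "all"),
    "macs3_dedup": ("sample.bam", "1"),
    "macs3_auto": ("sample.bam", "auto"),
    "external_dedup": ("sample.dedup.bam", "all"),
}

for name, (bam, keep_dup) in runs.items():
    print(shlex.join(callpeak_command(bam, name, keep_dup)))
```

Each printed line can be run in a shell, or the argument list can be passed to `subprocess.run(cmd, check=True)`. Comparing the peak sets and signal tracks across the four runs shows whether the downstream conclusions depend on how duplicates are handled.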
Another consideration that could be helpful: if the library was fragmented to below the general read length, then marking duplicates could be more reliable, as fragments/reads would have more random sizing and could be better distinguished as duplicates.
Thanks, I have used the options --keep-dup all and --keep-dup 1, without previously performing a filtering step with samtools or Picard. Would you recommend the auto option over the all option? Thanks for the support and assistance.