I’m analyzing Cut&Run data for a transcription factor with the following setup:
Multiple time points with replicates
IgG control processed identically as a BAM file as control for macs2
Libraries have excellent quality (0.7% duplication rate before peak calling)
Using MACS2 callpeak with BAMPE format and hg38 reference and also using a IgG control which i ran through the same steps until peak calling and then used it as a BAM file control against the treatment
Current filtering: Using samtools filter with less than 120bp size cutoff for TF analysis as per https://doi.org/10.1186/s13059-019-1802-4
Also filtering for “filters”: [ { “id”: “1”, “isProperPair”: “true”, “mapQuality”: “>=30”, “reference”: “!chrM” } ]
Main Issues:
Very low peak counts: Only 11-16 peaks per sample (expecting more)
No motifs found: MEME-ChIP returns no significant motifs
Also seeing low alignment percentage of about 50 - 70 across samples after bowtie2
High duplicate reads in both sample and control - 40 to 50 per cent before and after alignment. I performed dedup but many CUTRUN analysis methods mention to retain duplicates.
Questions:
Should I remove the <120bp size filter for Cut&Run peak calling?
Should i not remove duplicates? In some cut and run analysis I have seen that they recommend not to remove. https://doi.org/10.1186/s13059-019-1802-4
Are there better motif discovery tools than MEME-ChIP for novel TF targets with few peaks?
Could MACS2 parameters be too stringent?
Thank you for the insightful feedback. You're absolutely right, and I'd like to clarify the duplication rates:
Cut&Run data for a transcription factor with the following setup:
Multiple time points with replicates
IgG control processed identically as a BAM file as control for macs2
Raw sample files each had approximately 45 million reads and decent quality except high duplication rate in the samples (about 30 to 40 percent in most files). I have read that this is fairly common for CUT&RUN data but usually goes up to 10 to 15 percent.
Further cleanup was performed with Trimmomatic with a loss of about 10 percent total reads.
Alignment with bowtie2 to hg38 reference gave stats: (all bam are around these thresholds).
The dovetail setting was allowed.
About 50 to 60 percent of all (trimmed) reads were mapped.
I also filtered for: "isProperPair" "true", "mapQuality" ">=30", "reference" "not chrM" "Insert size less than 120".
The last filter was based on input from Henikoff et al., 2017 which mentions that less than 120 bp fragments represent cleavages around TF binding sites.
Post deduplication, there were about 10 to 15 million reads (with 0.7 percent duplication). Peak calling with macs2 (BAMPE, with IgG control BAM that I ran through the same steps) gave:
5 peaks for most samples. The third time point has 9 to 15 peaks per replicate. Cross-correlation graphs are weak.
I'm not too sure what to make of this data, especially since it looks like data quality is fine. Any insight or advice would be greatly appreciated.
As I said, check data on the IGV (for example use deeptools bamCoverage on the bm files with a bin size of 1 to make bigwigs), see whether peaks are present or whether it's mostly noise. That alone is the QC that is relevant. Seems like data is not usable from what I read.
I re-ran the CUT&RUN data analysis pipeline with adjusted filtering criteria and used a finer bin size of 5 for BAM coverage generation. The IgG control was processed through the same pipeline. I have here two files:
overall - https://drive.google.com/file/d/1HByi7IkcegxzLly1o2YJ5f5L87wrxTjR/view?pli=1
chr7 at a peak - https://drive.google.com/file/d/15UkyvDZkl6RKOI40ft6kZuv40J8rnJYg/view
I just used this specific location to display the peak calls vs the coverage as an example.
This is my first time analyzing such data, so I am unsure of how to interpret the peaks I obtained. I obtained about 50-200 peaks across the replicates. I am also trying to use bed tools intersect to see overlapping peaks.
Additional info:
Trimming removed about 10% of reads Alignment statistics were similar to the previous run - 60 per cent overall mapping Below is data of the correlation matrix after deduplication for one of the samples. Other samples also showed values of 0.7-0.8
As ATpoint has repeatedly said, actually look at the data. What does it look like in IGV? If you cannot see numerous obvious peaks by eye, then it is likely the assay failed or is of poor quality. Low numbers of peaks called is indicative of a failed assay.
People are also unlikely to download arbitrary files from dropbox or whatever, embed the images in your post.
I have inserted those images below for reference. I just need some help with understanding if this data makes sense or if it is noise. I do see peaks at locations where MACS2 has identified them, but yes, the number of peaks for each replicate areabout 50-200. I do not have previous literature for this dataset to see where I might be able to observe peaks.
Thanks.
Can you show an intermediate resolution, e.g. a few 100 kb of elements you'd expect this TF to bind (promoters, enhancers, whatevre)?
This looks highly artifactual to me, but the scale being rather blown out by the one huge peak makes it tough to tell.
This is around 16kb near the exon 1 of a gene of interest and all the replicates. However, MACS2 did not call a peak here.
I also have another image below, at a region where it did call a peak
![enter image description here][3]
There is mostly noise and no clear signal. It suggests that, as said before, IP was bad.
Is there any other way to confirm this (quantitatively)? I am trying to conclude this analyses/see if I can find useful information.
The low number of peaks called is your confirmation. You've visually confirmed there's no real issue with peak calling in your data - there are just no peaks to call. I don't know what else one would expect as confirmation.
I'd ensure your antibody actually works via orthogonal methods. And/or run a side by side with an antibody that your for sure know works for CUTandRUN the next time you try the assay.