Low number of peaks and weird TSS profile
1
0
Entering edit mode
4.3 years ago
mickey_95 ▴ 110

Hi,

I am looking at ATAC-seq data of good/decent quality. I used MACS2 for peak calling and was surprised to see los numbers of peaks being called - the sample with the least peaks has only 700 peaks...I observed another strange thing when plotting peaks in TSS regions: Instead of a profile with a peak at the TSS, I have a double peak with a small dip inbetween that falls right on the TSS. Has someone seen such a profile in ATAC-seq data? I am not very experienced with ATAC data analysis and appreciate any input!

ATAC-seq • 3.4k views
ADD COMMENT
0
Entering edit mode

Good data quality and 700 peaks are mutually exclusive in my experience. How do you define "good quality", is this human/mouse? Can you share a screenshot from a genome browser? Please also share that plot you mention.

ADD REPLY
0
Entering edit mode

Again a disclaimer that I am unexperienced, so apologies if I am missing something obvious or doing something wrong. I want to learn and improve, so I'm happy about help :)

It is data from mouse. Good quality: PCR duplication levels in the range expected for ATAC (20-50%), high alignment rates (>98% of reads), fragment size distribution as described in the original ATAC paper.

For the TSS plot I used ChIPseeker's getTagMatrix and plotAvgProf functions.

ADD REPLY
0
Entering edit mode
4.3 years ago
ATpoint 81k

My two cents:

I find the most meaningful metric to judge data quality is the FRiP = fraction of reads per peak. In other words what percentage of reads overlaps callable peaks and what percentage doesn't. With 700 peaks FRiPs will be very very low. I personally think that none of the metrics you use is informative. High alignment percentage is good but does not tell you about data quality but only if you have sufficient read length and no contamination frrom foreign DNA. Duplication levels are not really a quality criterium in terms of "it tells you if the sample is good". It only tells you if something is terribly wrong, not if it is good. Is this standard ATAC-seq like in the 2013 paper from J. Buenrostro or a more recent protocol? I suggest you use OmniATAC in the future for the wetlab part which is by far the best I've used so far. Gives superb data quality if the wetlab part is done properly and input cells were nice and viable.

About this TSS plot I do not know what it might tell or not, I do not do it as a QC metric.

In summary, 700 peaks indicates low data quality. Can you give some details about the experiment?

Not sure if this helps you but this is how this locus looks in an excellent ATAC-seq sample from murine bone marrow cells. High peaks and little to no observable noise at that scale in the non-peak regions. Yours looks a bit noise to be honest. As said, please share some details on the experiment so maybe we can debug.

enter image description here

I suggest you convert your bam to bigwig with bamCoverage from deeptools like:

bamCoverage -bs 1 -o out.bigwig --bam in.bam so that you can zoom out more and get a better overall impression. If you do that please also share a zoomed-out screenshot, say a 150kb window on IGV at this locus here. Please also share the command line for peak calling. How many reads are in that sample?

ADD COMMENT
0
Entering edit mode

Thank you for the quick reply and your helpful input!

Ah, I forgot to mention my FRiP scores, they range between 5 and 10% for all samples. I'll convert to bigwig as you suggested and will upload a screenshot in a bit.

We used the Buenrostro 2013 protocol. As for the experimental set-up: We have mouse liver tissue, WT and 2 different treatments, 3 biological replicates per condition and we used 100,000 cells per sample. I am not familiar with OmniATAC, have you also used it for tissue? Also, in your experience with ATAC data, how many unique reads and peaks do you usually get? And FRiP? Just to get a feeling what is a good dataset.

ADD REPLY
0
Entering edit mode

When using OmniATAC on freshly sorted cells from bone marrow I get FRiPs between 40-60% and something between 90.000 - 150.000 peaks. Still absolute peak numbers may vary depending on peak caller but it should be somewhat 50k ad more if things work in any case. This I found fairly reproducible when using Omni, regardless of species (mouse/human) and fresh or cell line material. Typically use 50k cells, but it also works with fewer like 10k cells. 100k is actually a lot. I strongly recommend switching protocols. Omni is far superior to the original protocol. It notably increases data quality and reduces amounts of mitochondrial reads to < 5% in my experience. No downsides with this protocol and actually no reason to use the "old" one given that this excellent alternative exists.

How did you calculate the FRiPs? With only 700 peaks I can hardly imagine this gets to 10%.

ADD REPLY
0
Entering edit mode

Wow, this sounds great, I will definitely look more into it and think about setting up an OmniATAC experiment soon

I calculated FRiPs using ChIPQC::QCmetrics. How do you do it?

ADD REPLY
0
Entering edit mode

I typically use featureCounts. It makes a count matrix but also outputs the percentage of reads assignable to those peaks so I simply take it from there.

ADD REPLY
0
Entering edit mode

Here is a zoomed-out screenshot from two of my ATAC samples (after converting to bigwig, thanks for the recommendation!). The number of mapped reads for the two samples is 50 mio and 65 mio, respectively.

Before peak calling I removed PCR duplicates and mitochondrial reads. Do you recommend any additional filtering? For peak calling I used the following command:

macs2 callpeak -t input.bam -f BAMPE -g mm -q 0.05 --keep-dup 1 --bdg

enter image description here

ADD REPLY
0
Entering edit mode

Hmm, looks actually ok. I mean it is not the most beautiful track I've ever seen but definitely not bad Fits this FRiP score you reported I guess. Given that the rest of the genome looks somewhat similar it is odd that you get only 700 peaks. I cannot get my head around why without seening the data myself, sorry.

For FRiP, see also A: FRIP score ATAC-seq if you want.

ADD REPLY
0
Entering edit mode

Thanks for all your help! Following your suggestion, I used featureCounts to calculate FRiP. I followed the code in your post you referenced, with one slight modification I had to make because featureCounts was throwing an error about the input SAF file - it required 5 columns with one of them being the chr column. Did you also have the same problem?

awk 'OFS="\t" {print $1"-"$2+1"-"$3, $1, $2+1, $3, "."}' peaks.narrowPeak > peaks.saf
featureCounts -p -a peaks.saf -F SAF -o out.txt input.bam

Is this correct? The FRiP I got with this is 6.9%, the one from ChIPQC::QCmetrics was around 9%. What I noticed when looking at the featureCounts output, however, was that the number of total alignments was almost half the number of mapped reads in my input bam file (checked with samtools). Shouldn't those two numbers be the same?

Another thing I noticed when looking at the read numbers in my bam files: I previously said that I had 50/60 mio. mapped reads, but that was before filtering duplicates and mito reads. After filtering I am left with 10-20 mio. reads...I guess that would be the explanation for the low peak numbers?

ADD REPLY

Login before adding your answer.

Traffic: 2971 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6