Question: ATAC-seq coverage tracks using bamCoverage from Deeptools
0
gravatar for mickey_95
5 months ago by
mickey_9510
mickey_9510 wrote:

Hello!

I have recently started to use Deeptools for my ATAC-seq data analysis and would like to get some feedback on the conversion of the BAM files to bigwig coverage tracks. I am interested in looking at TSS metaprofiles using either all fragments or fragments of only certain sizes.

For an all-inclusive coverage:

bamCoverage --bam in.bam --outFileName out.bw --binSize 1 --normalizeUsing RPGC --effectiveGenomeSize x

Am I correct in normalizing to 1x coverage? From what I have seen in the literature, this is what most people are doing with ATAC-seq data. Also, I am uncertain about the effective genome size, which depends on read length. I have trimmed the Tn5 adapter sequence prior to alignment and have now varying read lengths.

For nucleosome-free regions:

bamCoverage --bam in.bam --outFileName out.bw --binSize 1 --normalizeUsing RPGC --effectiveGenomeSize x --skipNonCoveredRegions --extendReads --maxFragmentLength 120

I have seen that some people also use --minFragmentLength 40, but it doesn't seem to have a great effect when I look at the metaprofiles or in the genome browser. What is your experience? Also, what are your thoughts on skipping non-covered regions and extending the reads to full fragments? Especially the latter confuses me quite a bit. Does the extension result in the entire fragment being counted? If I don't specify this option, is only the 5' read end counted? I am not quite understanding which part of the read is counted by the tool. Shouldn't each read be only counted once instead of throughout its length?

For mononucleosome fragments:

bamCoverage --bam in.bam --outFileName out.bw --binSize 1 --normalizeUsing RPGC --effectiveGenomeSize x --MNase

In this specific case, people seem to not skip non-covered regions - why is that? In my understanding the above code creates files where only the center (3 bp) of fragments between 130 and 200 bp is considered. This reveals the position of the nucleosomes, correct? What about focusing on Tn5 cut-sites? Would I need to use --Offset 1? How exactly?

These are quite a lot of questions. It's quite hard for me to judge what is commonly used for those specific purposes mentioned above. Thanks to anyone who has taken the time to read this and I appreciate all input and help!

ADD COMMENTlink modified 5 months ago • written 5 months ago by mickey_9510
1
gravatar for ATpoint
5 months ago by
ATpoint41k
Germany
ATpoint41k wrote:

I personally prefer to normalize my ATAC-seq (or any RNA-seq/DNA-seq) data with scaling factors from edgeR since this will correct for both library size (as in 1x normalization, and library composition). You'd meed a count matrix typically created from the union peaks of your samples for that. Here is a suggestion on how to apply the normalization. You could use the scaling factors for all tracks regardless how you filter for size.

A: ATAC-seq sample normalization (quantil normalization)

ADD COMMENTlink modified 5 months ago • written 5 months ago by ATpoint41k

Thanks a lot for your help! I will try this way of normalization then. I've also seen people downsample their BAM files to the read number of the lowest sample. What is your take on that?

Speaking of normalization and varying library sizes, I have a rather general question: When pooling different sequencing libraries with different adapters, it frequently happens that one sample gets many more reads compared to the other samples. Why is that? Is this simply due to biological variability among replicates or does some technical effect from the sequencing platform contribute to this?

ADD REPLYlink written 5 months ago by mickey_9510

I've also seen people downsample their BAM files to the read number of the lowest sample. What is your take on that?

That is effectively the same as per-million (1x) normalization and suffers from the same problems as mentioned above.

When pooling different sequencing libraries with different adapters, it frequently happens that one sample gets many more reads compared to the other samples. Why is that?

That happens if the libraries were not loaded at equal concentrations to the sequencer. A proper quantification prior to pooling samples to make the final sequencing pool is necessary to avoid that. This has nothing to do with biological effects, pure technical shortcoming.

ADD REPLYlink written 5 months ago by ATpoint41k

Ah, good to know. Thanks a lot!

ADD REPLYlink written 5 months ago by mickey_9510
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1428 users visited in the last hour