I recently generated ATACseq data for 10 patient samples, but their visualization of RPGC normalized bigwig file on IGV is too different from one another when the data range is group-autoscaled. In contrast, when they were individually autoscaled, it shows comparable size of consensus peak and differentially expressed peaks in the patient group as we expected to see.
The example of it is shown in the attached picture.
I assume this variance came from varying library complexity caused by different number of cells used during the library prep. My PI wants to reduce the difference between these samples, so I tried to normalize bam files of these samples by using bamCoverage and scale factor, calculated from DEseq normalization, following the Greenleaf lab’s atac-seq analysis workflow.
Unfortunately this normalization method did not make the bigwig file look that much similar.
Could you let me know if there is better way to normalize these samples?
when I first used bamCoverage, I used following command.
$BAM_COVERAGE -b ${sample_name}.final.bam -o ${sample_name}.normTo1x.bw -bs 10 --normalizeTo1x $BAM_COVERAGE_NRM_TO_1X --blackListFileName $BLACKLIST_REGIONS --ignoreForNormalization chrX chrY --ignoreDuplicates -p $NTHREADS
Then the result was like the left panel of the picture above.
so I decided to use scale factor from DEseq normalization. I used following R code.
Finding Consensus regions and regions to count
myGRangesList_cov_cd14 <- GRangesList(myPeaks_cov_cd14)
reduced_cov_cd14 <- reduce(unlist(myGRangesList_cov_cd14))
consensusIDs_cov_cd14 <- paste0("consensus_", seq(1, length(reduced_cov_cd14)))
mcols(reduced_cov_cd14) <- do.call(cbind, lapply(myGRangesList_cov_cd14, function(x) (reduced_cov_cd14 %over% x) + 0))
reducedConsensus_cov_cd14 <- reduced_cov_cd14
mcols(reducedConsensus_cov_cd14) <- cbind(as.data.frame(mcols(reducedConsensus_cov_cd14)), consensusIDs_cov_cd14)
consensusIDs_cov_cd14 <- paste0("consensus_", seq(1, length(reducedConsensus_cov_cd14)))
return(reducedConsensus_cov_cd14)
consensusToCount_cov_cd14 <- reducedConsensus_cov_cd14
consensusToCount_cov_cd14 <- consensusToCount_cov_cd14[!consensusToCount_cov_cd14 %over% blklist & !seqnames(consensusToCount_cov_cd14) %in% "chrM"]
regionsToCount_cov_cd14 <- data.frame(GeneID = paste("ID", seqnames(consensusToCount_cov_cd14), start(consensusToCount_cov_cd14), end(consensusToCount_cov_cd14), sep = "_"), Chr = seqnames(consensusToCount_cov_cd14), Start = start(consensusToCount_cov_cd14), End = end(consensusToCount_cov_cd14), Strand = strand(consensusToCount_cov_cd14))
Counting..
fcResults_cov_cd14 <- featureCounts(bamsToCount_cov_cd14, annot.ext = regionsToCount_cov_cd14, isPairedEnd = TRUE, countMultiMappingReads = FALSE, maxFragLength = 100)
myCounts_cov_cd14 <- fcResults_cov_cd14$counts
normalization..
metaData_cov_cd14 <- data.frame(Group_cov_cd14, row.names = colnames(myCounts_cov_cd14))
atacDDS_cov_cd14 <- DESeqDataSetFromMatrix(myCounts_cov_cd14, metaData_cov_cd14, ~Group_cov_cd14, rowRanges = consensusToCount_cov_cd14)
atacDDS_cov_cd14 <- DESeq(atacDDS_cov_cd14)
sizeFactors(atacDDS_cov_cd14)
result is..
cov17 cov20 cov21 cov47 cov72 cov96 HD243
1.0550277 1.0587087 1.9685292 0.7116851 0.7409899 1.0541563 1.0426742
Using this size factors I ran bamCoverage again.
bamCoverage -b cov47_S1_merged.final.bam -o cov47_S1_merged.deseqSF.bw -bs 10 --scaleFactor 1.9685292 --ignoreForNormalization chrX chrY --ignoreDuplicates -p 2
Some samples became similar to each other, but overall, they still show huge variance in data range on IGV viewer.
How can I address this problem? Thanks!
Be aware that bamCoverage multiplies the raw coverage with the provided scale factor whereas DESeq2 divides it. You have to take the reciprocal value of the factors and feed this into bamCoverage. I also recommend inspecting normalization between pairs of samples using MA-plots. A properly normalized pair should have the majority of data points centered along y = 0. YOu obviously have notable differences in signal-to-noise ratio based on these plots so it is not surprising that the per-million scaling from bamCoverage performs poorly. DESeq2 (if used properly) should actually work well. Try with the reciprocal values.
I really appreciate your comment. I tried using the reciprocal value of size factor from DEseq in BamCoverage, but they look very similar to the previously normalized set of data, except for the data range is increased for all samples. Could using consensus peak regions as counting regions during DEseq normalization affect the values for the size factor?
Please see How to add images to a Biostars post to add your images properly. You'll need to use a host like imgbb, which allows public access to images via a URL. URLs that need authentication will not be useful.