I am doing ATAC-seq analysis and my samples are not of the same quality or read depth. I read the 2018 Science paper, which provides a very good way to account for both sample quality and read depth. What I want to do is create ATAC-seq signal tracks for IGV. Below is what they did:
Their purpose: "To visualize our ATAC-seq data genome-wide we used ATAC-seq signal tracks that have been normalized by the number of reads in peaks."
For reads-in-peaks normalization: 1. The genome was binned into 100-bp intervals using "tile" in GenomicRanges on the chromosome sizes in R. # This is to get the bins, which I think are quite small, but that doesn't matter;
2. The insertion sites (GenomicRanges) were then converted into a coverage run-length encoding using "coverage". # Not entirely sure what they are doing here;
3. Then, to determine the number of Tn5 insertions within each bin, we constructed a "Views" object and calculated the sum in each bin with "viewSums". # Sum the total number of insertions within each bin.
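For what it's worth, here is how I currently understand steps 1-3 in R. The chromosome sizes and insertion sites below are made-up toy data, and I am assuming tileGenome is the intended way to bin the chromosome sizes (the paper only says "tile"):

```r
library(GenomicRanges)

## Toy data (made up): one chromosome and a few 1-bp Tn5 insertion sites
chrom_sizes <- c(chr1 = 1000)
insertions <- GRanges("chr1", IRanges(c(5, 50, 150, 155, 820), width = 1),
                      seqlengths = chrom_sizes)

## Step 1: bin the genome into 100-bp intervals
bins <- tileGenome(chrom_sizes, tilewidth = 100, cut.last.tile.in.chrom = TRUE)

## Step 2: per-base insertion counts as a run-length encoding
cov <- coverage(insertions)   # RleList, one Rle per chromosome

## Step 3: sum the insertions falling in each bin with Views/viewSums
v <- Views(cov[["chr1"]], ranges(bins[seqnames(bins) == "chr1"]))
bin_counts <- viewSums(v)     # one insertion count per 100-bp bin
```

This gives the per-bin insertion counts that would then be scaled and exported as a track.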
And finally, the step that confuses me the most is: "We then normalized the total number of reads by a scale factor that converted all samples to a constant 30 million reads within peaks. This approach simultaneously normalizes samples by their quality and read depth, analogous to the reads in peaks normalization within a counts matrix."
This kind of normalization is somewhat like CPM (counts per million). But what is the scale factor they used to scale the total number of reads? Say we have 20 million reads in peaks for sample1 and 40 million for sample2. Is it simply, for sample1, total reads / (20/30), and for sample2, total reads / (40/30)?
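To make my guess concrete, this is the arithmetic I have in mind (the 20M/40M numbers are the hypothetical example above, not real data):

```r
## Hypothetical reads-in-peaks counts for two samples (made up)
reads_in_peaks <- c(sample1 = 20e6, sample2 = 40e6)

## My guess at the scale factor: bring every sample to 30M reads in peaks
scale_factor <- 30e6 / reads_in_peaks   # sample1: 1.5, sample2: 0.75

## The per-bin insertion counts would then be multiplied by this factor
## before export, i.e. dividing by (reads_in_peaks / 30e6),
## which is the same as total / (20/30) in my question
```

Is that multiplication what they mean, applied to the binned counts before writing the track?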
Any suggestions would be greatly appreciated!