We want to do a meta-analysis of previously analyzed ATAC-seq data from different cell types. These ATAC-seq libraries were sequenced in different batches (some libraries are 100 bp PE while others are 50 bp PE), and the deduplicated read counts also vary a lot:
| cell type | biological replicate | dedup read number | read length |
|-----------|----------------------|-------------------|-------------|
| cell1     | rep1                 | 3,031,071         | 50 bp       |
| cell1     | rep2                 | 4,964,107         | 50 bp       |
| cell2     | rep1                 | 596,001           | 100 bp      |
| cell2     | rep2                 | 1,031,791         | 100 bp      |
| cell3     | rep1                 | 1,303,817         | 50 bp       |
| cell3     | rep2                 | 1,030,124         | 50 bp      |
| cell4     | rep1                 | 979,684           | 100 bp      |
| cell4     | rep2                 | 720,355           | 100 bp      |
| cell5     | rep1                 | 563,108           | 100 bp      |
| cell5     | rep2                 | 537,466           | 100 bp      |
Common sense says that more dedup reads produce more peaks (assuming the ATAC-seq noise background is very consistent across libraries). I therefore want to down-sample the dedup reads before peak calling (MACS2) so that library size is consistent from sample to sample. However, since read length also varies among samples, equalizing the dedup read number alone is not enough: at the same read number, longer reads give higher genome coverage than shorter reads. Thus, I am wondering: when I do the downsampling, should I take read length into consideration?
For example, if I want to sub-sample the 100 bp libraries to 550K dedup reads, should I sub-sample the 50 bp libraries to 1,100K reads instead, so that both end up with the same number of sequenced bases?
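To make the idea concrete, here is a minimal sketch of the read-length-aware calculation, using the read counts and lengths from the table above and the 550K/100 bp target from my example. It converts a base budget into a per-library target read count and the corresponding subsampling fraction (e.g., the fraction you would pass to `samtools view -s`). The dictionary keys and the `target_bases` figure are just illustrative choices, not a fixed pipeline:

```python
# Sketch: choose per-library downsampling fractions so that every library
# contributes the same number of sequenced bases (read count x read length),
# rather than the same number of reads.

libraries = {
    # name: (dedup_read_count, read_length_bp) -- from the table above
    "cell1_rep1": (3_031_071, 50),
    "cell1_rep2": (4_964_107, 50),
    "cell2_rep1": (596_001, 100),
    "cell2_rep2": (1_031_791, 100),
    "cell3_rep1": (1_303_817, 50),
    "cell3_rep2": (1_030_124, 50),
    "cell4_rep1": (979_684, 100),
    "cell4_rep2": (720_355, 100),
    "cell5_rep1": (563_108, 100),
    "cell5_rep2": (537_466, 100),
}

# Base budget: 550K reads at 100 bp = 55,000,000 bases per library.
target_bases = 550_000 * 100

for name, (reads, length) in libraries.items():
    # 50 bp libraries need twice the reads of 100 bp libraries.
    target_reads = target_bases // length
    # Fraction of reads to keep; capped at 1.0 for libraries that are
    # already below the target (e.g. cell5_rep2 here).
    fraction = min(1.0, target_reads / reads)
    print(f"{name}: keep {target_reads:,} reads (fraction {fraction:.3f})")
```

Note that with this particular budget, cell5_rep2 (537,466 reads) falls slightly short of the 550K target, so in practice the base budget would have to be set at or below the smallest library's base count.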
I am also open to any better solutions to this problem.