One can do this with a mix of bedtools (version 2.18.2 or later) and basic scripting. The example below is based on the human genome, build hg19.
Step 1: Make a BED file of windows (say 100 Kb) across the genome.
bedtools makewindows -g hg19.chromsizes \
                     -w 100000 \
                     > hg19.100Kb.windows.bed
Step 2: Get the observed BAM coverage for each window (both files must be position-sorted for the -sorted option). The -c option counts the number of reads overlapping each window.
bedtools intersect -a hg19.100Kb.windows.bed \
                   -b aln.bam \
                   -c -sorted \
                   > hg19.100Kb.windows.counts.bedg
Step 3: Append the GC content for each window. The sixth column of the bedtools nuc output is the GC fraction; after the cut below it becomes the fifth column of the final file (the output name here is just an example).
bedtools nuc -fi hg19.fa \
             -bed hg19.100Kb.windows.counts.bedg \
| cut -f 1-4,6 \
> hg19.100Kb.windows.counts.gc.bedg
Now you need to normalize by GC content. A good way to do this is in two passes. First, loop through the file from step 3 and record the list of depths observed for each GC bin; rounding the GC fraction to two decimal places works well as a binning. For example, you might observe:
GC = 0.55. Depths = [235, 342, 288, etc.]
GC = 0.65. Depths = [110, 134, 155, etc.]
The counts within each GC bin should be roughly normally distributed. Compute the mean and standard deviation of the counts for each bin. Then, in a second pass through the file, convert the observed count for each genomic window into a Z-score: the number of standard deviations that the count lies from the mean count observed for all windows with the same GC (the fifth column of the file we created in step 3). For example, Z = 3 indicates that the count is 3 standard deviations above the mean for that GC bin, and Z = -3 means 3 standard deviations below it. The Z-score is computed as: (observed_depth - mean_depth_for_GC) / stddev_for_GC.
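The two-pass approach above can be sketched in plain Python. This is a minimal illustration, not the script mentioned below; the file layout (chrom, start, end, count, GC fraction) matches the output of step 3, and the rows are kept in memory rather than re-reading the file for the second pass.

```python
from collections import defaultdict
from math import sqrt

def gc_zscores(path):
    """Two-pass GC normalization: collect depths per GC bin,
    then convert each window's count to a Z-score within its bin."""
    # Pass 1: gather observed counts for each GC bin (2-decimal precision).
    depths = defaultdict(list)
    rows = []
    with open(path) as f:
        for line in f:
            chrom, start, end, count, gc = line.split()[:5]
            gc_bin = round(float(gc), 2)
            rows.append((chrom, start, end, int(count), gc_bin))
            depths[gc_bin].append(int(count))

    # Mean and standard deviation of the counts in each GC bin.
    stats = {}
    for gc_bin, counts in depths.items():
        mean = sum(counts) / len(counts)
        var = sum((c - mean) ** 2 for c in counts) / len(counts)
        stats[gc_bin] = (mean, sqrt(var))

    # Pass 2: Z = (observed_depth - mean_depth_for_GC) / stddev_for_GC.
    for chrom, start, end, count, gc_bin in rows:
        mean, sd = stats[gc_bin]
        z = (count - mean) / sd if sd > 0 else 0.0
        yield chrom, start, end, z
```

Windows whose GC bin has zero variance are reported as Z = 0 here; in practice you may prefer to drop bins with too few windows to estimate a stable mean and standard deviation.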
Here is a little script I wrote a few months ago for this. It uses the pybedtools package to combine Python with the bedtools concepts outlined above. Hopefully it will give you a sense of how to implement the above strategy, or how to adapt it to your specific needs.