Question: Compute Binned Gc-Normalized Read Counts From Bam File
gravatar for Christian
6.3 years ago by
Cambridge, US
Christian2.9k wrote:

Binned read counts can be readily obtained from BAM files using for example 'samtools depth' or bedtools. However, I would be interested in getting binned read counts normalized by GC content, because GC content correction is clearly important to accurately quantify fragment abundance (see for example

I know that there are several solutions out there to do this, but they seem to be quite involved.

What is the most straightforward way (i.e. with the least lines of code) to get GC normalized read counts from BAM files? Any programming language would be ok. I am working on Illumina paired-end reads and human samples.

ADD COMMENTlink modified 6.3 years ago by Aaronquinlan11k • written 6.3 years ago by Christian2.9k
gravatar for Aaronquinlan
6.3 years ago by
United States
Aaronquinlan11k wrote:

One can do this with a mix of bedtools (version 2.18.2 and later) and basic scripting. The example is based on human, build hg19.

Make a BED file of windows (say 100Kb) across the genome.

bedtools makewindows -g hg19.chromsizes \
                     -w 100000 \

Get the observed BAM coverage for each window (your files must be position-sorted). The -c option counts the number of reads overlapping each window.

bedtools intersect -a \
                   -b aln.bam \
                   -c -sorted \

Append GC content for each window. The sixth column of the output will be the GC content.

bedtools nuc -fi hg19.fa \
             -bed \
| cut -f 1-4,6 \

Now you need to normalize by GC. The best way to do this is in two pases. First, loop through the file above and keep a list of the depths observed for each GC bin. I would break the GC into bins with two decimal precision. For example, tracking this, you might observe:

GC = 0.55. Depths = [235, 342, 288, etc.]
GC = 0.65. Depths = [110, 134, 155, etc.]

The counts for each bin should be roughly normally distributed. Thus, if you compute the mean and standard deviation of the counts for each GC bin above, then, in a second pass through the file, you can convert the observed count for each genomic window into a Z-score, reflecting the number of standard deviations that the count is from the mean count observed for all genomic windows with the same GC (the fifth column in the file we created in step 3). For example, Z=3 indicates that the count is 3 standard deviations higher than the mean for that GC bin. Negative 3 means 3 standard deviations lower. One would compute the Z score as follows: (observed_depth - mean_depth_for_GC) / stddev_for_GC.

Here is a little script I wrote a few months ago for this. It uses the pybedtools package to combine Python with the bedtools concepts outlined above. Hopefully this will provide a sense of how to go about the above strategy or modify it to your specific needs.

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Aaronquinlan11k

This looks quite slick and useful, thanks! However, I can't get the 'bedtools intersect' to work, because it does not accept a bam file as the second file (-b parameter). I am using bedtools version 2.17.0. Also, any ideas how to get normalized read counts instead of Z scores?

ADD REPLYlink written 6.3 years ago by Christian2.9k

Sorry, you must use version 2.18.2 or later. Preferably 2.19, which was just released the other day. Will edit the answer.

ADD REPLYlink written 6.3 years ago by Aaronquinlan11k

I have used bedtools bam2bed result as a pipe and it works.

ADD REPLYlink written 5.9 years ago by zongzhi.liu0

It is possible, that bedtools nuc counting CG content from fasta - no from BAM file? Do you have any idea how to calculate cg content from BAM??? Thank you for an advice...

ADD REPLYlink written 5.5 years ago by Paul1.4k

Thank you for good command

But I can not run this command 

bedtools intersect -a \
                   -b aln.bam \
                   -c -sorted \

have error this below

***** WARNING: File 55-096.sorted_1.bam has inconsistent naming convention for record:

1    10000    10041    MG00HS09:556:C59KEACXX:7:2307:13289:74902/1    12    +

ERROR: Sort order was unspecified, and file 55-096.sorted_1.bam is not sorted lexicographically.

       Please re-reun with the -g option for a genome file.

       See documentation for details.

                                                            I'm bioinformatics student please suggest for me


ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by tanhugratay0

Can I compute mappability like GC content from the bam file?

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by fatima.m.zare20

I am wondering how can I remove GC content bias with Loess regression?

ADD REPLYlink written 3.3 years ago by fatima.m.zare20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1134 users visited in the last hour