Compute Binned Gc-Normalized Read Counts From Bam File
1
8
Entering edit mode
10.2 years ago
Christian ★ 3.0k

Binned read counts can be readily obtained from BAM files using for example 'samtools depth' or bedtools. However, I would be interested in getting binned read counts normalized by GC content, because GC content correction is clearly important to accurately quantify fragment abundance (see for example http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3378858/).

I know that there are several solutions out there to do this, but they seem to be quite involved.

What is the most straightforward way (i.e. with the least lines of code) to get GC normalized read counts from BAM files? Any programming language would be ok. I am working on Illumina paired-end reads and human samples.

gc normalization depth-of-coverage • 11k views
ADD COMMENT
14
Entering edit mode
10.2 years ago

One can do this with a mix of bedtools (version 2.18.2 and later) and basic scripting. The example is based on human, build hg19.

Make a BED file of windows (say 100Kb) across the genome.

bedtools makewindows -g hg19.chromsizes \
                     -w 100000 \
> hg19.100Kb.windows.bed

Get the observed BAM coverage for each window (your files must be position-sorted). The -c option counts the number of reads overlapping each window.

bedtools intersect -a hg19.100Kb.windows.bed \
                   -b aln.bam \
                   -c -sorted \
>  hg19.100Kb.windows.counts.bedg

Append GC content for each window. The sixth column of the output will be the GC content.

bedtools nuc -fi hg19.fa \
             -bed hg19.100Kb.windows.counts.bedg \
| cut -f 1-4,6 \
>  hg19.100Kb.windows.counts.gc.bedg

Now you need to normalize by GC. The best way to do this is in two pases. First, loop through the file above and keep a list of the depths observed for each GC bin. I would break the GC into bins with two decimal precision. For example, tracking this, you might observe:

GC = 0.55. Depths = [235, 342, 288, etc.]
GC = 0.65. Depths = [110, 134, 155, etc.]

The counts for each bin should be roughly normally distributed. Thus, if you compute the mean and standard deviation of the counts for each GC bin above, then, in a second pass through the file, you can convert the observed count for each genomic window into a Z-score, reflecting the number of standard deviations that the count is from the mean count observed for all genomic windows with the same GC (the fifth column in the file we created in step 3). For example, Z=3 indicates that the count is 3 standard deviations higher than the mean for that GC bin. Negative 3 means 3 standard deviations lower. One would compute the Z score as follows: (observed_depth - mean_depth_for_GC) / stddev_for_GC.

Here is a little script I wrote a few months ago for this. It uses the pybedtools package to combine Python with the bedtools concepts outlined above. Hopefully this will provide a sense of how to go about the above strategy or modify it to your specific needs.

https://github.com/arq5x/codachrom/blob/master/windowed_coverage.py

ADD COMMENT
0
Entering edit mode

This looks quite slick and useful, thanks! However, I can't get the 'bedtools intersect' to work, because it does not accept a bam file as the second file (-b parameter). I am using bedtools version 2.17.0. Also, any ideas how to get normalized read counts instead of Z scores?

ADD REPLY
0
Entering edit mode

Sorry, you must use version 2.18.2 or later. Preferably 2.19, which was just released the other day. Will edit the answer.

ADD REPLY
0
Entering edit mode

I have used bedtools bam2bed result as a pipe and it works.

ADD REPLY
0
Entering edit mode

It is possible, that bedtools nuc counting CG content from fasta - no from BAM file? Do you have any idea how to calculate cg content from BAM??? Thank you for an advice...

ADD REPLY
0
Entering edit mode

Thank you for good command. But I cannot run this command

bedtools intersect -a hg19.100Kb.windows.bed \
                   -b aln.bam \
                   -c -sorted \
>  hg19.100Kb.windows.counts.bedg

I have this error below

***** WARNING: File 55-096.sorted_1.bam has inconsistent naming convention for record:
1    10000    10041    MG00HS09:556:C59KEACXX:7:2307:13289:74902/1    12    +
ERROR: Sort order was unspecified, and file 55-096.sorted_1.bam is not sorted lexicographically.
       Please re-reun with the -g option for a genome file.
       See documentation for details.

I'm bioinformatics student please suggest for me

ADD REPLY
0
Entering edit mode

Can I compute mappability like GC content from the bam file?

ADD REPLY
0
Entering edit mode

I am wondering how can I remove GC content bias with Loess regression?

ADD REPLY

Login before adding your answer.

Traffic: 3221 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6