Question: Calculating GC content from BAM file by BED file
3
gravatar for Paul
4.0 years ago by
Paul1.2k
European Union
Paul1.2k wrote:

Dear all,

I have bedg file created from bedtools -

chr1 1  10 num_of_reads

chr1 10 20 num_of_reads

chr1 20 30 num_of_reads

.

.

.

3th column is information about overlapping reads at this region. I would like to add to 4th column information about GC content in reads from my BAM file. Do you have any idea how to do that? Thank you for any help.

So result should be:

ch start stop num_of_reads cg_content
bedtools bam bed cg content • 4.8k views
ADD COMMENTlink modified 4.0 years ago by Aaronquinlan10k • written 4.0 years ago by Paul1.2k
1

It is important consider that such measurements will be biased by coverage 

ADD REPLYlink written 4.0 years ago by Aaronquinlan10k

Should the cg_content column be expressed as a percentage of bases which are G or C?

ADD REPLYlink written 4.0 years ago by Dan D6.6k

better would be just a number - cg content - I now how to calculate separately from bam -

awk '{ n=length($10); print $10, gsub(/[GCCgcs]/,"",$10)/n;}'   your.sam , but I need to calculate according BED file.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Paul1.2k
3
gravatar for Dan D
4.0 years ago by
Dan D6.6k
Tennessee
Dan D6.6k wrote:

Here's a python solution. You need to have the pysam library installed (pip install pysam):

import pysam
from sys import argv

bam_file = argv[1]
bed_file = argv[2]

bam = pysam.AlignmentFile(bam_file, 'rb')
with open(bed_file) as bf:
    for line in bf:
        line_parts = line.strip().split()
        chr = line_parts[0]
        start = int(line_parts[1])
        end = int(line_parts[2])
        read_data = bam.fetch(chr, start, end)
        total_bases = 0
        gc_bases = 0
        for read in read_data:
            seq = read.query_sequence
            total_bases += len(seq)
            gc_bases += len([x for x in seq if x == 'C' or x == 'G'])
        if total_bases == 0:
            gc_percent = 'No Reads'
        else:
            gc_percent = '{0:.2f}%'.format(float(gc_bases)/total_bases * 100)
        print '{0}\t{1}'.format(line.strip(), gc_percent)

I didn't understand your comment to my question. If you want to print the total number of GC bases in the reads of a coordinate set, just change the last line to:

print '{0}\t{1}'.format(line.strip(), gc_bases)

You run the program like this:

python [script_name] [bam_file] [bed_file]

It prints the modified BED data to STDOUT.

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Dan D6.6k
Dear Dan, this is looks very nice, but I have error:

AttributeError: 'module' object has no attribute 'AlignmentFile'


Do you have any idea? Thank you for response..

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Paul1.2k

You need to install the pysam library. You can download it from github and install it manually, or if you have the Python package manager `pip` installed, you can type `pip install pysam` from the command line to install it.

ADD REPLYlink written 4.0 years ago by Dan D6.6k

Thank you for help - it is installed. But when I try to run I have following error:

Traceback (most recent call last):
  File "cg_content.py", line 10, in <module>
    bam = pysam.AlignmentFile(bam_file.bam, 'rb')
AttributeError: 'module' object has no attribute 'AlignmentFile'

Do you know where could be a problem?

ADD REPLYlink written 4.0 years ago by Paul1.2k

That's strange. Are you sure it's properly installed? How did you install it? The AlignmentFile() method is central to the pysam library. If it's not available it implies that the installation of the library is somehow incomplete or incorrect.

ADD REPLYlink written 4.0 years ago by Dan D6.6k

Dear Dan, thank you so much - I did not installed pyrex and Sphinx - right now it looks it work perfectly fine. May I ask - in some intervals there is no reads = 0 - I have at the end error bug:

gc_percent = float(gc_bases)/total_bases * 100
ZeroDivisionError: float division by zero

Is there any solution how to avoid in python to divide by zero? Very very thank you for your time and help!!!

ADD REPLYlink written 4.0 years ago by Paul1.2k

Oops, hahaha. That was clumsy of me. Check my updated answer for the fix!

ADD REPLYlink written 4.0 years ago by Dan D6.6k

Dan, thank you SOO MUCH - it is now perfect!

ADD REPLYlink written 4.0 years ago by Paul1.2k

Make sure you are using the most current version of pysam

ADD REPLYlink written 3.8 years ago by lkmklsmn860

Just wanted to add a link to this website, which can help debugging AlignmentFile() issues:

ADD REPLYlink written 2.8 years ago by dvanic240

I am wondering why the GC content is calculated here for a BAM file and not for example from reference genome. Does this imply that if one has 100 samples, she has to calculate the GC content for each sample separately?

ADD REPLYlink written 20 months ago by Dataman260
2
gravatar for Alex Reynolds
4.0 years ago by
Alex Reynolds26k
Seattle, WA USA
Alex Reynolds26k wrote:

Building off my previous answer ( A: How To Calculate A Region‘ Gc Content In Reference Genome?? ), you could first convert the BAM reads into FASTA with samtools and awk, then pipe that result to a second awk script that converts FASTA to a BED-like result that includes columns with per-FASTA element GC content:

$ samtools view reads.bam \
    | awk ' \
    BEGIN { \
        OFS="\t"; \
    } \
    { \
        print ">"$3"-"($4-1)"-"($4+$9)"\n"$10 \
    }' - \
    | awk ' \
    BEGIN { \
        FS=""; \
        cg=0; \
        t=0; \
    } \
    { \
        if ($1 != ">") { \
            for (i = 1; i <= NF; i++) { \
                t++; \
                if ($i ~ /[CGcg]/) { \
                    cg++;
                } \
            } \
        } \
        else { \
            if (t > 0) { \
                print e"\t"cg"\t"t"\t"(cg/t); \
                cg = 0; \
                t = 0; \
            } \
            h = substr($0, 2); \
            split(h, a, "-"); \
            e = a[1]"\t"a[2]"\t"a[3]; \
        } \
    } \
    END { \
        print e"\t"cg"\t"t"\t"(cg/t); \
    }' - \
    > answer.bed

You'll need to modify this (untested) pipeline to get the exact columns you want, but I think this should get you about 95% of the way there.

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Alex Reynolds26k

Thank you Alex for comment and nice code, thats work perfectly fine, but output is in coordinate from my bam2fasta - I need calculate cg content in coordinate mention above. Probably I need to do bed intersection...

ADD REPLYlink written 4.0 years ago by Paul1.2k

It looks like you modified your question since I answered it. However, you can take the result from this pipeline and do a bedmap --mean operation against a BED file of pre-made windows (say, 10 bases apart), to quickly and efficiently calculate the average GC content over each window.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Alex Reynolds26k
2
gravatar for Aaronquinlan
4.0 years ago by
Aaronquinlan10k
United States
Aaronquinlan10k wrote:

You could combine the bedtools `map` tool with awk to do this.  Specifically, the example below uses `map` to count the number of overlapping reads for each interval (`-c 10 -o count`) _and_ concatenate all of the nucleotide sequences in each overlapping BAM alignment (`-c 10 -o collapse`).  Thus the fourth column of the output is the count of overlapping reads and the fifth column is the concatenated sequence from each read. Awk is then used to compute the GC fraction from this concatenated sequence.

 

➜  cat test.bed
20    0    1000000
20    1000000    2000000
20    2000000    3000000
➜  bedtools map -a test.bed \
                 -b <(curl -s ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00096/alignment/HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20120522.bam) \
                 -c 10,10 \
                 -o count,concat \
| awk -v OFS="\t" '{n=length($5); gc=gsub("[gcGC]", "", $5); print $1,$2,$3,$4,gc/n}'

20    0    1000000    47199    0.45724
20    1000000    2000000    47929    0.445155
20    2000000    3000000    49902    0.445785
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Aaronquinlan10k

Wau respect - it works - very elegant solution!!!

ADD REPLYlink written 4.0 years ago by Paul1.2k

I tried it and there is some problem with bedtools, i received and right information for chr 1-9 but for chr 10-21 I received only zeros at all. Do you know were could be problem. Segmentation on 60kb.

ADD REPLYlink written 4.0 years ago by Korsocius90
1

In order for the map tool to work correctly, all files must use the same chromosome sorting order. Based on what you describe, it sounds like the order of chromosomes in your BED file differes from the BAM file. Our next release will be able to detect these changes automagically.

ADD REPLYlink written 4.0 years ago by Aaronquinlan10k

It is ok now. There was a bug in older version...

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Korsocius90
1
gravatar for EagleEye
4.0 years ago by
EagleEye6.0k
Sweden
EagleEye6.0k wrote:
After converting your BAM file into FASTA. This gives the GC percentage and CpG density of each regions in the bed file providing the converted FASTA file as genome input to this tool, http://homer.salk.edu/homer/ngs/annotation.html Note: Use option -CpG.
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by EagleEye6.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1598 users visited in the last hour