Calculating GC content from BAM file by BED file
4
3
Entering edit mode
7.8 years ago
Paul ★ 1.5k

Dear all,

I have bedg file created from bedtools -

.

.

.

3th column is information about overlapping reads at this region. I would like to add to 4th column information about GC content in reads from my BAM file. Do you have any idea how to do that? Thank you for any help.

So result should be:

cg content bam bed bedtools • 9.0k views
1
Entering edit mode

It is important consider that such measurements will be biased by coverage

0
Entering edit mode

Should the cg_content column be expressed as a percentage of bases which are G or C?

0
Entering edit mode

better would be just a number - cg content - I now how to calculate separately from bam -

awk '{ n=length($10); print$10, gsub(/[GCCgcs]/,"",10)/n;}' your.sam but I need to calculate according BED file. ADD REPLY 3 Entering edit mode 7.8 years ago Dan D 7.3k Here's a python solution. You need to have the pysam library installed (pip install pysam): import pysam from sys import argv bam_file = argv[1] bed_file = argv[2] bam = pysam.AlignmentFile(bam_file, 'rb') with open(bed_file) as bf: for line in bf: line_parts = line.strip().split() chr = line_parts[0] start = int(line_parts[1]) end = int(line_parts[2]) read_data = bam.fetch(chr, start, end) total_bases = 0 gc_bases = 0 for read in read_data: seq = read.query_sequence total_bases += len(seq) gc_bases += len([x for x in seq if x == 'C' or x == 'G']) if total_bases == 0: gc_percent = 'No Reads' else: gc_percent = '{0:.2f}%'.format(float(gc_bases)/total_bases * 100) print '{0}\t{1}'.format(line.strip(), gc_percent) I didn't understand your comment to my question. If you want to print the total number of GC bases in the reads of a coordinate set, just change the last line to: print '{0}\t{1}'.format(line.strip(), gc_bases) You run the program like this: python [script_name] [bam_file] [bed_file] It prints the modified BED data to STDOUT. ADD COMMENT 0 Entering edit mode Dear Dan, this is looks very nice, but I have error: AttributeError: 'module' object has no attribute 'AlignmentFile' Do you have any idea? Thank you for response.. ADD REPLY 0 Entering edit mode You need to install the pysam library. You can download it from github and install it manually, or if you have the Python package manager pip installed, you can type pip install pysam from the command line to install it. ADD REPLY 0 Entering edit mode Thank you for help - it is installed. But when I try to run I have following error: Traceback (most recent call last): File "cg_content.py", line 10, in <module> bam = pysam.AlignmentFile(bam_file.bam, 'rb') AttributeError: 'module' object has no attribute 'AlignmentFile' Do you know where could be a problem? ADD REPLY 0 Entering edit mode That's strange. Are you sure it's properly installed? How did you install it? The AlignmentFile() method is central to the pysam library. If it's not available it implies that the installation of the library is somehow incomplete or incorrect. ADD REPLY 0 Entering edit mode Dear Dan, thank you so much - I did not installed pyrex and Sphinx - right now it looks it work perfectly fine. May I ask - in some intervals there is no reads = 0 - I have at the end error bug: gc_percent = float(gc_bases)/total_bases * 100 ZeroDivisionError: float division by zero Is there any solution how to avoid in python to divide by zero? Very very thank you for your time and help!!! ADD REPLY 0 Entering edit mode Oops, hahaha. That was clumsy of me. Check my updated answer for the fix! ADD REPLY 0 Entering edit mode Dan, thank you SOO MUCH - it is now perfect! ADD REPLY 0 Entering edit mode Make sure you are using the most current version of pysam ADD REPLY 0 Entering edit mode Just wanted to add a link to this website, which can help debugging AlignmentFile() issues: ADD REPLY 0 Entering edit mode I am wondering why the GC content is calculated here for a BAM file and not for example from reference genome. Does this imply that if one has 100 samples, she has to calculate the GC content for each sample separately? ADD REPLY 4 Entering edit mode 7.8 years ago You could combine the bedtools map tool with awk to do this. Specifically, the example below uses map to count the number of overlapping reads for each interval (-c 10 -o count) and concatenate all of the nucleotide sequences in each overlapping BAM alignment (-c 10 -o collapse). Thus the fourth column of the output is the count of overlapping reads and the fifth column is the concatenated sequence from each read. Awk is then used to compute the GC fraction from this concatenated sequence. ➜ cat test.bed 20 0 1000000 20 1000000 2000000 20 2000000 3000000 ➜ bedtools map -a test.bed \ -b <(curl -s ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00096/alignment/HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20120522.bam) \ -c 10,10 \ -o count,concat \ | awk -v OFS="\t" '{n=length(5); gc=gsub("[gcGC]", "", $5); print$1,$2,$3,$4,gc/n}' 20 0 1000000 47199 0.45724 20 1000000 2000000 47929 0.445155 20 2000000 3000000 49902 0.445785 ADD COMMENT 0 Entering edit mode Wau respect - it works - very elegant solution!!! ADD REPLY 0 Entering edit mode I tried it and there is some problem with bedtools, i received and right information for chr 1-9 but for chr 10-21 I received only zeros at all. Do you know were could be problem. Segmentation on 60kb. ADD REPLY 1 Entering edit mode In order for the map tool to work correctly, all files must use the same chromosome sorting order. Based on what you describe, it sounds like the order of chromosomes in your BED file differes from the BAM file. Our next release will be able to detect these changes automagically. ADD REPLY 0 Entering edit mode It is ok now. There was a bug in older version... ADD REPLY 0 Entering edit mode I am really amazed by the simplicity and awesomeness of this solution. Just one change I would like to make it to work would be | awk -v OFS="\t" '{n=length($6); gc=gsub("[gcGC]", "", $6); print$1,$2,$3,$4,gc/n}' As$5 is the count of the number of reads that overlapped with the bed region

0
Entering edit mode

This doesn't seem to be the case.

Below is the output of the first portion of the one-liner before the AWK command.

chrX    301493  301678  2      ACGGCATACGCGATAACGCTATGTGTCTGGAGTTAAGAG
chrX    305864  306261  2      TGACTCGGCCTTAGCCTCCCGAGTAGCTGGGACTACAGA

Grabbing field #5 would be the correct move here. There isn't a field 6 at all.

2
Entering edit mode
7.8 years ago

Building off my previous answer (How To Calculate A Region‘ Gc Content In Reference Genome？？), you could first convert the BAM reads into FASTA with samtools and awk, then pipe that result to a second awk script that converts FASTA to a BED-like result that includes columns with per-FASTA element GC content:

$samtools view reads.bam \ | awk ' \ BEGIN { \ OFS="\t"; \ } \ { \ print ">"$3"-"($4-1)"-"($4+$9)"\n"$10 \
}' - \
| awk ' \
BEGIN { \
FS=""; \
cg=0; \
t=0; \
} \
{ \
if ($1 != ">") { \ for (i = 1; I <= NF; i++) { \ t++; \ if ($i ~ /[CGcg]/) { \
cg++;
} \
} \
} \
else { \
if (t > 0) { \
print e"\t"cg"\t"t"\t"(cg/t); \
cg = 0; \
t = 0; \
} \
h = substr(\$0, 2); \
split(h, a, "-"); \
e = a[1]"\t"a[2]"\t"a[3]; \
} \
} \
END { \
print e"\t"cg"\t"t"\t"(cg/t); \
}' - \

You'll need to modify this (untested) pipeline to get the exact columns you want, but I think this should get you about 95% of the way there.

0
Entering edit mode

Thank you Alex for comment and nice code, thats work perfectly fine, but output is in coordinate from my bam2fasta - I need calculate cg content in coordinate mention above. Probably I need to do bed intersection...

0
Entering edit mode

It looks like you modified your question since I answered it. However, you can take the result from this pipeline and do a bedmap --mean operation against a BED file of pre-made windows (say, 10 bases apart), to quickly and efficiently calculate the average GC content over each window.

1
Entering edit mode
7.8 years ago
EagleEye 7.3k
After converting your BAM file into FASTA. This gives the GC percentage and CpG density of each regions in the bed file providing the converted FASTA file as genome input to this tool, http://homer.salk.edu/homer/ngs/annotation.html Note: Use option -CpG.