6.1 years ago by
Tennessee
Here's a python solution. You need to have the pysam library installed (pip install pysam
):
import pysam
from sys import argv
bam_file = argv[1]
bed_file = argv[2]
bam = pysam.AlignmentFile(bam_file, 'rb')
with open(bed_file) as bf:
for line in bf:
line_parts = line.strip().split()
chr = line_parts[0]
start = int(line_parts[1])
end = int(line_parts[2])
read_data = bam.fetch(chr, start, end)
total_bases = 0
gc_bases = 0
for read in read_data:
seq = read.query_sequence
total_bases += len(seq)
gc_bases += len([x for x in seq if x == 'C' or x == 'G'])
if total_bases == 0:
gc_percent = 'No Reads'
else:
gc_percent = '{0:.2f}%'.format(float(gc_bases)/total_bases * 100)
print '{0}\t{1}'.format(line.strip(), gc_percent)
I didn't understand your comment to my question. If you want to print the total number of GC bases in the reads of a coordinate set, just change the last line to:
print '{0}\t{1}'.format(line.strip(), gc_bases)
You run the program like this:
python [script_name] [bam_file] [bed_file]
It prints the modified BED data to STDOUT.
•
link
modified 6.1 years ago
•
written
6.1 years ago by
Dan D ♦ 7.2k
It is important consider that such measurements will be biased by coverage
Should the cg_content column be expressed as a percentage of bases which are G or C?
better would be just a number - cg content - I now how to calculate separately from bam -
awk '{ n=length($10); print $10, gsub(/[GCCgcs]/,"",$10)/n;}' your.sam , but I need to calculate according BED file.