How to count the total aligned bases in BAM?
2
1
Entering edit mode
4.7 years ago
vw ▴ 30

A quick question: I want to count the total aligned bases in a BAM file. I try to apply pysam (samtools) and pybedtools (bedtools) to do that. How can I make it? Thanks.

EDIT: I have solved this question with pybedtools and pandas:

 genomecov_base = bd.BedTool.genome_coverage(bam, dz=True) ## Depth per base on genome

gencov_perbase = pan.read_table(genomecov_base.fn, names=['chrom', 'position', 'depth'])
gencov_allbase = (gencov_perbase.iloc[:, 2]).sum()    ## Total bases mapped on genome

bam pysam pybedtools python • 6.2k views
0
Entering edit mode

Hi, this sounds like something I need as well. Does this count the actual number of reads that contain a particular base or just how many reads aligned at that position? For example at position 1 if I know I have 100 reads aligned there, would it tell me I have 50 A's, 25 a, 25 C? Something like that? Or does it just say that there are 100 bases aligned there? Cheers :)

0
Entering edit mode

Hi DNAngel, it does not return the per base output, but one option I would recommend for such a calculation is sambamba depth. This will return the per base output at each, although the files become quite large.

5
Entering edit mode
4.7 years ago
drkennetz ▴ 560

Hi, if it is available you could do a quick bedtools and awk calculation:

bedtools genomecov -ibam input.bam -bga > output_aln.bam
awk '{s+=4}END{print s}' output_aln.bam  this will output the total of reads at each base from your reference genome in your bam file, and then awk will just some the number of reads at each base (giving you the total aligned bases). I hope this helps! ADD COMMENT 0 Entering edit mode Thanks for your code. However, the parameter -a cannot be recognized. I cannot find it in the list also. ADD REPLY 0 Entering edit mode I solved this question based on your suggestion. I have re-edited my question. Thank you! ADD REPLY 0 Entering edit mode No problem! Glad to help. The -a flag means output all positions (even those with 0 coverage) which may or may not be useful downstream. You don't actually have to include it if you just want the aligned bases coverage, but sometimes it is useful if you want regions with 0 coverage too, in case you have a targeted region and you are trying to see how well it is covered. ADD REPLY 1 Entering edit mode Hi drkennetz, I tried your code again. I found that genomecov cannot work with -a(at least on my laptop). But coverage does. In order to obtain the depth of all positions, I tried -d or -dz with genomecov. They work fine. ADD REPLY 0 Entering edit mode This was a mistake on my part! it should be: bedtools genomecov -ibam input.bam -bga > output_aln.bam  the -a flag is for bedtools coverage, you are correct! The bga flag means bed graph all so it reports positions with 0 coverage as well. If you do not want 0 coverage positions you can use the flag -bg instead. I edited the answer to reflect the change. Thanks for catching this! ADD REPLY 0 Entering edit mode If the question is answered, please consider to mark drkennetz response as accepted answer to help others that have the same question in the future. ADD REPLY 0 Entering edit mode This will sum the reads per base, while the question was about the number of aligned bases (i.e. each covered base should be counted once, regardless of how many reads aligns over it). In this case, just sum the length of the aligned contig (also, using just -bg should speed up the process): bedtools genomecov -ibam input.bam -bg > output_aln.bg awk '{ s+=(3-2) } END { print s }' output_aln.bg  ADD REPLY 0 Entering edit mode I am worried that this will significantly underestimate the number of aligned bases. With -bga, each row has one coverage value for a range of bases, not each individual base. Illustrated in the manual here: https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html You would need to multiply the column 4 integer by the number of basepairs spanned by columns 2-3 to get total number of bp. I think you might need to edit the first command to use -d instead of -bga to get per-base coverage: bedtools genomecov -ibam input.bam -d > output_aln.bam awk '{s+=4}END{print s}' output_aln.bam

1
Entering edit mode
4.7 years ago

Hello vw,

beside other metrics you can get this value with CollectAlignmentSummaryMetrics from picard tools.

fin swimmer

0
Entering edit mode