Question: How to count the total aligned bases in BAM?
0
gravatar for vw
2.3 years ago by
vw20
vw20 wrote:

A quick question: I want to count the total aligned bases in a BAM file. I try to apply pysam (samtools) and pybedtools (bedtools) to do that. How can I make it? Thanks.

EDIT: I have solved this question with pybedtools and pandas:

 genomecov_base = bd.BedTool.genome_coverage(bam, dz=True) ## Depth per base on genome    

 gencov_perbase = pan.read_table(genomecov_base.fn, names=['chrom', 'position', 'depth'])
 gencov_allbase = (gencov_perbase.iloc[:, 2]).sum()    ## Total bases mapped on genome
pybedtools pysam bam python • 3.1k views
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by vw20

Hi, this sounds like something I need as well. Does this count the actual number of reads that contain a particular base or just how many reads aligned at that position? For example at position 1 if I know I have 100 reads aligned there, would it tell me I have 50 A's, 25 a, 25 C? Something like that? Or does it just say that there are 100 bases aligned there? Cheers :)

ADD REPLYlink written 20 months ago by DNAngel60

Hi DNAngel, it does not return the per base output, but one option I would recommend for such a calculation is sambamba depth. This will return the per base output at each, although the files become quite large.

ADD REPLYlink modified 20 months ago • written 20 months ago by drkennetz500
5
gravatar for drkennetz
2.3 years ago by
drkennetz500
drkennetz500 wrote:

Hi, if it is available you could do a quick bedtools and awk calculation:

bedtools genomecov -ibam input.bam -bga > output_aln.bam
awk '{s+=$4}END{print s}' output_aln.bam

this will output the total of reads at each base from your reference genome in your bam file, and then awk will just some the number of reads at each base (giving you the total aligned bases).

I hope this helps!

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by drkennetz500

Thanks for your code. However, the parameter -a cannot be recognized. I cannot find it in the list also.

ADD REPLYlink written 2.3 years ago by vw20

I solved this question based on your suggestion. I have re-edited my question. Thank you!

ADD REPLYlink written 2.3 years ago by vw20

No problem! Glad to help. The -a flag means output all positions (even those with 0 coverage) which may or may not be useful downstream. You don't actually have to include it if you just want the aligned bases coverage, but sometimes it is useful if you want regions with 0 coverage too, in case you have a targeted region and you are trying to see how well it is covered.

ADD REPLYlink written 2.3 years ago by drkennetz500
1

Hi drkennetz, I tried your code again. I found that genomecov cannot work with -a(at least on my laptop). But coverage does. In order to obtain the depth of all positions, I tried -d or -dz with genomecov. They work fine.

ADD REPLYlink written 2.3 years ago by vw20

This was a mistake on my part! it should be:

bedtools genomecov -ibam input.bam -bga > output_aln.bam

the -a flag is for bedtools coverage, you are correct! The bga flag means bed graph all so it reports positions with 0 coverage as well. If you do not want 0 coverage positions you can use the flag -bg instead. I edited the answer to reflect the change. Thanks for catching this!

ADD REPLYlink written 2.3 years ago by drkennetz500

If the question is answered, please consider to mark drkennetz response as accepted answer to help others that have the same question in the future.

ADD REPLYlink written 2.3 years ago by ATpoint38k

This will sum the reads per base, while the question was about the number of aligned bases (i.e. each covered base should be counted once, regardless of how many reads aligns over it). In this case, just sum the length of the aligned contig (also, using just -bg should speed up the process):

bedtools genomecov -ibam input.bam -bg > output_aln.bg
awk '{ s+=($3-$2) } END { print s }' output_aln.bg
ADD REPLYlink written 10 months ago by benpo0

I am worried that this will significantly underestimate the number of aligned bases.

With -bga, each row has one coverage value for a range of bases, not each individual base. Illustrated in the manual here: https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html You would need to multiply the column 4 integer by the number of basepairs spanned by columns 2-3 to get total number of bp.

I think you might need to edit the first command to use -d instead of -bga to get per-base coverage:

bedtools genomecov -ibam input.bam -d > output_aln.bam
awk '{s+=$4}END{print s}' output_aln.bam
ADD REPLYlink written 9 months ago by johnston.mike.j10
1
gravatar for finswimmer
2.3 years ago by
finswimmer13k
Germany
finswimmer13k wrote:

Hello vw,

beside other metrics you can get this value with CollectAlignmentSummaryMetrics from picard tools.

fin swimmer

ADD COMMENTlink written 2.3 years ago by finswimmer13k

Thanks for your help.

ADD REPLYlink written 2.3 years ago by vw20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 636 users visited in the last hour