Question: Coverage of reference from the BAM file
0
gravatar for brs1111
19 months ago by
brs111110
brs111110 wrote:

I have a SAM/BAM file and I would like to calculate the percentage of the reference sequences are aligned with the reads. Is there a tool that can help to calculate the coverage?

bam sam alignment • 855 views
ADD COMMENTlink written 19 months ago by brs111110

what did you find so far ?

ADD REPLYlink written 19 months ago by Pierre Lindenbaum118k

I tried samtools mpileup but it got aborted as I run without -r chr3:1,000-2,000 and single BAM file. Also I run bamtools coverage -in in.BAM -out output_reads_coverage.txt, but not sure how can I calculate the percentage of coverage from output file.

ADD REPLYlink written 19 months ago by brs111110

What do you mean by the 'percentage of coverage' exactly?

samtools flagstat will show the percentage of your reads that have aligned to the reference genome. BEDTools, on the other hand, has numerous functions that allow you to determine depth of coverage and read depth of these reads that have aligned [to the reference].

ADD REPLYlink written 19 months ago by Kevin Blighe41k

Hi Kevin, Thanks for the reply. I used samtools flagstat for other statistics. Suppose I have one contigs of length 100 and two reads of length 20 and 30 respectively. Let the first read is aligned from position 10 to 30 and the second read is aligned from position 20 to 50. So the percentage of coverage is 40% here as all bases from position 10 to 50 are aligned.

Does BEDTools generate this percentage?

ADD REPLYlink written 19 months ago by brs111110
1

Yes, that's what I thought that you meant. It's not a traditional metric that 's observed but I have to admit that it's interesting and I think that I'll calculate it in my analyses in the future. Thanks!

Anyway, I have been able to do this just now indirectly using BEDTools and looking over the hg38 reference genome:

Bases at 0 read depth:

bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4==0 {bpCountZero+=($3-$2)} {print bpCountZero}' | tail -1

Bases at >0 read depth:

bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4>0 {bpCountNonZero+=($3-$2)} {print bpCountNonZero}' | tail -1

The genomecov function of BEDTools normally computes coverage in your BAM file for each position in your supplied reference genome. The output looks like this:

chr1    0   11794364    0
chr1    11794364    11794488    3
chr1    11794488    11794490    5
chr1    11794490    11794514    6
chr1    11794514    11794639    3
chr1    11794639    14022492    0
chr1    14022492    14022535    1
chr1    14022535    97082336    0
chr1    97082336    97082442    1
chr1    97082442    97082486    2
chr1    97082486    97082593    1
chr1    97082593    97305303    0
chr1    97305303    97305377    2

I then use the awk commands to calculate total bases at 0 read depth and total bases >0 read depth. After that, the calculation for percentage can easily be done. For my sample (a small targeted sequence panel), I ended up with:

34312 non-zero bases
3209251793 zero bases
(34312 / (3209251793 + 34312 )) * 100 = 0.0011% genome coverage

It took less than a minute to run with my commands above.

If you want to automate this in a pipeline, the following code would work (note that in BASH decimal points need to be handled specially with the bc command):

zero=$(bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4==0 {bpCountZero+=($3-$2)} {print bpCountZero}' | tail -1)

nonzero=$(bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4>0 {bpCountNonZero+=($3-$2)} {print bpCountNonZero}' | tail -1)

percent=$(bc <<< "scale=6; ($nonzero / ($zero + $nonzero))*100")

echo $percent

Like I said, I think that this is a useful metric and you could perhaps turn this into a tutorial(?), if Pierre agreed too.

ADD REPLYlink modified 19 months ago • written 19 months ago by Kevin Blighe41k
1

Thanks Kevin. It works really great. I have just one more question (could be a very basic question as I am new to this sort of analysis). Actually, I have a reference file that contains almost 60K sequences and only 50% of them are aligned with reads. I would like to find the coverage only for the sequences in reference that are aligned. The coverage percentage that I calculated is very low and it could be because of zero bases of all sequences.

ADD REPLYlink written 19 months ago by brs111110

Hi again friend,

If you just want the coverage of the aligned reads, then you just change the -bga parameter to -bg, and this will not then output the unaligned reads (with 0 read depth).

There are other options here using bedtools genomecov: http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html

Let me know if this answers your question!

ADD REPLYlink written 19 months ago by Kevin Blighe41k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1341 users visited in the last hour