Question: How to plot coverage and depth statistics of a bam file
8
gravatar for krithika.gu
3.0 years ago by
krithika.gu120
United States
krithika.gu120 wrote:

Hello,

I am able to find tools like Bedtools, Samtools, etc that calculate the coverage and depth statistics for a BAM file.

But I am looking for a tool that will plot them for my BAM file. I tried a tool called BAMStats (http://bamstats.sourceforge.net/). This tool did not work on my BAM file - it errored out saying "too many reference files. My BAM file is from viral samples so there are many reference genomes.

I am able to run Bedtools and Samtools mpileup and get some numbers spit out, but I don't know how to create a plot out of it.

Any help on this will be useful.

Thanks

Krithika

 

 

ADD COMMENTlink modified 11 months ago by kirannbishwa01450 • written 3.0 years ago by krithika.gu120

Hi Deedee - thanks for the response. Can you many give some explanation on which tool in the Picard suite of tools can be used ?

ADD REPLYlink written 3.0 years ago by krithika.gu120

Definitely! Answer updated.

ADD REPLYlink written 3.0 years ago by Dan D6.0k

I did try out CollectAlignmentSummaryMetrics from Picard - I'm not exactly sure that is what I am looking for. I have also tried BedTools' genomeCov .

Basically what I am looking for is:

- Looking for a tool that will spit out some statistics about coverage and depth of BAM file.  I want to use this information to plot coverage and depth 

- OR find a tool that will directly spit out these plots. I found two tools that do this - BamStats and Qualimap. Both are not working on my data, probably because they are not human genome, but viral genomes.

ADD REPLYlink written 3.0 years ago by krithika.gu120
16
gravatar for kirannbishwa01
11 months ago by
United States
kirannbishwa01450 wrote:

This is the same problem I recently had, but I was not able to find a good solution. I was finally able to work a method which I am sharing here. While this question is 2 years old but I am hoping there are many individuals trying to work out the solution to this problem, so I hope this will be helpful for others.

Identification of the depth of coverage is quite useful in 1) identifying the regions that might have potential paralogous alignment 2) finding the coverage at your regions of interest.

Here is a step by step guide to the problem/question.

Step #1) First identify the depth at each locus from a bam file.

I have found samtools depth option more useful in this regard, when coverage at each locus is desired.

samtools depth deduped_MA605.bam > deduped_MA605.coverage

The output file 'deduped_MA605.coverage' file will have 3 columns (Chr#, position and depth at that position) like below.

Chr  position depth (this header will be absent though)
1       3980        66
1       3981        68
1      3982     67
1      3983        67
1      3984     68

Step #2) Now, select the coverage (depth) by locus for each chromosome and/or regions

We can use the coverage file to plot it in R. But, the file is so large that it will suck up almost all the memory. It better to split the coverage by chromosome (or region of the chromosome if required).

To select the coverage for a particular chromosome (Chr#1 in my case)

awk '$1 == 1 {print $0}' deduped_MA605.coverage > chr1_MA605.coverage

To select coverage from chr #2

awk '$1 == 2 {print $0}' deduped_MA605.coverage > chr2_MA605.coverage

If the chrosomosome has string characters it can be adjusted as

awk '$1 == "chr2" {print $0}' deduped_MA605.coverage > chr2_MA605.coverage

Step #3) To plot the data in R this coverage file will need to be imported and the headers need to be added.

MA605.chr2 <- read.table("/media/everestial007/SeagateBackup4.0TB/New_Alignment_Set/05-B-deDupedReads-forScanIndel/chr2_MA605.coverage",
         header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)`

Note: The header of the column are automatically set as V1, V2 and V3.

To rename the headers

library(reshape) # loads the library to rename the column names

MA605.chr2<-rename(MA605.chr2,c(V1="Chr", V2="locus", V3="depth")) # renames the header

Now, plot the coverage by depth:

plot(MA605.chr2$locus, MA605.chr2$depth)

to get the wider/cleaner view of the plot use

library(lattice, pos=10) xyplot(depth ~ locus, type="p", pch=16, auto.key=list(border=TRUE), par.settings=simpleTheme(pch=16), scales=list(x=list(relation='same'), y=list(relation='same')), data=MA605.chr2, main="depth by locus - Chr2 (SampleMA605)")

The output plot looks like this: coverage depth by locus

Note:

  • We can see that there is a region around the middle and next to it that has a very high coverage. This likely suggests paralogous alignment.

  • Also, there is a gap in alignment. The region with no alignment is near (at) centromere where no consensus sequence has been found.

  • Reads may be later filtered by coverage.

Thanks,

ADD COMMENTlink modified 7 months ago • written 11 months ago by kirannbishwa01450

Step #2) can be done in one step like this

gawk '{/^[0-9]/{print >$1".coverag"}' ./deduped_MA605.coverag

ADD REPLYlink written 11 months ago by Medhat6.5k

This is really useful. However on the final step I get an error message from R:

Unexpected symbol in "library(lattice, pos=10) xyplot"

I can't see an error here but obviously there is one. If anyone has a good eye and has any suggestions to correct this that would be great...

ADD REPLYlink modified 7 months ago • written 7 months ago by Schmendrick10

Try it now. The full code was hidden due to formatting. It should work now.

ADD REPLYlink written 7 months ago by kirannbishwa01450

Would you guide how to plot same chromosome in one graph from multiple samples (bam) files?

Thanks.

ADD REPLYlink written 5 days ago by bioinfo870
7
gravatar for Dan D
3.0 years ago by
Dan D6.0k
Tennessee
Dan D6.0k wrote:

The Picard suite is your ticket.

EDIT:

Specifically, CollectAlignmentSummaryMetrics gives you your alignment percentage (and many other metrics) for a BAM file input.

If you have a interval table (something like a BED file) which describes genomic regions of interest, you can calculate depth of coverage statistics using CalculateHSSummaryMetrics. This tool will also give you detailed alignment metrics as well, so you can run it in place of CollectAlignmentSummaryMetrics.

 

More generally, there's also BedTools' genomeCov and GATK's DepthOfCoverage

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Dan D6.0k
1

This doesn't answer the question of plotting the coverage across genome.

ADD REPLYlink written 21 months ago by student-t380
2
gravatar for rbagnall
3.0 years ago by
rbagnall1.1k
Australia
rbagnall1.1k wrote:

Take a look at this blog post from Stephen Turner., which has a step by step guide to visualize coverage for targeted NGS (exome) experiments

http://gettinggeneticsdone.blogspot.com.au/2014/03/visualize-coverage-exome-targeted-ngs-bedtools.html

 

ADD COMMENTlink written 3.0 years ago by rbagnall1.1k
0
gravatar for Vivek
3.0 years ago by
Vivek1.7k
Denmark
Vivek1.7k wrote:

I doubt you'll find an out of the box plotting solution for this but if all you need is coverage information, you could try using samtools mpileup to generate read coverage at each position in the BAM and make a plot using R.

ADD COMMENTlink written 3.0 years ago by Vivek1.7k

Hi Vivek,

Would you elaborate more on this, if possible by showing the commands?

Thanks!

ADD REPLYlink written 6 days ago by bioinfo870
samtools mpileup your.bam | awk '{print $1"\t"$2"\t"$4}'

Gives you coverage at each chr:pos

There after you can use something like ggplot in R to plot a histogram of coverage.

ADD REPLYlink written 5 days ago by Vivek1.7k

Thanks Vivek. I have gone through various posts but still unclear whether to use samtools depth or samtools mpileup to calculate coverage of my multiple bam files. Secondly, as each file is very big, I am thinking to plot different chromosomes from multiple bam files in multiple plots, for e.g. if it consists of 23 chromosomes, I can plot 23 graphs where each graph contains same chromosomes coverage from multiple bam files. I hope I am in the right direction. But, if I want to see the coverage of all chromosomes in all bam files, what would you suggest?

ADD REPLYlink written 5 days ago by bioinfo870

I'm not sure what you mean by all BAM files, in general its best to view coverage over a single sample (BAM) rather than combine across samples as you'd otherwise miss the variation between samples. You should be able to plot all chromosomes at once even if its a WGS experiment but if it is too cumbersome to parse and you have limited computing power, you can probably sub-sample by taking the coverage every 10 bases and likely still get a good enough idea.

ADD REPLYlink written 5 days ago by Vivek1.7k

Ok. I have exome bam files of different individuals from same organism.

ADD REPLYlink modified 5 days ago • written 5 days ago by bioinfo870
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 797 users visited in the last hour