Question: What Is The Fastest Method To Determine The Number Of Positions In A Bam File With >N Coverage?
4
gravatar for Malachi Griffith
4.4 years ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith16k wrote:

I have two very large BAM files (high depth, human, whole genome). I have a seemingly simple question. I want to know how many positions in each are covered by at least N reads (say 20). For now I am not concerned about requiring a minimum mapping quality for each alignment or a minimum read quality for the reads involved.

Things I have considered:

  • samtools mpileup (then piped to awk to assess the minimum depth requirement, then piped to wc -l). This seemed slow...
  • samtools depth (storing the output to disk so that I can assess coverage at different cutoffs later). Even if I divide the genome into ~133 evenly sized pieces, this seems very slow...
  • bedtools coverage?
  • bedtools genomecov?
  • bedtools multicov?
  • bamtools coverage?

Any idea which of these might be fastest for this question? Something else I haven't thought of? I can use parallel processes to ensure that the performance bottleneck is disk access but want that access to be as efficient as possible. It seems that some of these tools are doing more than I need for this particular task...

ADD COMMENTlink modified 3.2 years ago by Biostar ♦♦ 20 • written 4.4 years ago by Malachi Griffith16k
1

Of the bedtools options, genomecov is the one you want (see answer from Jordan). It is admittedly not the fastest - there are ways to speed it up, but we haven't yet done so. I'd be interested to know the runs times you face as a function of the number of reads.

ADD REPLYlink written 4.4 years ago by Aaronquinlan9.9k

All the answers were helpful but 'bedtools genomecov' was the method I wound up using in the end. 'samtools depth' has convenient output. I expect Pierre's solution is the probably the fastest.

ADD REPLYlink written 4.4 years ago by Malachi Griffith16k
7
gravatar for Pierre Lindenbaum
4.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum99k wrote:

Use the basic code for the Samtools API: http://samtools.sourceforge.net/sam-exam.shtml

change:

 if ((int)pos >= tmp->beg && (int)pos < tmp->end)

to

 if (n>20 && (int)pos >= tmp->beg && (int)pos < tmp->end)

compile: (sould be something like:)

gcc -O3 -o mytool mytool.c -I /path/to/samtools -L /path/to/samtools -lbam -lz

Edit: and you can run this code using one process per chromosome...

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Pierre Lindenbaum99k
3
gravatar for Jordan
4.4 years ago by
Jordan1.0k
Pittsburgh
Jordan1.0k wrote:

You could try bedtools genomeCoverageBed. It takes in bam file and reference genome as input and outputs the coverage stats.

Here is an example:

genomeCoverageBed -ibam file.bam -g human_g1k_v37.fasta > coverage.txt

The output file has 5 columns:

genome    0    1670110268    3101804739    0.538432
genome    1    696287280    3101804739    0.224478
genome    2    329358858    3101804739    0.106183
genome    3    144809487    3101804739    0.0466856
genome    4    66917607    3101804739    0.0215738
genome    5    32869034    3101804739    0.0105967
genome    6    17980502    3101804739    0.00579679
genome    7    10955385    3101804739    0.00353194
  • Column1: Genome/Chromosome name
    Column2: Depth
    Column3: Number of bases in the bam file with depth equal to Column2
    Column4: Genome Length
    Column5: Fraction of human genome with the depth equal to Column2

So, if you need coverage >= 5, in this case it would be 6.26% (you just add 0.538432+0.224478+0.106183+0.0466856+0.0215738 and subtract from 1).

I hope this is what you are looking for.

ADD COMMENTlink written 4.4 years ago by Jordan1.0k
1

This just takes 5-10 min depending on the system you have. Never exceeded 10 min for me.

ADD REPLYlink written 4.4 years ago by Jordan1.0k
1
gravatar for Zev.Kronenberg
4.4 years ago by
United States
Zev.Kronenberg11k wrote:

I pipe samtools depth to a C++ program I wrote to count depth X. From these counts I can quickly calculate the precent of bam covered at X or greater.

I also take missing sites into account.

chr1 1 10

chr 34 5

2-33 = 0;

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Zev.Kronenberg11k

Yeah, samtools depth is the method I was using. But for these massive BAM files, just running samtools depth seems to take a long time...

ADD REPLYlink written 4.4 years ago by Malachi Griffith16k

I've switched over to using sambamba depth, it's much faster than samtools depth and is also multithreaded. It also gives a parsed pileup like output with coverage by the type of base and has additional options to calculate coverage in overlapping widows.

ADD REPLYlink written 7 months ago by Vasisht130

I don't see an equivalent to depth or bedcov in the sambamba docs

ADD REPLYlink written 5 months ago by ariel.balter110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 841 users visited in the last hour