Question: Counting number of bases covered across multiple bam files
0
gravatar for ((Morpheus))
4.9 years ago by
United States
((Morpheus))0 wrote:

Hi all,

Long-time lurker, first-time poster.

Is there a tool that can be used to count the number of bases with coverage >= X in each of 3 input files? (That is, if chr1:1,000,000 has 30 reads at that position in each of bams A,B,C, then that position would be counted towards the total.)

Thanks in advance!

coverage multiple bam • 2.9k views
ADD COMMENTlink modified 4.9 years ago by Jorge Amigo11k • written 4.9 years ago by ((Morpheus))0

HTSeq-count would help you

http://www-huber.embl.de/users/anders/HTSeq/doc/counting.html

ADD REPLYlink written 4.9 years ago by Manvendra Singh2.1k
1
gravatar for Devon Ryan
4.9 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

You could do this with pysam (or htslib if you prefer C) and an mpileup. This will allow you to iterate over each base in the genome and count the number of reads in each sample (possibly with filtering metrics that you can specify) that cover that base. This is relatively quick. You could also use samtools depth on each file separately (internally, this is just a pileup) and then parse the results to count the bases. This might prove to be simpler to write in fact.

ADD COMMENTlink written 4.9 years ago by Devon Ryan92k

Thanks! This seems doable

ADD REPLYlink written 4.9 years ago by ((Morpheus))0

would you happen to have a small code sample that shows how to do this?

ADD REPLYlink written 21 months ago by steve2.3k
1
gravatar for Jorge Amigo
4.9 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

I sometimes need to do something very similar, and I solve it by using bedtools. the rationale is very simple:

  1. extract a bed file per bam file with the regions where the bam file's positions are covered above a threshold (by generating a bedgraph file, filtering it by coverage, and merging all adjacent regions)
  2. intersect all the bed files previously obtained

the commands needed would be these ones:

THRESHOLD=30
foreach bamfile in *bam; do
  bedtoools genomecov -bga -ibam $bamfile \
  | awk -v t=$THRESHOLD '$4>=t' \
  | bedtools merge -i - \
  > ${bamfile/.bam/.above$THRESHOLD.bed}
done
cp `ls *.above$THRESHOLD.bed | head -1` bed.tmp
for bedfile in *.above$THRESHOLD.bed; do
  bedtools intersect -a bed.tmp -b $bedfile > bed.tmp.2
  mv bed.tmp.2 bed.tmp
done
mv bed.tmp above$THRESHOLD.intersect.bed
ADD COMMENTlink modified 4.0 years ago • written 4.9 years ago by Jorge Amigo11k

Aha, I really like this solution! I think I'll give this a shot tonight

ADD REPLYlink written 4.9 years ago by ((Morpheus))0
0
gravatar for Dan D
4.9 years ago by
Dan D6.8k
Tennessee
Dan D6.8k wrote:

Off the top of my head, if I wanted to solve this problem quickly I would merge the BAMs (you could also use samtools merge) and then run Picard's CollectWgsMetrics tool. What you'll get at the end is a text file showing the number of bases at each coverage level, along with some other nice details.

ADD COMMENTlink modified 4 days ago by RamRS24k • written 4.9 years ago by Dan D6.8k

This is an interesting idea, but how will I know if a base is covered in all 3 files? If a given position has 100 reads in the merged bam, is it possible to identify the extreme case where all of the reads are coming from the same sample?

ADD REPLYlink written 4.9 years ago by ((Morpheus))0

Well, that's a different question. I was addressing your original post with my answer. 

For this new query, i would use bedtools genomecov with each of your BAMs (and perhaps with the merged BAM as well) to ascertain the per-base coverage at each reference position.

If you were interested in specific locations of the genome, I would put those locations in a BED file and use samtools to create a BAM for just those regions. Feed that truncated BAM into the above tool and you should get your results quickly.

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Dan D6.8k

Thanks for your response, and sorry if my original post was unclear!

Do you think the genomecov method is feasible with whole genomes?

ADD REPLYlink written 4.9 years ago by ((Morpheus))0

It certainly is. The documentation for the tool details several ways you can have the output presented. I'm not sure exactly what you're wanting to get out of it at this point; read through the short documentation for that tool. If the available parameters don't meet your needs, you might be able to accomplish your goal by piping the output into a filtering script or sed/awk command. If you have any questions, post a comment on this answer or your question and I'll do what I can to help. Precise explanations of what you want will be helpful to that end :)

ADD REPLYlink written 4.9 years ago by Dan D6.8k

Yes, of course. I have 3 whole genomes (normal, pre, and post treatment) and want to know the number of bases covered to 10X in all 3 files. I am somewhat familiar with genomecov, and know that I can pipe the output to a simple awk script that will keep only those bases with coverage > X and subsequently use some FNR==NR magic to compare those files. I'm worried that the genomecov outputs will be prohibitively massive, however, even if I gzip them. I guess I was hoping for a nifty tool that could read bams and count covered bases simultaneously...

ADD REPLYlink written 4.9 years ago by ((Morpheus))0

You can use join command in order to combine outputs from awk:
join <(... file1 | awk ...) <(... file2 | awk ...) | join - <(... file3 | awk)

This will avoid the need for storing output. If the column is not the first one, 'join' has options to set its number.

ADD REPLYlink written 4.9 years ago by lomereiter450
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1022 users visited in the last hour