Question: Calculate percentage of bases covered less than nX in targeted sequencing experiment
RamRS (Houston, TX) wrote 3.8 years ago:

Hi,

I'm looking to calculate some coverage statistics on the alignment in my targeted sequencing experiment. I have a BAM file and a BED file with the target regions. I wish to determine the percentage of bases covered above 30X, 40X, 50X etc. How can I best do this?

Also, we sequenced a LOT of samples in a single lane, so we had to repeat the experiment multiple times to get sufficient coverage. Will the above procedure help me determine if I have reached sufficient coverage? I wish to have at least 70X across the target region, but I can make do with a small percentage of low coverage regions.

I already have stats on average coverage per sample (computed by dividing the number of bases sequenced on target by the total length of the target region).

modified 11 weeks ago by toni • written 3.8 years ago by RamRS
Devon Ryan (Freiburg, Germany) wrote 3.8 years ago:

Have you tried just piping samtools depth to awk or do you want a more complicated analysis than that?
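
That pipeline can be sketched as follows (a minimal sketch; targets.bed and sample.bam are hypothetical file names, and synthetic three-column depth output stands in here so the awk step can be shown end to end):

```shell
# Real input would come from something like:
#   samtools depth -a -b targets.bed sample.bam
# Here, printf fakes the (chrom, pos, depth) output so the awk step runs standalone:
printf 'chr1\t1\t35\nchr1\t2\t10\nchr1\t3\t50\n' |
awk -v t=30 '$3 >= t { c++ } { n++ } END { printf "%.1f%% of bases >= %dX\n", 100*c/n, t }'
# → 66.7% of bases >= 30X
```

Rerunning with -v t=40, t=50, etc. gives the other cutoffs.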


Me to samtools: "I don't know half of you half as well as I should like; and I like less than half of you half as well as you deserve."

Thank you, Devon. I'll explore samtools!

written 3.8 years ago by RamRS

Devon, samtools depth | awk seems like a really speedy option in my case. I have 30 BAM files, and what samtools depth did in under 24 hours (all samples put together), GATK's DepthOfCoverage has been unable to do for even 5 chromosomes of a single sample.

I haven't tried Brian's tool yet, because I'm not sure it works with targeted sequencing. I'd prefer not to have to do a filtering step after the depth analysis.

written 3.7 years ago by RamRS

Thanks for getting back to us on that, it's good to know. I'm not surprised that GATK's tool is slow; their walker methods are generally absurdly slow. For example, BisSNP, a modified version of GATK for methylation realignment/calling, is around 10x slower at processing a file than my htslib-based tool.

Let us know if you end up using Brian's tool too. At the end of the day, the speed here should just be limited by disk I/O.

written 3.7 years ago by Devon Ryan

I'm using my org's HPC, and I/O is great + not in my control, but yeah, GATK seems quite slow and consumes time that I do not have.

I might try Brian's tool today and check its performance for my specific use case, but I must also focus on getting the work done. samtools depth has given me granular (per-base) depth data, and I need to stratify this and check across multiple sequencing runs for consistently low-covered regions.

My ultimate aim is to determine whether another run of sequencing can reduce the fraction of low-covered regions, and I guess I should consult with my sequencing team on how the biology and sample prep affect this as well.
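
The cross-run check could be sketched like this (a sketch only; run1.depth/run2.depth and the 70X cutoff are placeholders, and printf fakes the per-base samtools depth output so the awk step runs standalone):

```shell
# Synthetic (chrom, pos, depth) files standing in for real `samtools depth`
# output from two sequencing runs:
printf 'chr1\t1\t65\nchr1\t2\t80\nchr1\t3\t20\n' > run1.depth
printf 'chr1\t1\t90\nchr1\t2\t40\nchr1\t3\t30\n' > run2.depth

# Print positions below 70X in BOTH runs (the consistently low-covered ones):
awk -v t=70 'NR==FNR { if ($3 < t) low[$1"\t"$2]=1; next }
             $3 < t && ($1"\t"$2) in low { print $1, $2 }' run1.depth run2.depth
# → chr1 3
```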

written 3.7 years ago by RamRS

Focus on getting the work done? Pfft, what's that about :P

written 3.7 years ago by Devon Ryan

Exploration is good, but not at the cost of timely results, no? "I'm working on it" can buy only so much time :P

written 3.7 years ago by RamRS

Agreed :) If only there were infinite time!

written 3.7 years ago by Devon Ryan
Varun Gupta (United States) wrote 3.8 years ago:

I think GATK's DepthOfCoverage tool does exactly what you want; you could then filter out the 30X, 40X, etc. levels with awk.


I stumbled across this too, along with samtools | awk, but I wished to consult with folks here first. I feel that discussions, not questions, are the essence of our interactions here :)

Thank you for recommending this - I'll give it a try!

written 3.8 years ago by RamRS

I suspect that Brian's tool will prove the most convenient; his tools typically do. In any case, I'm curious to hear what turns out to be the best solution. BTW, with the samtools route, do make sure to use -q 1 or higher so you don't double-count cases where paired-end reads overlap.
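
In command form, that advice might look like the following (a sketch with hypothetical file names, not tested here; -a reports all positions and -b restricts to the targets):

```shell
# Per the advice above, bases in the overlap of a read pair get their quality
# zeroed out, so -q 1 keeps them from being counted twice:
samtools depth -a -b targets.bed -q 1 sample.bam > sample.depth
```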

written 3.8 years ago by Devon Ryan

My tool is indeed convenient, but does double-count overlapping paired reads, and uses much more memory (2 bytes/base) than is necessary when processing sorted data.

However, double-counting the overlapped region is not necessarily incorrect; they are two independent observations of the same base, from the perspective of sequencing error. I would have much higher confidence in a base call doubly-covered by 10 overlapping read pairs than one covered by 10 non-overlapping reads. Not as much as 20 totally independent non-overlapping reads; somewhere in between, but more on the high side. As such, I prefer overlapped regions double-counted when calculating coverage.

Though when variant calling, the best approach (IMO) is to track the number of unique pair IDs that indicate a given variant, which prevents double-counting.

modified 3.8 years ago • written 3.8 years ago by Brian Bushnell

Yeah, one of the things that samtools will do is adjust the quality for bases covered by overlapping reads in a pair (well, it at least penalizes them if they disagree and will in any case convert the phred score of the base coming from read #2 to 0). I should actually check to see if it bumps the scores if the bases agree (I don't think so, but that'd be a good option to add).

written 3.8 years ago by Devon Ryan

Sorry, I edited my response to better reflect my position that calling variants is best done when excluding the X bases closest to the ends (where X is around 8), to allow more specificity with regard to indels versus substitutions. When looking for SNPs only, truncating the ends of both reads to the point that they no longer overlap, and adjusting the qualities accordingly, is probably best, but it reduces the ability to correctly call indels near the middle of the pair.

modified 3.8 years ago • written 3.8 years ago by Brian Bushnell
Brian Bushnell (Walnut Creek, USA) wrote 3.8 years ago:

If you want to use BBTools:

pileup.sh in=mapped.bam hist=histogram.txt

That gives the number of bases at every coverage level.  To process bam files, samtools must be in the path; sam files don't need it.  The input does not need to be sorted.

modified 3.7 years ago by RamRS • written 3.8 years ago by Brian Bushnell

Thank you for the detailed response outlining all possible nuances in the approach. I'll try this and get back to you on how it performs!

written 3.8 years ago by RamRS

Brian, the tool is really fast, thank you!

Quick question: My experiment involves targeted sequencing, so is there any problem with the program not accepting a BED file with my target regions?

written 3.7 years ago by RamRS

Pileup's histogram will give you coverage statistics over the entire reference.  It can also give you per-base or binned coverage using the "basecov=" or "bincov=" flags.  It cannot currently calculate the coverage distribution restricted to the area specified in a bed file, so if that's what you want, you'd have to generate the per-base coverage and process it in conjunction with the bed file yourself.
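
That manual combination could be sketched like this (synthetic stand-in files; this assumes the "basecov=" output has chrom, 0-based position, and coverage columns, so check the header your version prints before relying on it):

```shell
# Fake per-base coverage in place of pileup.sh's basecov= output,
# plus a one-interval BED (0-based, half-open):
printf 'chr1\t0\t35\nchr1\t1\t10\nchr1\t2\t80\nchr1\t3\t75\n' > basecov.txt
printf 'chr1\t1\t3\n' > targets.bed

# Keep only positions inside the BED intervals, then report the >=70X fraction:
awk '/^#/ { next }
     NR==FNR { for (i=$2; i<$3; i++) keep[$1"\t"i]=1; next }
     ($1"\t"$2) in keep { n++; if ($3 >= 70) c++ }
     END { printf "%.1f%% of target bases >= 70X\n", 100*c/n }' targets.bed basecov.txt
# → 50.0% of target bases >= 70X
```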

However, I'm always interested in adding options.  Can you explain exactly how you want the bed file to be processed?  As in, am I correct in thinking that you just want the coverage distribution of the ranges designated in the bed file?

written 3.7 years ago by Brian Bushnell

Thank you for the detailed response, Brian. You are correct - my usage scenario is calculating depth of coverage at particular stretches of sequence as specified in the BED file, where each stretch is an enriched region targeted by our probe. This can currently be done using the -b option in samtools depth and the -L option in GATK's DepthOfCoverage.

BTW, does your code have a web user manual listing all options by any chance?

written 3.7 years ago by RamRS

All of the documentation is currently in the shellscripts.  When you run a shellscript with no arguments it will print all of the options for that tool.

written 3.7 years ago by Brian Bushnell
toni (Lyon) wrote 11 weeks ago:

I know it's an old thread, but having tested this tool recently, I would definitely use mosdepth to answer this question.

It's fast and its --thresholds option does exactly this.
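
For reference, the invocation might look like this (a sketch with hypothetical file names; --thresholds needs a region file via --by, and the per-region counts of bases at or above each cutoff should land in sample.thresholds.bed.gz):

```shell
# Count bases covered at >= 30X/40X/50X/70X within each target region:
mosdepth --by targets.bed --thresholds 30,40,50,70 sample sample.bam
```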

Anthony

Powered by Biostar version 2.3.0