Question

How to get basic statistics on WES data

0

Entering edit mode

4.0 years ago

dagewa • 0

Hi folks,

Total newbie here. I have just received WES files from sequencing my DNA and I am interested in calculating some basic statistics. First, I would like to know how much of the human exome my results actually cover - so, to answer the question "how whole is whole"?. Secondly, I am told the average coverage is 50X, however I would like to see how this varies locally.

How should I go about doing this? I have an Ubuntu laptop and have started installing tools like IGV and Alfred, but I basically have no idea what I am doing so I need some simple instructions to get going.

My data consists of two gzipped FASTQ files, (R1 and R2), a gzipped VCF file and a BAM file. What should I do first with these files?

Thanks!

sequencing • 1.1k views

ADD COMMENT • link updated 4.0 years ago by bruce.moran ▴ 960 • written 4.0 years ago by dagewa • 0

1

Entering edit mode

You should take a look at mosdepth - it will give you per-base depth metrics from a BAM file. With a targets BED file, you can use it to see how much of the exome is covered.

ADD REPLY • link 4.0 years ago by Ram 43k

0

Entering edit mode

Thanks, that's very useful. I found I got quite different results with different BED files, so I assume the coordinates are not universal. However, I found that Twist_Exome_Target_hg38.bed produced results that at least looked reasonable. For example, in the produced *region.dist.txt I see

total   30      0.86

which I'm interpreting to mean that 86% of the exome is sequenced to 30X coverage. When I tried the hg19.bed I saw a lot of warnings like

[mosdepth] warning requested interval outside of chromosome range:249106098..249106569

and the equivalent output line is

total   30      0.08

so I'm pretty sure that is wrong. I have asked the company behind the consumer kits for some information but they have been slow to respond (I'm certain they are very busy right now). Anyway, it is more fun to figure things out this way...

ADD REPLY • link 4.0 years ago by dagewa • 0

0

Entering edit mode

Are you sure the reference matches up? You're right - the sequencing folks will know the kit used and will be able to give you the targets BED file.

It is expected that you'll get different results with different BED files at the threshold level or region level, but if bases overlap between the BED files, there should be no difference in the depth on those bases unless mosdepth was run with differrent settings - it excluded certain reads by default based on their mapping flags.

ADD REPLY • link 4.0 years ago by Ram 43k

0

Entering edit mode

I ran with the same settings, changing only the BED file. I think you and Bruce are right though, I need a bit more information from the sequencing company, so I've asked for the appropriate BED file. We shall see!

ADD REPLY • link 4.0 years ago by dagewa • 0

0

Entering edit mode

To conclude, I got the appropriate BED file from the DTC sequencing company, which is for an Agilent exome kit and the hg38 assembly. The mosdepth results look sensible (no error messages) and it appears 83% of the exome was sequenced at 30X coverage or above, and 91% at 20X.

ADD REPLY • link 4.0 years ago by dagewa • 0

score 0 · Answer 1 · 2020-04-09

1) Tell whoever gave you the data that you need help from someone who knows how to process this kind of data.

2) Google to try and find someone local to you who works in the specific field (rare disease, cancer etc.) and works with exome data.

3) Use a reproducible pipeline like https://nf-co.re/sarek