Hi folks,
Total newbie here. I have just received WES files from sequencing my DNA and I am interested in calculating some basic statistics. First, I would like to know how much of the human exome my results actually cover - so, to answer the question "how whole is whole"?. Secondly, I am told the average coverage is 50X, however I would like to see how this varies locally.
How should I go about doing this? I have an Ubuntu laptop and have started installing tools like IGV and Alfred, but I basically have no idea what I am doing so I need some simple instructions to get going.
My data consists of two gzipped FASTQ files, (R1 and R2), a gzipped VCF file and a BAM file. What should I do first with these files?
Thanks!
You should take a look at
mosdepth
- it will give you per-base depth metrics from a BAM file. With a targets BED file, you can use it to see how much of the exome is covered.Thanks, that's very useful. I found I got quite different results with different BED files, so I assume the coordinates are not universal. However, I found that Twist_Exome_Target_hg38.bed produced results that at least looked reasonable. For example, in the produced *region.dist.txt I see
which I'm interpreting to mean that 86% of the exome is sequenced to 30X coverage. When I tried the hg19.bed I saw a lot of warnings like
and the equivalent output line is
so I'm pretty sure that is wrong. I have asked the company behind the consumer kits for some information but they have been slow to respond (I'm certain they are very busy right now). Anyway, it is more fun to figure things out this way...
Are you sure the reference matches up? You're right - the sequencing folks will know the kit used and will be able to give you the targets BED file.
It is expected that you'll get different results with different BED files at the threshold level or region level, but if bases overlap between the BED files, there should be no difference in the depth on those bases unless mosdepth was run with differrent settings - it excluded certain reads by default based on their mapping flags.
I ran with the same settings, changing only the BED file. I think you and Bruce are right though, I need a bit more information from the sequencing company, so I've asked for the appropriate BED file. We shall see!
To conclude, I got the appropriate BED file from the DTC sequencing company, which is for an Agilent exome kit and the hg38 assembly. The mosdepth results look sensible (no error messages) and it appears 83% of the exome was sequenced at 30X coverage or above, and 91% at 20X.