DP (read depth) in VCF files?
7.0 years ago
barah_chbib ▴ 10

In Galaxy, when using VCF filter, the default parameter is -f"DP>10". Why does a higher DP matter for calling genotypes in diploids?

VCF filter • 13k views
7.0 years ago

For heterozygous and homozygous germline variants, you don't require your read depth to be too high. From a clinical perspective, the best range in which to call these variants is between read-depth 18 and anything up to 70 or 80. 30 is the sweet-spot, though. You should not expect that higher read-depth than 30 will actually improve accuracy/precision - it may just serve to introduce more error. 18 is the minimum recommended read-depth at which to call variants in the UK clinical genetics scene, and we confirmed this through our validation work comparing NGS against Sanger in the National Health Service England.
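To make the thresholds concrete, here is a minimal Python sketch of what a depth filter such as `DP>10` does: it keeps only the records whose INFO `DP` value exceeds a cut-off. It assumes DP sits in the INFO column (some callers also report a per-sample DP in the FORMAT column). The function names and the toy records are made up for illustration, and this is not how Galaxy or vcffilter is implemented.

```python
def info_depth(vcf_line):
    """Return the INFO DP value of a VCF data line, or None if absent."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # the 8th column of a VCF record is INFO
    for entry in info.split(";"):
        if entry.startswith("DP="):
            return int(entry[3:])
    return None


def filter_by_depth(lines, min_depth=10):
    """Yield header lines unchanged and data lines with DP > min_depth."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        depth = info_depth(line)
        if depth is not None and depth > min_depth:
            yield line


# Toy records (hypothetical data, not from any real sample).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=8;AF=0.5",
    "chr1\t200\t.\tC\tT\t60\tPASS\tDP=31;AF=0.5",
    "chr1\t300\t.\tG\tA\t40\tPASS\tAF=1.0",
]
kept = list(filter_by_depth(toy_vcf, min_depth=10))
```

Changing `min_depth` to 18 or 30 gives the stricter cut-offs discussed above. Records without a DP entry are dropped here, which is a design choice of this sketch.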

Research settings typically use 10 (or lower) as the cut-off because, being research, they can tolerate more error when the results have no (or minimal) clinical implications. However, it is of course possible to call true variants from just 1 (homozygous) or 2 (heterozygous) reads, although you will not find many who recommend this.
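A simple binomial model shows why a couple of reads is risky for heterozygous calls. The sketch below assumes each read samples the alternate allele independently with probability 0.5 and ignores sequencing and alignment error entirely. The function name and the `min_reads` parameter are illustrative assumptions, not a rule taken from any variant caller.

```python
from math import comb


def prob_both_alleles_seen(depth, min_reads=1, alt_fraction=0.5):
    """P(ref and alt are each seen in at least `min_reads` of `depth` reads)."""
    total = 0.0
    # Sum the binomial probabilities over every alt-read count that leaves
    # at least `min_reads` reads for each allele.
    for alt_reads in range(min_reads, depth - min_reads + 1):
        total += (
            comb(depth, alt_reads)
            * alt_fraction ** alt_reads
            * (1 - alt_fraction) ** (depth - alt_reads)
        )
    return total


for depth in (2, 10, 18, 30):
    p = prob_both_alleles_seen(depth)
    print(f"depth {depth:>2}: P(both alleles seen) = {p:.6f}")
```

Under these assumptions, a heterozygous site covered by 2 reads shows both alleles only half of the time, and that is before any error is taken into account.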

In scenarios like cancer and circulating free DNA (cfDNA) analysis, where you'd expect to find variants / mutations at frequencies other than 50% or 100%, high read depth gives you a better chance of finding low-frequency mutations in the 'tumour bulk' that was biopsied and sequenced. Thus, 30 is not ideal here. High depth is very useful because tumours consist of multiple clones, each with its own mutation profile. When you analyse a bulk tumour biopsy, you're in fact analysing many clones at the same time. If you have a primary tumour and a matched metastatic sample, you can then see how certain mutations become more (or less) frequent in the metastatic sample compared to the primary, and thus infer their role in metastasis.
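The effect of depth on low-frequency variants can be sketched with the same kind of binomial reasoning. The example assumes reads carry the variant independently with probability equal to the variant allele frequency (VAF), ignores error, and uses an arbitrary requirement of at least 3 supporting reads. The 5% VAF and the threshold are illustrative assumptions only.

```python
from math import comb


def detection_probability(depth, vaf, min_alt_reads=3):
    """P(at least `min_alt_reads` of `depth` reads carry the variant)."""
    # Probability of seeing fewer than the required number of alt reads.
    p_fewer = sum(
        comb(depth, i) * vaf ** i * (1 - vaf) ** (depth - i)
        for i in range(min(min_alt_reads, depth + 1))
    )
    return 1.0 - p_fewer


for depth in (30, 100, 500, 1000):
    p = detection_probability(depth, vaf=0.05)
    print(f"depth {depth:>4}: P(>=3 supporting reads) = {p:.4f}")
```

With these toy numbers, a 5% variant is missed most of the time at depth 30 but is almost always supported by enough reads at depth 500.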

Many variant callers will actually downsample your reads when they call variants, i.e., they only look at 500 or 1000 reads and then don't process any further. This is in part to save processing time and memory.
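As a generic illustration of capping the number of reads considered at a site, here is a sketch that uses reservoir sampling to keep a uniform random subset. This is a swapped-in textbook technique, not the algorithm of any specific variant caller (some may simply stop after the first N reads). `max_reads` and the read labels are hypothetical.

```python
import random


def downsample_reads(reads, max_reads, seed=0):
    """Keep at most `max_reads` items from an iterable, uniformly at random.

    Uses reservoir sampling, so the reads can be streamed without
    knowing in advance how many there are.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, read in enumerate(reads):
        if index < max_reads:
            reservoir.append(read)
        else:
            # Keep the new read with probability max_reads / (index + 1),
            # overwriting a randomly chosen slot.
            slot = rng.randint(0, index)
            if slot < max_reads:
                reservoir[slot] = read
    return reservoir


# Hypothetical pileup of 5000 reads, capped at 1000.
pileup = [f"read_{i}" for i in range(5000)]
subset = downsample_reads(pileup, max_reads=1000)
```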

As to why we need multiple reads, well, NGS is 'messy'. It has many inherent errors at each step in the process (assuming sequencing by synthesis, as is used by Illumina). Error can occur:

  • When reads are being sequenced in the sequencer: polymerase is known to add incorrect bases, and in the sequencer there are no base-excision DNA repair mechanisms to correct these (however, there is a PhiX control; see: Can phasing or pre-phase during basecall cause indel?)
  • When the base-calling software attempts to read the fluorescent signals to infer which base was added
  • When the aligner mis-aligns a read (due to the high level of sequence similarity that exists all across the human genome)
  • When the variant caller incorrectly calls or fails to call a variant
  • When quality scores are incorrect, or when programs interpret quality scores differently
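On the last point, base qualities are Phred-scaled (Q = -10 * log10(p), where p is the error probability) and are stored as ASCII characters with an offset. The sketch below shows how the same quality string decodes to very different scores under offset 33 versus offset 64 (the latter was used by some older Illumina pipelines). The quality string is a made-up example.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Decode an ASCII-encoded quality string into Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


quality_string = "IIIIHGF@"  # hypothetical example
as_phred33 = decode_qualities(quality_string, offset=33)
as_phred64 = decode_qualities(quality_string, offset=64)

for q33, q64 in zip(as_phred33, as_phred64):
    print(
        f"offset 33: Q{q33} (p={phred_to_error_prob(q33):.5f})   "
        f"offset 64: Q{q64} (p={phred_to_error_prob(q64):.5f})"
    )
```

A program that assumes the wrong offset will treat good bases as poor ones (or the reverse), which then feeds into the variant caller.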

Although I mentioned that you can call true variants from just 1 or 2 reads, this is simply not advised. Why? Because the aligner will mis-align many reads and still assign them a high mapping quality. In a targeted experiment, you'll see dozens, hundreds, or thousands of positions across the genome that are covered by just 1 read. Calling variants on these is a nightmare: you'll get tens, hundreds, or thousands of variant calls because many of these alignments are errors, so the bases won't exactly match the reference. This is why we also use BED files to home in on the regions that we initially targeted (and to filter out information from all others).
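The BED restriction can be sketched in a few lines of Python. The main thing to get right is the coordinate convention: BED intervals are 0-based and half-open, whereas VCF positions are 1-based. The function names, regions, and calls below are hypothetical. For real data a dedicated tool such as bedtools would normally be used instead of this linear scan.

```python
def parse_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def in_targets(chrom, vcf_pos, regions):
    """True if a 1-based VCF position falls inside any target interval."""
    zero_based = vcf_pos - 1  # convert VCF (1-based) to BED (0-based)
    return any(start <= zero_based < end for start, end in regions.get(chrom, []))


# Hypothetical target regions and variant calls.
bed = ["chr1\t99\t200", "chr2\t1000\t2000"]
regions = parse_bed(bed)
calls = [("chr1", 100), ("chr1", 201), ("chr2", 1500), ("chr3", 5)]
on_target = [c for c in calls if in_targets(c[0], c[1], regions)]
```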

One must be aware, though, that even the gold standard, Sanger sequencing, has inherent errors. Error exists everywhere in technology, and we have to understand what our tolerances to it, and thresholds for it, are.

For further reading, see: A: Sanger sequencing is no longer the gold standard?


Very informative! Finally some of my questions were answered!

About your following comment, could you please point us to a publication where these calculations have been worked out? I am very interested to see the mathematical equations/calculations behind 18, 30, and 70-80.

For heterozygous and homozygous germline variants, you don't require your read depth to be too high. From a clinical perspective, the best range in which to call these variants is between read-depth 18 and anything up to 70 or 80. 30 is the sweet-spot, though. You should not expect that higher read-depth than 30 will actually improve accuracy/precision - it may just serve to introduce more error.


Hey, the work was not published but is filed in documents in the National Health Service England.

Essentially, our standard sequencing protocol resulted in most variants being called at an average position read depth of ~70 reads. Then, with a set of 'validation samples' that were run on both NGS and Sanger, we did the following:

  1. take Sanger-confirmed NGS variants
  2. sub-sample the reads in the aligned BAM files to simulate reduced read depth, and then re-call variants on these 'reduced' BAMs
  3. check the last known position read depth at which each Sanger-confirmed variant could be detected

What we found was that, at read depth 18, one could still detect ~97% of the original Sanger-confirmed variants. At read depth 30, 100% of the variants were detected. Sequencing to higher read depths made no change whatsoever, so, the conclusion was that most labs are wasting money by sequencing to read depths >500 or >1000.
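The sub-sampling procedure described above can be mimicked with a toy simulation. Note the big assumption: in the actual validation work the variants were re-called from sub-sampled BAM files with a real variant caller, whereas here a made-up rule (`is_detected`, with arbitrary thresholds) stands in for the caller, and the read counts are hypothetical.

```python
import random


def is_detected(reads, min_alt_reads=2, min_alt_fraction=0.2):
    """Toy detection rule (an assumption, standing in for a real caller)."""
    if not reads:
        return False
    alt = sum(1 for r in reads if r == "alt")
    return alt >= min_alt_reads and alt / len(reads) >= min_alt_fraction


def lowest_detectable_depth(reads, depths, seed=0):
    """Sub-sample reads to each depth (high to low) and return the last
    depth at which the variant is still detected, or None if never."""
    rng = random.Random(seed)
    last_detected = None
    for depth in sorted(depths, reverse=True):
        if depth > len(reads):
            continue
        subset = rng.sample(reads, depth)  # simulate reduced read depth
        if not is_detected(subset):
            break
        last_detected = depth
    return last_detected


# Hypothetical heterozygous site at ~70x: 35 alt and 35 ref reads.
site_reads = ["alt"] * 35 + ["ref"] * 35
depths = [70, 50, 30, 18, 10, 5, 2]
print(lowest_detectable_depth(site_reads, depths))
```

Repeating this over all Sanger-confirmed variants, and counting how many are still detected at each depth, gives the kind of percentages quoted above.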


This is very informative, thanks a lot! Do you happen to remember what percentage of the variants would be detected at read depth 10? Also it would be great to know whether the documents you are referring to are publicly accessible?


I think that it was still surprisingly high at 10, but not sufficient for a clinical lab. We did try to do a more comprehensive analysis and publish the work in Genetics in Medicine, but we had no resources. I'm not sure where you're based, but the National Health Service in England is not strictly research focused. I would recommend contacting the lab where I was based and asking for Nick Beauchamp, with whom I did this work. There is a monitored email address on this page: https://www.sheffieldchildrens.nhs.uk/contact-us/

If you mention my name, Nick will remember.


Sorry, I pasted the incorrect link. Here is the correct one: https://www.sheffieldchildrens.nhs.uk/sdgs/
