Question

CNV benchmarking

8 months ago

emmanouil.a ▴ 120

Hi, I know how to benchmark small germline variants, and now I want to do the same for germline CNVs to calculate sensitivity and specificity/precision.

Can you suggest which standard sample to run on the sequencer, where to find its true CNV set, and a good benchmarking tool to use?

Thank you very much!

CNV Benchmarking • 703 views

updated 8 months ago by d-cameron ★ 2.9k • written 8 months ago by emmanouil.a ▴ 120

This kind of problem is so well studied that we cannot tell you anything more about a best answer or best practice until you tell us what data type(s) you are working with.

8 months ago by LauferVA 4.2k

Hi, it relates to human DNA-seq data (WGS for germline CNV discovery) using Illumina technology.

8 months ago by emmanouil.a ▴ 120

score 1 · Answer 1 · 2023-08-17

CNV benchmarking is a mess. There are multiple ways to score CNV calling, they give very different results, and none of the scoring methods accurately reflects the different emphases of different use cases.

There are two main approaches:

Per-base

Matching is scored by how many bases of the genome have a matching copy number and how many differ.
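
As a rough illustration of per-base scoring, the sketch below computes the fraction of bases where a call set and a truth set agree on copy number. It assumes a single chromosome, half-open `(start, end, cn)` tuples, and a default CN of 2 wherever no segment is reported; the function name and data layout are hypothetical, not taken from any particular tool.

```python
def per_base_concordance(truth, calls, default_cn=2):
    """Fraction of bases, between the first and last breakpoint of either
    set, where the called CN equals the truth CN.

    truth, calls: lists of (start, end, cn), half-open, one chromosome.
    default_cn: CN assumed wherever a set reports no segment (assumption).
    """
    # Every position where either track can change copy number.
    points = sorted({p for s, e, _ in truth + calls for p in (s, e)})

    def cn_at(segments, pos):
        for start, end, cn in segments:
            if start <= pos < end:
                return cn
        return default_cn

    matched = total = 0
    # CN is constant in both tracks between consecutive breakpoints.
    for left, right in zip(points, points[1:]):
        width = right - left
        total += width
        if cn_at(truth, left) == cn_at(calls, left):
            matched += width
    return matched / total if total else 1.0


truth = [(0, 100, 2), (100, 200, 1), (200, 300, 2)]
calls = [(0, 120, 2), (120, 200, 1), (200, 300, 2)]
concordance = per_base_concordance(truth, calls)
```

Here the caller places the start of the deletion 20 bases late, so 280 of 300 bases agree. Note that a score like this is dominated by the large reference-CN regions, which is one reason it can look excellent while still missing small events.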

Per-segment

Matching is done based on whether the CN caller has both called the correct copy number and segmented the CN changes correctly.

Per-segment is more powerful but much messier. Segmentation depends on the CN resolution, so you need matching criteria: your caller might break a 1Mb segment into three for a 50bp deletion while your truth set only includes CN segments greater than 100bp (or maybe it's the other way around for your data).
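
One possible set of matching criteria is sketched below: a call counts as a TP if it has the same CN as some truth segment and a reciprocal overlap above a threshold, after dropping segments below a minimum size from both sets. Reciprocal overlap, the 50% threshold, and the size filter are all assumptions made for illustration; the names are hypothetical.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for segments a and b."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def segment_precision_recall(truth, calls, min_ro=0.5, min_size=0):
    """Per-segment precision/recall for (start, end, cn) tuples.

    Segments shorter than min_size are dropped from both sets so that the
    two sides are compared at the same resolution.
    """
    truth = [t for t in truth if t[1] - t[0] >= min_size]
    calls = [c for c in calls if c[1] - c[0] >= min_size]

    def hit(seg, others):
        return any(
            seg[2] == o[2] and reciprocal_overlap(seg, o) >= min_ro
            for o in others
        )

    tp_calls = sum(hit(c, truth) for c in calls)
    tp_truth = sum(hit(t, calls) for t in truth)
    precision = tp_calls / len(calls) if calls else 1.0
    recall = tp_truth / len(truth) if truth else 1.0
    return precision, recall


# Non-reference segments only; the second call has the wrong CN and the
# third has no truth counterpart.
truth = [(100, 200, 1), (500, 900, 3)]
calls = [(110, 210, 1), (500, 900, 4), (1000, 1100, 1)]
precision, recall = segment_precision_recall(truth, calls)
```

With these inputs only the first call matches, so precision is 1/3 and recall is 1/2. Changing `min_ro` or `min_size` can move both numbers substantially, which is the messiness described above.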

The problem is that neither of these approaches gives summary metrics that correspond to the biologically meaningful information most relevant to research or clinical use. For example, a copy number caller with 99.99% precision and recall is still a pretty bad somatic CNV caller if it misses a 100x amplification of AR.

Matching logic

The next consideration is what you count as a TP/FP. Some callers just report loss/ref/gain, some expand this to total loss/hom loss/ref/gain/large gain, some report integer CN, and still others report floating-point CN (somatic subclonality means copy number isn't necessarily an integer). You can report TP/FP based on an absolute or relative threshold, or you can use the absolute or relative delta to generate a distribution of how close the caller and truth are. The latter is more informative but doesn't give you precision/recall; you need to threshold it for that.
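
A minimal sketch of the two options, assuming you already have paired `(truth_cn, called_cn)` values for matched regions. The tolerance values and the floor of 1.0 in the relative test (to avoid a zero tolerance when the true CN is 0) are illustrative assumptions, and the function names are hypothetical.

```python
from statistics import median


def is_match(truth_cn, called_cn, abs_tol=0.5, rel_tol=None):
    """Threshold a CN call against truth.

    If rel_tol is given, the allowed error scales with the true CN
    (floored at 1.0, an assumption); otherwise abs_tol is used.
    """
    delta = abs(called_cn - truth_cn)
    if rel_tol is not None:
        return delta <= rel_tol * max(truth_cn, 1.0)
    return delta <= abs_tol


def abs_deltas(pairs):
    """Unthresholded distribution of |called - truth|."""
    return [abs(called - truth) for truth, called in pairs]


pairs = [(2, 2.1), (1, 0.4), (8, 6.5), (100, 80)]
absolute = [is_match(t, c, abs_tol=0.5) for t, c in pairs]
relative = [is_match(t, c, rel_tol=0.25) for t, c in pairs]
median_delta = median(abs_deltas(pairs))
```

With an absolute tolerance of 0.5 only the first pair passes; with a 25% relative tolerance the high-CN pairs (8 vs 6.5, 100 vs 80) also pass. The delta list keeps the full picture, and either threshold turns it into TP/FP counts.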

In practice, I use a combination of these techniques when developing and evaluating CNV calling. There are three key criteria for CN calling: are the CN transitions correct (how many transitions are missing or extra, and how far from the true transitions are the TPs?); how far off the true CN are the CN calls; and can it find known biological events of interest? The final question requires as large a cohort as possible, as you get at most a handful of data points per sample. Generally speaking, I reduce these to summary metrics (precision/recall, median transition error distance) for regression tests but use the more informative distributions internally.
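
The transition criterion could be summarised along these lines. This is a sketch under stated assumptions, not the author's implementation: transitions are plain positions on one chromosome, matching is greedy nearest-neighbour and one-to-one, and the `max_dist` window is an arbitrary illustrative value.

```python
from statistics import median


def match_transitions(truth_pos, called_pos, max_dist=1000):
    """Greedy one-to-one matching of called CN transitions to truth.

    Returns precision/recall plus missing/extra counts and the median
    distance between matched transitions (None if nothing matched).
    """
    remaining = sorted(called_pos)
    distances = []
    missing = 0
    for t in sorted(truth_pos):
        # Nearest still-unmatched called transition, if any are left.
        best = min(remaining, key=lambda c: abs(c - t), default=None)
        if best is not None and abs(best - t) <= max_dist:
            distances.append(abs(best - t))
            remaining.remove(best)
        else:
            missing += 1
    tp = len(distances)
    return {
        "precision": tp / len(called_pos) if called_pos else 1.0,
        "recall": tp / len(truth_pos) if truth_pos else 1.0,
        "missing": missing,
        "extra": len(remaining),
        "median_error": median(distances) if distances else None,
    }


summary = match_transitions([1000, 5000, 9000], [1010, 5200, 20000],
                            max_dist=500)
```

In this example two transitions match (10bp and 200bp off), one truth transition is missed and one call is extra. The summary numbers suit a regression test, while the raw `distances` list is the more informative thing to inspect during development.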

where to find its true CNV set

AFAIK, there are no truly comprehensive CNV truth sets out there. For human germline data, your best bet is to use the HG002 T2T assembly to generate a CN track through realignment of the haplotypes (iteratively for INS, as each insertion corresponds to a copy number gain at the insertion donor site). For cancer, there are a few truth sets out there of varying quality. The Hartwig COLO829 SV truth set is great for CN transition evaluation, but it's not an actual CN truth set.