Question

Infer somatic mutations without normal control

2

6.3 years ago

CY ▴ 750

Say we have called a list of variants from a tumor sample (no normal control available) and want to set filtering criteria that separate somatic mutations from germline variants and from false positive calls. We have set the criteria listed below. For each variant called:

1) ignore positions with depth < 50

2) ignore variants with call quality < 30

3) ignore variants with allele frequency < 0.02

4) ignore variants with allele frequency equal to 0.5 or > 0.9

5) ignore variants that appear in dbSNP

6) ignore variants that are not in COSMIC

7) check neighboring variants to see whether they have similar allele frequencies. If they do, the region may be polyploid and all variants within it may actually be germline. For example, the allele frequencies of 4 neighboring variants within a tetraploid region: 0.23, 0.21, 0.19, 0.24

Can anyone comment on whether any of these criteria sound unreasonable? More importantly, can you think of further criteria that may help to identify real somatic mutations? Much appreciated.
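To make the criteria above concrete, here is a minimal Python sketch of how they could be applied. The `Variant` record, the function names, the `het_tol` tolerance around 0.5, and the neighbour settings (`max_dist`, `af_tol`, `min_cluster`) are all assumptions made for illustration; only the depth, quality, allele frequency, dbSNP and COSMIC thresholds come from the list.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int
    depth: int       # total read depth at the position
    qual: float      # variant call quality
    af: float        # variant allele frequency in the tumor
    in_dbsnp: bool
    in_cosmic: bool


def passes_basic_filters(v, min_depth=50, min_qual=30, min_af=0.02,
                         max_af=0.9, het_tol=0.02):
    """Apply criteria 1-6; het_tol is an assumed tolerance around AF 0.5."""
    if v.depth < min_depth:                          # 1) low depth
        return False
    if v.qual < min_qual:                            # 2) low call quality
        return False
    if v.af < min_af:                                # 3) very low allele frequency
        return False
    if abs(v.af - 0.5) <= het_tol or v.af > max_af:  # 4) looks germline het/hom
        return False
    if v.in_dbsnp:                                   # 5) present in dbSNP
        return False
    return v.in_cosmic                               # 6) must be in COSMIC


def similar_af_neighbours(variants, max_dist=10_000, af_tol=0.05, min_cluster=3):
    """Criterion 7: flag variants that have at least min_cluster variants
    (counting themselves) on the same chromosome, within max_dist bases,
    with allele frequencies within af_tol. All three settings are assumptions."""
    flagged = []
    for v in variants:
        close = [w for w in variants
                 if w.chrom == v.chrom
                 and abs(w.pos - v.pos) <= max_dist
                 and abs(w.af - v.af) <= af_tol]
        if len(close) >= min_cluster:
            flagged.append(v)
    return flagged


# Hypothetical calls using the allele frequencies from the example in 7)
calls = [
    Variant("chr1", 1000, 120, 60, 0.23, False, True),
    Variant("chr1", 1500, 110, 55, 0.21, False, True),
    Variant("chr1", 2100, 130, 58, 0.19, False, True),
    Variant("chr1", 2600, 125, 61, 0.24, False, True),
    Variant("chr2", 5000, 200, 70, 0.40, False, True),
]
kept = [v for v in calls if passes_basic_filters(v)]
suspect_germline = similar_af_neighbours(calls)
```

The neighbour check is a simple quadratic scan, which is fine for a few thousand calls; a position-sorted sliding window would be the obvious replacement for larger call sets.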

variant calling somatic mutation • 3.1k views

ADD COMMENT • link updated 6.3 years ago by markus.riester ▴ 550 • written 6.3 years ago by CY ▴ 750

1

Can anyone share some opinions on this topic? Much appreciated!

ADD REPLY • link 6.3 years ago by CY ▴ 750

1

I strongly recommend against filtering with dbSNP. dbSNP contains somatic variants that have functional consequences and can be disease-driving. It is better to filter against the 1000 Genomes Project, as its variants reflect common human variation. This can be done with the Variant Effect Predictor (VEP) from Ensembl.

ADD REPLY • link 6.3 years ago by ATpoint 82k

0

Agree. Thanks for sharing

ADD REPLY • link 6.3 years ago by CY ▴ 750

0

By using VEP to perform that analysis, what would be the flag that would point to a potential germline variant? Thanks!

ADD REPLY • link 4.6 years ago by dodausp ▴ 180

0

You mean germline 'variant'? You could check the 1000 Genomes minor allele frequencies to infer whether or not the variant in question has been observed in this cohort, and thus, infer whether or not it is a germline variant - nothing can ever 100% guarantee that it is germline, though. You will require the matched normal sample for that. Also remember that some germline variants can increase risk of cancer.

ADD REPLY • link 4.6 years ago by Kevin Blighe 87k

0

Thanks, @Kevin Blighe! Always very helpful. Yes, I meant "variant". Sorry about that. I corrected it now. (: And yes, I totally agree that this should be only a rough estimation. In the case you're describing here, would enabling the option "Exclude common variants" on the VEP tool achieve that?

ADD REPLY • link 4.6 years ago by dodausp ▴ 180

0

To a certain extent it would achieve that - yes. The 'Exclude common variants' filter is explained like this:

Exclude common variants

Filter out variants that are co-located with an existing variant that has a frequency greater than 0.01 (1%) in the 1000 Genomes global population. Equivalent to --filter_common in the VEP script.

[source: https://uswest.ensembl.org/info/docs/tools/vep/online/input.html]

So, it filters based on a MAF cut-off.
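As a rough illustration of that logic (not VEP's own code), a MAF cut-off filter could be sketched in Python as below. The dictionary layout, the `pop_af` key and the function name `exclude_common` are hypothetical; the 0.01 threshold is the one quoted above.

```python
def exclude_common(variants, max_pop_af=0.01):
    """Drop variants co-located with a known variant whose population
    frequency is greater than max_pop_af; keep rare and unknown ones."""
    kept = []
    for v in variants:
        pop_af = v.get("pop_af")  # None means no co-located known variant
        if pop_af is None or pop_af <= max_pop_af:
            kept.append(v)
    return kept


# Hypothetical annotated calls
annotated = [
    {"id": "var_a", "pop_af": 0.25},   # common, removed
    {"id": "var_b", "pop_af": 0.001},  # rare, kept
    {"id": "var_c", "pop_af": None},   # not seen in the population, kept
]
candidates = exclude_common(annotated)
```

Note that rare germline variants pass this filter, so whatever remains is only a candidate somatic set.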

Provided that you document everything that you do, the specificities of the filtering steps can be debated at a later time, possibly by peer reviewers.

ADD REPLY • link 4.6 years ago by Kevin Blighe 87k

0

Sorry about this late reply. Somehow some replies were not showing on my feed. Yes, I had read that explanation before, but I still wasn't sure whether using it would be enough to estimate the somatic variants. After running VEP, I saw that a lot of variants were actually filtered out, and skimming through them, they made sense. Thank you again, @Kevin Blighe! Honestly, if it wasn't for your input in a lot of threads here, it would take me 10 times longer to find the answers. Thank you! And thanks to the many members of this awesome community.

ADD REPLY • link 4.5 years ago by dodausp ▴ 180

0

Also, I considered using Mutect2 in the tumor-only mode, but it seems that there is still the need to build a PoN. And from what I understood there, a way to call the somatic variants in tumor only samples would be to first create a PoN from all the tumors, identifying the overlapping variants, and then exclude them from each sample? Is that correct?

ADD REPLY • link 4.6 years ago by dodausp ▴ 180

0

I am not too familiar with Mutect2, as I use Lancet for somatic variant calling. Would you not have to create the PoN from actual normal samples, though? I guess that the logic about common variants in the tumours is that these would not normally occur by chance in the context of somatic mutation, and could therefore be assumed to be germline variants.

ADD REPLY • link 4.6 years ago by Kevin Blighe 87k

0

My impression is that PoN in MuTect2 is primarily used to exclude artifacts that occur frequently across your cohort.

ADD REPLY • link 4.6 years ago by CY ▴ 750

0

Thank you

ADD REPLY • link 4.6 years ago by Kevin Blighe 87k

0

Thank you, @CY. And as @Kevin Blighe pointed out above, do you think that eliminating the common variants found in my samples (cancer only) could be used as a step to filter out potential germline variants? Would it improve the performance if used jointly with the VEP analysis, or would most of it likely overlap with what VEP finds?

ADD REPLY • link 4.5 years ago by dodausp ▴ 180

0

Yes, I think getting rid of non-reference sites that recurrently appear in the PoN would practically filter out both artifacts and germline variants.
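A minimal Python sketch of that idea, assuming each sample's calls are available as `(chrom, pos, ref, alt)` tuples. This only illustrates recurrence filtering across a tumor-only cohort; it is not how Mutect2 builds or applies a PoN, and the function names and the `min_samples` threshold are hypothetical.

```python
from collections import Counter


def recurrent_sites(calls_by_sample, min_samples=2):
    """Return sites called in at least min_samples samples."""
    counts = Counter()
    for calls in calls_by_sample.values():
        counts.update(set(calls))  # count each site once per sample
    return {site for site, n in counts.items() if n >= min_samples}


def drop_recurrent(calls_by_sample, min_samples=2):
    """Remove recurrent sites from every sample's call list."""
    blacklist = recurrent_sites(calls_by_sample, min_samples)
    return {sample: [c for c in calls if c not in blacklist]
            for sample, calls in calls_by_sample.items()}


# Hypothetical cohort of three tumor-only samples
cohort = {
    "tumor_1": [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")],
    "tumor_2": [("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")],
    "tumor_3": [("chr1", 100, "A", "T")],
}
filtered = drop_recurrent(cohort)
```

One caveat with this approach: genuine hotspot mutations that recur across tumors would be removed as well, so the recurrent sites are worth reviewing before they are discarded.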

ADD REPLY • link 4.5 years ago by CY ▴ 750

0

Thank you, @CY. It does make a lot of sense.

ADD REPLY • link 4.5 years ago by dodausp ▴ 180

score 1 · Answer 1 · 2018-01-12

1) I assume you are talking about high coverage data? Then yes, ad hoc, but pretty standard. Ideally one examines the coverage of the assay over many samples and understands why some regions have low coverage. Then understand how this impacts your variant calling.
2) Should be fine.
3) Depends what you want to accomplish with this filter. But yes, 2-5% is fairly standard for baseline tissue samples, mainly to filter variants that are potentially cross-contaminated or so sub-clonal that they are unlikely functional. If cross-contamination is no concern and this is for artifact filtering, then rather filter by supporting reads if you don't trust the caller's likelihood model for borderline calls.
4) This will at some point filter variants in amplifications and in regions of LOH (see also 7). To filter homozygous private germline variants, I would require a minimum number of reference reads to keep and <0.95 to 0.98 allelic fraction, again mainly to deal with cross-sample contamination.
5) Add ExAC if you can, especially non-caucasian populations.
6) This will filter truncating mutations in tumor suppressor genes. So keep truncating mutations, frame-shifts etc.
7) If you want to use allelic fractions, then you need to do it properly, ideally also using copy number adjusted for purity and ploidy (heterozygous SNPs are rare, but you have lots of coverage information close by). But this is hard (see here, here and here - disclaimer, the latter is work by us). Since you filter everything that is not in COSMIC anyways, I guess your goal of 7) is mainly to ignore germline variants in COSMIC? Then maybe require a higher number of COSMIC hits (3-5), especially in difficult genomic regions of low mappability.
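Several of the suggestions above (filter by supporting reads, require some reference reads and an allelic fraction below 0.95, keep truncating mutations, require several COSMIC hits) could be combined roughly as in this Python sketch. The function name, its arguments, and the `min_alt_reads=5` and `min_ref_reads=3` defaults are assumptions for illustration; only the 0.95 allelic fraction cap and the lower end of the 3-5 COSMIC hit range are taken from the points above. For simplicity it keeps any truncating mutation rather than checking for tumor suppressor genes.

```python
def keep_variant(alt_reads, ref_reads, cosmic_hits, is_truncating,
                 min_alt_reads=5, min_ref_reads=3,
                 max_af=0.95, min_cosmic_hits=3):
    """Decide whether a tumor-only call is kept as a somatic candidate."""
    depth = alt_reads + ref_reads
    if depth == 0:
        return False
    af = alt_reads / depth
    # Point 3: filter borderline calls by supporting reads (assumed cut-off)
    if alt_reads < min_alt_reads:
        return False
    # Point 4: likely homozygous germline or contamination if almost no
    # reference reads remain or the allelic fraction is very high
    if ref_reads < min_ref_reads or af >= max_af:
        return False
    # Point 6: keep truncating mutations even without COSMIC support;
    # point 7: otherwise require several COSMIC hits
    return is_truncating or cosmic_hits >= min_cosmic_hits


# Hypothetical examples
frameshift_call = keep_variant(alt_reads=30, ref_reads=70,
                               cosmic_hits=0, is_truncating=True)
near_homozygous = keep_variant(alt_reads=99, ref_reads=1,
                               cosmic_hits=10, is_truncating=False)
```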

I did not see a pool of normals filter in your list, that is usually extremely helpful in removing noise.