Question: Infer somatic mutations without normal control
CY wrote (5 months ago, United States):

Say we have called a list of variants from a tumor sample (no normal control is available) and want to set filtering criteria that separate somatic mutations from germline variants and from false-positive calls. We have set the criteria listed below. For each variant called:

1) ignore positions with depth < 50

2) ignore variants with call quality < 30

3) ignore variants with allele frequency < 0.02

4) ignore variants with allele frequency equal to 0.5 or > 0.9

5) ignore variants that appear in dbSNP

6) ignore variants that are not in COSMIC

7) check neighboring variants to see whether they have similar allele frequencies. If so, the region may be polyploid and all variants within it may actually be germline. For example, allele frequencies of 4 neighboring variants in a tetraploid region: 0.23; 0.21; 0.19; 0.24

Can anyone comment on whether any of these criteria sound unreasonable? More importantly, can you think of additional criteria that may help identify real somatic mutations? Much appreciated.
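The criteria above can be sketched in Python roughly as follows. This is only an illustration of the listed thresholds: the `Variant` record and its field names are hypothetical (not taken from any variant caller), and `het_tol`, `window`, `tol` and `min_cluster` are assumed parameters that the post does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    # Hypothetical record; field names are illustrative, not tied to any caller.
    chrom: str
    pos: int
    depth: int
    qual: float
    af: float        # variant allele frequency, between 0 and 1
    in_dbsnp: bool
    in_cosmic: bool


def passes_basic_filters(v: Variant, het_tol: float = 0.05) -> bool:
    """Criteria 1-6 as listed; het_tol is an assumed tolerance around 0.5."""
    if v.depth < 50:                       # 1) low depth
        return False
    if v.qual < 30:                        # 2) low call quality
        return False
    if v.af < 0.02:                        # 3) very low allele frequency
        return False
    if abs(v.af - 0.5) <= het_tol or v.af > 0.9:   # 4) looks het/hom germline
        return False
    if v.in_dbsnp:                         # 5) present in dbSNP
        return False
    if not v.in_cosmic:                    # 6) absent from COSMIC
        return False
    return True


def flag_similar_af_clusters(variants: List[Variant], window: int = 1_000_000,
                             tol: float = 0.05, min_cluster: int = 4) -> List[Variant]:
    """Criterion 7: return variants that have at least min_cluster - 1 neighbours
    on the same chromosome, within `window` bp, whose allele frequency differs
    by at most `tol`."""
    flagged = []
    for v in variants:
        similar = [
            w for w in variants
            if w is not v
            and w.chrom == v.chrom
            and abs(w.pos - v.pos) <= window
            and abs(w.af - v.af) <= tol
        ]
        if len(similar) + 1 >= min_cluster:
            flagged.append(v)
    return flagged
```

The neighbour check is a simple quadratic scan, which is fine for a short variant list but would need a sorted sliding window for genome-wide call sets.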


I strongly recommend against filtering with dbSNP. dbSNP contains somatic variants that have functional consequences and can be disease-driving. It is better to filter against the 1000 Genomes Project, as these variants reflect common human variation. This could be done with the Variant Effect Predictor (VEP) from Ensembl.
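As a rough sketch of what filtering on common population variation (rather than on mere presence in dbSNP) could look like: the function below takes per-population allele frequencies that are assumed to have been annotated already, for example by an annotation tool. The function name, the dictionary layout and the 1% cutoff are assumptions for illustration, not VEP output fields or a recommended threshold.

```python
from typing import Mapping, Optional


def is_common_in_population(pop_af: Mapping[str, Optional[float]],
                            max_af: float = 0.01) -> bool:
    """Return True if the variant reaches max_af in any annotated population.

    pop_af maps a population label to its allele frequency; None means the
    variant was not observed in that population.
    """
    return any(af is not None and af >= max_af for af in pop_af.values())


# Hypothetical annotations: the population labels are only examples.
variant_pop_af = {"AFR": None, "EUR": 0.12, "EAS": 0.08}
keep = not is_common_in_population(variant_pop_af)
```

A variant that is rare or absent in every population is kept as a somatic candidate; one that is common anywhere is treated as likely germline.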

written 5 months ago by ATpoint

Agree. Thanks for sharing

written 5 months ago by CY

markus.riester wrote (5 months ago):
  1. I assume you are talking about high coverage data? Then yes, ad hoc, but pretty standard. Ideally one examines the coverage of the assay over many samples and understands why some regions have low coverage. Then understand how this impacts your variant calling.
  2. Should be fine.
  3. Depends on what you want to accomplish with this filter. But yes, 2-5% is fairly standard for baseline tissue samples, mainly to filter variants that are potentially cross-contaminated or so sub-clonal that they are unlikely to be functional. If cross-contamination is not a concern and this is for artifact filtering, then filter by the number of supporting reads instead if you don't trust the caller's likelihood model for borderline calls.
  4. This will at some point filter variants in amplifications and in regions of LOH (see also 7). To filter homozygous private germline variants, I would require a minimum number of reference reads and an allelic fraction below 0.95 to 0.98 in order to keep a variant, again mainly to deal with cross-sample contamination.
  5. Add ExAC if you can, especially non-Caucasian populations.
  6. This will filter truncating mutations in tumor suppressor genes. So keep truncating mutations, frameshifts, etc.
  7. If you want to use allelic fractions, then you need to do it properly, ideally also using copy number adjusted for purity and ploidy (heterozygous SNPs are rare, but you have lots of coverage information close by). But this is hard (see here, here and here - disclaimer, the latter is work by us). Since you filter everything that is not in COSMIC anyway, I guess your goal with 7) is mainly to ignore germline variants in COSMIC? Then maybe require a higher number of COSMIC hits (3-5), especially in difficult genomic regions of low mappability.
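To make points 4 and 7 concrete, here is a minimal sketch of the usual mixture model for the expected allelic fraction of a variant in a tumor sample without a matched normal. It is not the method of the linked papers; it assumes that normal cells are diploid and that a germline variant is heterozygous in them, and the function and parameter names are chosen for illustration.

```python
def expected_af(purity: float, tumor_cn: int, multiplicity: int,
                germline: bool, normal_cn: int = 2) -> float:
    """Expected allelic fraction under a simple tumor/normal mixture.

    purity       -- fraction of tumor cells in the sample (0..1)
    tumor_cn     -- total copy number at the locus in tumor cells
    multiplicity -- number of tumor copies carrying the variant
    germline     -- True for a heterozygous germline variant (one mutated
                    copy in normal cells), False for a somatic variant
    """
    alt_in_normal = 1 if germline else 0
    alt_copies = purity * multiplicity + (1 - purity) * alt_in_normal
    total_copies = purity * tumor_cn + (1 - purity) * normal_cn
    return alt_copies / total_copies


# Diploid locus, 60% purity: germline het is expected near 0.5, somatic near 0.3.
print(expected_af(0.6, tumor_cn=2, multiplicity=1, germline=True))
print(expected_af(0.6, tumor_cn=2, multiplicity=1, germline=False))
# Tetraploid locus with one mutated copy: a germline variant moves away from 0.5.
print(expected_af(0.6, tumor_cn=4, multiplicity=1, germline=True))
```

The last call shows why a fixed "0.5 means germline" rule breaks down in copy-number-altered regions: the expected fraction depends on purity, copy number and multiplicity together.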

I did not see a pool of normals filter in your list, that is usually extremely helpful in removing noise.
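A pool-of-normals filter can be sketched as below, assuming each normal sample has been processed with the same pipeline and its calls are available as `(chrom, pos, ref, alt)` tuples. The function names and the `min_samples` cutoff are illustrative assumptions, not part of any specific tool.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def build_pool_of_normals(normal_calls: Iterable[Iterable[VariantKey]],
                          min_samples: int = 2) -> Set[VariantKey]:
    """Collect variants called in at least min_samples unrelated normal samples."""
    counts: Counter = Counter()
    for sample_calls in normal_calls:
        counts.update(set(sample_calls))   # count each variant once per sample
    return {key for key, n in counts.items() if n >= min_samples}


def remove_pool_hits(tumor_calls: Iterable[VariantKey],
                     pool: Set[VariantKey]) -> List[VariantKey]:
    """Drop tumor calls that recur in the pool (likely artifacts or germline)."""
    return [key for key in tumor_calls if key not in pool]
```

Calls that recur across unrelated normals are more likely to be systematic artifacts or common germline variants than true somatic events, which is why this filter tends to remove a lot of noise.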


Hi Markus, I really appreciate your comments. I have several questions, though:

For point 4, my goal is to remove heterozygous / homozygous germline mutations. Since tumor purity is hardly ever 100%, I consider an allele frequency of 50% or near 100% to indicate a germline variant. I am not sure why you mentioned cross-sample contamination and set the allele frequency cutoff this high. Could you please explain further?

For point 5, why did you emphasize non-Caucasian populations?

For point 6, why did you mention "truncating mutations in tumor suppressor genes"? I set points 5 and 6 because somatic mutations are usually present in COSMIC and missing from 1000 Genomes.

written 5 months ago by CY

Apologies for the late response, just saw this.

4) I'm just saying that this filter will, under some circumstances, remove somatic variants. But yes, in most cases, especially low-purity samples, you will be fine. For the proper way, see 7 (see here for another recent paper). Cross-sample contamination can make homozygous SNPs appear heterozygous, so a 0.95 filter should remove all homozygous SNPs, even in the presence of contamination. And yes, somatic variants with such a high allelic fraction are rare, but they do occur in high-purity samples or amplifications.

5) Non-Caucasian populations are underrepresented in public databases, but if you use large databases like ExAC or gnomAD, the ethnicity bias for common germline filtering becomes fairly small.

6) No, not always. It also depends on how you annotate indels.
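The contamination argument in 4) can be illustrated with a small calculation. This is a simplified sketch assuming a diploid locus and a single contaminating sample; the function and parameter names are hypothetical.

```python
def observed_af(sample_alt_copies: int, contaminant_alt_copies: int,
                contamination: float) -> float:
    """Observed allelic fraction at a diploid locus when a fraction
    `contamination` of the reads comes from another individual.

    Each *_alt_copies argument is the genotype as a count of alternate
    alleles: 0 (hom ref), 1 (het) or 2 (hom alt).
    """
    own = (1 - contamination) * sample_alt_copies / 2
    foreign = contamination * contaminant_alt_copies / 2
    return own + foreign


# A homozygous SNP in the sample, contaminated by a homozygous-reference
# individual: 5% contamination pulls the fraction from 1.0 down to about 0.95.
print(observed_af(2, 0, 0.05))
# If the contaminant is also homozygous for the SNP, the fraction stays at 1.0.
print(observed_af(2, 2, 0.05))
```

Under these assumptions, a homozygous germline SNP no longer sits at exactly 1.0 once the sample is contaminated, so a cutoff somewhat below 1.0 is needed to catch it.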

written 4 months ago by markus.riester

CY wrote (5 months ago, United States):

Can anyone share some opinions on this topic? I would really appreciate it!
