Question: Discrimination Between Germline And Somatic Mutations In Tumor Without The Availability Of The Normal Paired Sample
6.9 years ago by
Fred Fleche4.3k
Paris, France
Fred Fleche4.3k wrote:


Let's say that I have a whole-exome sequencing data file that was generated without a matched normal sample for the sequenced tumor. Is there a way to distinguish between germline and somatic variants? I was thinking of comparing the variants against the COSMIC (Catalogue Of Somatic Mutations In Cancer) database.

So I was wondering whether anyone could suggest an accurate workflow that uses sources other than COSMIC.



somatic mutation • 18k views
ADD COMMENTlink modified 6.9 years ago by Stefano Berri4.1k • written 6.9 years ago by Fred Fleche4.3k
6.9 years ago by
Lund, Sweden
Christof Winter990 wrote:

Here is what I do:

  1. Flag known germline variants by looking in dbSNP. I use a subset of dbSNP (> 1% minor allele frequency, mapping only once to reference assembly, and not flagged as "clinically associated"). You can get such a file for ANNOVAR (database name is snp137NonFlagged for the current dbSNP build), see

  2. Flag known somatic variants by looking in COSMIC. This usually finds well-described hotspot mutations (such as activating KRAS mutations), but overall will not find most of your true somatic variants (my guess). I usually take the whole of COSMIC, irrespective of tumor type.

  3. Add other cancer sequencing studies (e.g. TCGA), as many of these are not yet in COSMIC. For TCGA, I use the publicly available MAF files. Level 3 MAF files contain experimentally validated somatic mutations only. Level 2 MAF files also contain the unvalidated ones (and can contain germline variants).

  4. Look at the variant allele frequency. If it's 100%, i.e. all reads show the variant, it's very likely germline (unless your tumor sample is 100% tumor cells and all tumor cells have the mutation). If it's below 10%, it can well be an artifact, see e.g.

  5. Check how all of the mismatches in your data (non-reference bases in the alignment) are distributed along the reads from 5' to 3'. If you have a much higher mismatch rate at the first/last bases of your reads, you might want to exclude these read positions.

  6. Filter your variant list further, as it will likely contain a considerable number of false positives. Table 1 of the VarScan paper is a good start (read position, strand, variant read number and frequency, distance to 3', homopolymer, mapping quality and read length difference).
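
Steps 1, 2 and 4 above can be sketched in a few lines of Python. This is only a minimal illustration, not an actual pipeline: the `Variant` class, the `flag_variant` function, the flag names and the example coordinates are all hypothetical, and the two sets stand in for lookups you would build yourself from your dbSNP subset and from COSMIC/TCGA. The 10% and 100% cut-offs come from step 4 and are exposed as parameters so they can be relaxed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One called variant with its read support (hypothetical container)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    alt_reads: int  # reads supporting the variant allele
    depth: int      # total reads covering the position

    @property
    def key(self):
        # Identity used for database lookups.
        return (self.chrom, self.pos, self.ref, self.alt)

    @property
    def vaf(self):
        # Variant allele frequency; 0.0 when the site is uncovered.
        return self.alt_reads / self.depth if self.depth else 0.0


def flag_variant(v, germline_keys, somatic_keys, low_vaf=0.10, high_vaf=1.0):
    """Return a list of flags; flags annotate the variant, they do not discard it."""
    flags = []
    if v.key in germline_keys:      # step 1: known polymorphism
        flags.append("known_germline")
    if v.key in somatic_keys:       # steps 2-3: seen in a cancer study
        flags.append("known_somatic")
    if v.vaf >= high_vaf:           # step 4: all reads show the variant
        flags.append("vaf_100_likely_germline")
    elif v.vaf < low_vaf:           # step 4: possible artifact
        flags.append("vaf_below_10_possible_artifact")
    return flags


# Made-up coordinates, purely for illustration.
germline_keys = {("chr1", 1000, "A", "G")}
somatic_keys = {("chr2", 2000, "C", "T")}

calls = [
    Variant("chr1", 1000, "A", "G", alt_reads=40, depth=40),
    Variant("chr2", 2000, "C", "T", alt_reads=12, depth=40),
    Variant("chr3", 3000, "G", "A", alt_reads=2, depth=50),
]
results = {v.key: flag_variant(v, germline_keys, somatic_keys) for v in calls}
for key, flags in results.items():
    print(key, flags)
```

Keeping the result as a list of flags rather than a single verdict makes it easy to apply the further filters from steps 5 and 6 afterwards.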

ADD COMMENTlink modified 6.9 years ago • written 6.9 years ago by Christof Winter990

A small addition to point number 1: You could also look for known variants in the 1000 genomes project data.

ADD REPLYlink written 6.9 years ago by fo3c430

Thanks a lot for sharing your experience. I really appreciate it. Fred

ADD REPLYlink written 6.9 years ago by Fred Fleche4.3k
6.9 years ago by
Stefano Berri4.1k
Cambridge, UK
Stefano Berri4.1k wrote:

Looking at already known cancer mutations is fine, but it only tells you about what is already known.

Personally, I would look at the variant allele frequency. A germline variant is at either 100% (homozygous) or 50% (heterozygous); clearly not exactly 50%, but around there.

If it is a somatic mutation and your data come from clinical samples (not cell lines), then infiltration with normal cells is inevitable and your mutations will typically be at 30-40%.

If coverage is high enough, you might confidently distinguish between the two.

To better understand what I mean, I suggest this great paper
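
The coverage argument above can be made concrete with a small Python sketch. It rests on simplifying assumptions that are not stated in the answer: the region is diploid and copy-neutral, the somatic mutation is clonal and heterozygous (so its expected frequency is half the tumor purity), and read counts follow a binomial distribution. The function names are hypothetical. The exact two-sided binomial test asks how surprising the observed variant read count would be for a heterozygous germline variant at 50%.

```python
from math import exp, lgamma, log


def expected_somatic_vaf(purity):
    """Expected VAF of a clonal heterozygous somatic mutation in a diploid,
    copy-neutral region: one of two alleles, in the tumor fraction only."""
    return purity / 2


def binom_pmf(k, n, p):
    """Binomial probability of k variant reads out of n (0 < p < 1),
    computed in log space so that deep coverage does not overflow."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))


def germline_het_p_value(alt_reads, depth, p=0.5):
    """Exact two-sided p-value: total probability of all read counts that are
    no more likely than the observed one under a 50% germline variant."""
    probs = [binom_pmf(k, depth, p) for k in range(depth + 1)]
    observed = probs[alt_reads]
    # Small relative tolerance so symmetric outcomes count as ties.
    total = sum(q for q in probs if q <= observed * (1 + 1e-7))
    return min(1.0, total)


# Tumor purity of 60-80% gives the 30-40% range under the assumptions above.
print(expected_somatic_vaf(0.6), expected_somatic_vaf(0.8))

# The same 30% frequency at two depths.
p_shallow = germline_het_p_value(3, 10)    # 3 of 10 reads
p_deep = germline_het_p_value(30, 100)     # 30 of 100 reads
print(p_shallow, p_deep)
```

With 3 of 10 reads the p-value is about 0.34, so a 50% germline variant cannot be ruled out; with 30 of 100 reads it falls well below 0.001. Copy-number changes and subclonal mutations shift the expected frequencies, so a small p-value alone is a hint rather than proof of somatic origin.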

ADD COMMENTlink written 6.9 years ago by Stefano Berri4.1k