Discrimination Between Germline And Somatic Mutations In Tumor Without The Availability Of The Normal Paired Sample
11.1 years ago


Let's say that I get a whole-exome sequencing data file that was created without a matched normal sample for the sequenced tumor sample. Is there a way to make the distinction between germline and somatic variants? I was thinking of comparing the variants against the COSMIC (Catalogue Of Somatic Mutations In Cancer) database.

Does anyone have suggestions for an accurate workflow that uses sources other than COSMIC?



mutation somatic • 21k views
11.1 years ago
Christof Winter ★ 1.0k

Here is what I do:

  1. Flag known germline variants by looking in dbSNP. I use a subset of dbSNP (> 1% minor allele frequency, mapping only once to reference assembly, and not flagged as "clinically associated"). You can get such a file for ANNOVAR (database name is snp137NonFlagged for the current dbSNP build), see http://www.openbioinformatics.org/annovar/annovar_download.html

  2. Flag known somatic variants by looking in COSMIC. This usually finds well-described hotspot mutations (such as activating KRAS mutations), but overall will not find most of your true somatic variants (my guess). I usually take the whole of COSMIC, irrespective of tumor type.

  3. Add other cancer sequencing studies (e.g. TCGA), as many of these are not yet in COSMIC. For TCGA, I use the MAF files available at https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/. Level 3 MAF files contain experimentally validated somatic mutations only. Level 2 MAF files also contain the unvalidated ones (and can contain germline variants).

  4. Look at the variant allele frequency. If it's 100%, i.e. all reads show the variant, it's very likely germline (unless your tumor sample is 100% tumor cells and all tumor cells have the mutation). If it's below 10%, it can well be an artifact, see e.g. http://www.ncbi.nlm.nih.gov/pubmed/23303777

  5. Check how all of the mismatches in your data (non-reference bases in the alignment) are distributed along the reads from 5' to 3'. If you have a much higher mismatch rate at the first/last bases of your reads, you might want to exclude these read positions.

  6. Filter your variant list further, as it will likely contain a considerable amount of false positives. Table 1 of the VarScan paper http://www.ncbi.nlm.nih.gov/pubmed/22300766 is a good start (read pos, strand, variant read number and frequency, distance to 3', homopolymer, map quality and read length difference).
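
Steps 1, 2 and 4 above can be sketched in a few lines of Python. This is only an illustration, not the exact pipeline used: the `Variant` class, the `flag_variant` function, the flag names, and the example coordinates are all hypothetical, and the two lookup sets are assumed to have been loaded beforehand from dbSNP and COSMIC/TCGA (for example via ANNOVAR output). The 10% cutoff comes from step 4; the 99% cutoff is an assumed stand-in for "all reads show the variant".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting the variant allele

    @property
    def key(self):
        # Identity used for lookups in the known-variant sets
        return (self.chrom, self.pos, self.ref, self.alt)

    @property
    def vaf(self):
        # Variant allele frequency; 0.0 if the site has no coverage
        total = self.ref_reads + self.alt_reads
        return self.alt_reads / total if total else 0.0


def flag_variant(v, known_germline, known_somatic, low_vaf=0.10, high_vaf=0.99):
    """Return a list of flags; the thresholds are assumptions to be tuned."""
    flags = []
    if v.key in known_germline:   # step 1: e.g. common dbSNP variants
        flags.append("known_germline")
    if v.key in known_somatic:    # steps 2-3: e.g. COSMIC or TCGA
        flags.append("known_somatic")
    if v.vaf >= high_vaf:         # step 4: (almost) all reads show the variant
        flags.append("likely_germline_homozygous")
    elif v.vaf < low_vaf:         # step 4: may well be an artifact
        flags.append("possible_artifact")
    return flags


# Made-up example data (coordinates are not real annotations)
germline = {("1", 100, "A", "G")}
somatic = {("12", 5000, "C", "T")}

for v in [
    Variant("12", 5000, "C", "T", ref_reads=70, alt_reads=30),
    Variant("1", 100, "A", "G", ref_reads=0, alt_reads=50),
    Variant("2", 200, "G", "A", ref_reads=95, alt_reads=5),
]:
    print(v.key, round(v.vaf, 2), flag_variant(v, germline, somatic))
```

The flags are deliberately non-exclusive: a variant can be both in dbSNP and at 100% frequency, and it is left to the downstream filtering in steps 5 and 6 to decide what to keep.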


A small addition to point number 1: You could also look for known variants in the 1000 genomes project data.


Thanks a lot for sharing your experience. I really appreciate it. Fred

11.1 years ago

Looking at already known cancer mutations is fine, but it can only tell you about what is already known.

Personally, I would look at the allele frequency of the mutations. A germline variant is at either 100% (homozygous) or 50% (heterozygous; clearly not exactly 50%, but around there).

If it is a somatic mutation and your samples are clinical samples (not cell lines), then infiltration with normal cells is inevitable and your mutations will be at around 30-40%.

If coverage is enough, you might confidently distinguish between the two.
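
To illustrate why coverage matters here, the following is a minimal sketch that tests whether an observed variant read count is compatible with a heterozygous germline variant at 50%. The function name `binom_two_sided_p` and the read counts are made up for the example, and an exact two-sided binomial test is only one possible choice of statistic, not necessarily what the paper mentioned below uses.

```python
from math import exp, lgamma, log


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k variant reads out of n.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one. Assumes 0 < p < 1 and 0 <= k <= n.
    """
    def pmf(i):
        # Binomial probability computed in log space to avoid overflow
        log_coeff = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        return exp(log_coeff + i * log(p) + (n - i) * log(1 - p))

    probs = [pmf(i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance so that symmetric outcomes count as ties
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# The same 35% variant allele frequency at two different depths
print(binom_two_sided_p(7, 20))    # 20x coverage
print(binom_two_sided_p(70, 200))  # 200x coverage
```

At 20x, a 35% frequency cannot be told apart from 50% by this test (p is well above 0.05), whereas at 200x the same frequency is clearly incompatible with 50%. A significant deviation only suggests that the variant is not a plain heterozygous germline variant; copy-number changes or loss of heterozygosity in the tumor can also shift germline allele frequencies.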

To better understand what I mean, I suggest this great paper

