Question

Filtering variants in case-control DNA-seq study

6.5 years ago

YanO ▴ 140

I have a cancer case-control germline DNA sequencing study, 100 cases and 100 healthy controls. I've called my variants, and now I'm looking to filter them. What are the "rules" for filtering out variants present in controls?

Is it a strict dumping of all variants present even once in the controls? Or can I allow one occurrence in controls if it's in a lot of cases? For example, a variant present in 4 cases and 0 controls seems worthy of further analysis. But what about 7 cases vs 2 controls? 5 cases vs 1? Do I need to factor in, for example, the rate at which my cancer occurs in the general population and allow a few instances in controls because of that?

I've done some reading up on this, but can't seem to find any rules or guidelines. Essentially: how can I define "enrichment" in my cases over controls?

Thank you.

genome DNA-seq filtering • 1.4k views

6.4 years ago by YanO ▴ 140

I would check the allele frequency (AF) of the variants in the controls. As controls can be contaminated with tumor cells, it is possible that actual somatic variants pop up in the matched normal, but their AF should be rather low. You could also remove common variants, such as those identified in the 1000 Genomes Project (not dbSNP!), either categorically (allele frequency >0% in all ethnicities) or above a cutoff of a few percent. You could do that with the Ensembl Variant Effect Predictor (VEP).
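
As a minimal sketch of the common-variant filter described above: it assumes each variant has already been annotated (for example by VEP) with a population allele frequency. The record layout, the `pop_af` field name, and the 1% cutoff are illustrative assumptions, not something fixed by this thread.

```python
from typing import Optional

# Hypothetical cutoff: variants at or above this population AF count as common.
MAX_POP_AF = 0.01


def is_rare(pop_af: Optional[float], max_af: float = MAX_POP_AF) -> bool:
    """Keep variants absent from the reference panel (None) or below the cutoff."""
    return pop_af is None or pop_af < max_af


# Made-up example records; "pop_af" stands in for an annotated 1000 Genomes AF.
variants = [
    {"id": "var_a", "pop_af": 0.25},   # common, dropped
    {"id": "var_b", "pop_af": 0.002},  # rare, kept
    {"id": "var_c", "pop_af": None},   # not seen in the panel, kept
]

rare = [v["id"] for v in variants if is_rare(v["pop_af"])]
print(rare)
```

Setting `max_af` to a very small value approximates the "categorical" option, since any variant observed in the panel is then dropped.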

6.5 years ago by ATpoint 82k

Thanks so much for your reply. Why would I remove common variants from 1000 Genomes but not common variants from dbSNP? Should I also remove variants that are common within my own sequencing data? Also, I should have been clearer, sorry: my controls are healthy people, not matched normal tissue. I have edited the post to make that clear.

6.4 years ago by YanO ▴ 140

There are no rules for this particular type of analysis, which is why you did not find anything. It is actually an interesting time to be in cancer research.

The first thing that I'll say to set the scene is that we don't yet know the effect/impact of the vast majority of variants or mutations. We recognise that somatic mutations in certain genes are more likely to be present in certain cancers, such as PIK3CA in breast cancer, or EGFR in lung cancer. TP53 is then often mutated across many types of carcinoma.

Some key points to consider:

Many germline variants will modulate our respective risk of cancer and other illnesses, therefore you cannot completely eliminate variants in your matched normals. As my colleague ATpoint mentions, use allele frequency data from the large population studies to help you with filtering. Variants with a high allele frequency in healthy populations are less likely to greatly increase the risk of cancer because, otherwise, most of us would develop cancer early in life.
From where did you obtain your 'normal' DNA? Tissue surrounding the tumour biopsy? Leukocyte/lymphocyte DNA from the buffy coat of a blood sample? If it is 'normal' tissue surrounding the tumour, then it is not quite 'normal' and may carry somatic mutations from very early tumour clones.
Is your tumour sample from a bulk biopsy? In that case, it will contain many tumour clones, each with a different mutational profile. This will be reflected in your results as somatic mutations called at varying frequencies, ranging from 1% up to almost 100%.

6.5 years ago by Kevin Blighe 87k

Thanks so much for your thoughts. Is filtering out common germline variants a definite rule for this type of analysis? What about variants common within my own cohort? My sample size is so small that rare variants will be hard to verify. Also, I should have been clearer, sorry: my controls are healthy people, not matched normal tissue. I have edited the post to make that clear.

6.4 years ago by YanO ▴ 140

Hello again. From where did you recruit these healthy people? I have found that, in many cancer studies, the controls are actually people who go to clinics because they have a relative who has cancer and/or they already have a benign tumour. These people are then not true controls.

In breast and ovarian cancer, BRCA1 comes to light: a substantial portion of the population carries variants in BRCA1 that raise their risk of cancer. As such, these people would not be good controls.

I guess that all this comes back to the point that there is no genuine healthy control, because we each carry disease risk alleles.

My advice would be to simply tally the variants in both the cases and controls, annotate them with 1000 Genomes MAFs and other information (like SIFT, PolyPhen, etc.), and not apply any hard filters based on the variants found in controls. Then, for example, if you have a variant that is reasonably frequent in the cancer cases and appears in just one control, well, maybe that single control actually has an increased risk of cancer.

Cancer is a complex disease with complex genetics. We'd all like each disease to be explained by just a single mutation or gene (like Li-Fraumeni syndrome and TP53), but it really looks like the majority of diseases are based on extremely complex genetics.
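
As a rough sketch of the tallying idea, the carrier counts for one variant can be put into a 2x2 table and scored without any hard filter. The scoring method here, a one-sided Fisher's exact test, is an assumption on my part and is not prescribed anywhere in this thread; the function name is hypothetical, and counts are carriers (people), not alleles.

```python
from math import comb


def fisher_one_sided(case_carriers: int, n_cases: int,
                     control_carriers: int, n_controls: int) -> float:
    """P(at least `case_carriers` of the carriers fall among the cases),
    under the hypergeometric null of no case/control difference."""
    total_carriers = case_carriers + control_carriers
    n_total = n_cases + n_controls
    denom = comb(n_total, total_carriers)
    upper = min(total_carriers, n_cases)
    numer = sum(
        comb(n_cases, k) * comb(n_controls, total_carriers - k)
        for k in range(case_carriers, upper + 1)
    )
    return numer / denom


# Counts taken from the question: (carriers in cases, carriers in controls),
# with 100 cases and 100 controls.
for cases_hit, controls_hit in [(4, 0), (7, 2), (5, 1)]:
    p = fisher_one_sided(cases_hit, 100, controls_hit, 100)
    print(f"{cases_hit} cases vs {controls_hit} controls: p = {p:.3f}")
```

With 100 cases and 100 controls, even the 4 vs 0 example gives a one-sided p-value of only about 0.06 before any multiple-testing correction, which illustrates why a cohort of this size is better used to rank variants for follow-up than to declare enrichment.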

Good luck

6.4 years ago by Kevin Blighe 87k