Question

Flanking sequences around cancer variants

5.6 years ago

Gene_MMP8 ▴ 240

I want to analyze the flanking sequences around cancer variants. I have already downloaded cancer variants from the COSMIC database. Now I am planning to extract the flanking sequences from the reference build used by the variant database (hg38 in this case) using bioMart. But I doubt whether those flanking sequences properly represent what we expect from a cancer genome. For example:
In AAGCT, G is the variant position and AA and CT are the flanking sequences. What if, in the original cancer BAM file that contains the variant G at that position, AA and CT are mutated to AG and CA? In other words, if I want to study flanking sequences of cancer variants, is it a good idea to extract these sequences from the reference build? How much variation in the data am I losing just by doing this?

genome snp cancer_variants • 1.2k views

updated 5.6 years ago by jared.andrews07 ★ 16k • written 5.6 years ago by Gene_MMP8 ▴ 240

What is the question that you want to answer?

5.6 years ago by ATpoint 82k

score 3 · Accepted Answer · 2018-09-04

In other words, if I want to study flanking sequences of cancer variants, is it a good idea to extract these sequences from the reference build?

There are two ways to deal with this:

1. Build a consensus sequence from your variant file and the reference genome, then run your analysis on that. This is fairly easy to do with GATK.
2. In whatever code you write to extract the flanking sequence, check whether any variants in your VCF file also fall within the flank, and adjust the sequence where they do. This is more annoying but potentially more informative, since you can easily track how often it happens.
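The second option could look roughly like the sketch below. It is only an illustration under several assumptions: the function name `flank_with_variants` is made up, positions are 0-based, the reference is already loaded as a plain string, only SNVs are handled (no indels), and every variant passed in is assumed to come from the same sample, which matters for COSMIC because its variants are pooled from many tumours.

```python
from typing import Dict, Tuple


def flank_with_variants(
    ref_seq: str,
    pos: int,
    flank: int,
    snvs: Dict[int, Tuple[str, str]],
) -> Tuple[str, str, int]:
    """Return (left_flank, right_flank, n_adjusted) around 0-based `pos`.

    `snvs` maps a 0-based position to (ref_base, alt_base).
    Hypothetical helper: SNVs only, all from one sample.
    """
    # Clip the window to the ends of the reference sequence.
    start = max(0, pos - flank)
    end = min(len(ref_seq), pos + flank + 1)
    bases = list(ref_seq[start:end])

    n_adjusted = 0
    for p, (ref_base, alt_base) in snvs.items():
        # Skip the variant of interest itself and anything outside the window.
        if p == pos or not (start <= p < end):
            continue
        # Guard against a coordinate or build mismatch.
        if ref_seq[p] != ref_base:
            raise ValueError(f"reference base mismatch at position {p}")
        bases[p - start] = alt_base
        n_adjusted += 1

    centre = pos - start
    left = "".join(bases[:centre])
    right = "".join(bases[centre + 1:])
    return left, right, n_adjusted


# The example from the question: AAGCT with flanks mutated to AG and CA.
left, right, n = flank_with_variants(
    "AAGCT", pos=2, flank=2, snvs={1: ("A", "G"), 4: ("T", "A")}
)
print(left, right, n)  # AG CA 2
```

Returning the count of adjusted bases makes it easy to tally how many flanks differ from the reference across the whole variant set.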

How much variation in the data am i losing just by doing this?

This is a lot tougher to answer without more info - number of variants, are you looking only at SNPs or indels as well, what sort of analysis are you running on the flanking sequence, etc. More info will yield more/better answers.