Question

Filetring in vcf files

0

Entering edit mode

7.4 years ago

Gene_MMP8 ▴ 240

So I have a vcf file (myvar.vcf) and I want to do the following:-
1. Remove all mutations tagged as INDELS.
2. Consider mutaions that have a read depth greater than 100
3.To get all the mutations that occur in the tumor genome with respect to the control genome, we select all the mutations that have a genotype of 0/0 for the control genome and have a genotype of 1/1 or 0/1 for the tumor genome.

I have been trying all the three for quite a long time, but with no result. So can anyone help me with the steps? Thanks.

alignment • 2.5k views

ADD COMMENT • link updated 7.4 years ago by Pierre Lindenbaum 164k • written 7.4 years ago by Gene_MMP8 ▴ 240

2

Entering edit mode

Have you tried doing it with SNPeff/SNPsift? The manual is well written and should guide you to do that.

ADD REPLY • link 7.4 years ago by VHahaut ★ 1.2k

0

Entering edit mode

Thanks. Wil try that

ADD REPLY • link 7.4 years ago by Gene_MMP8 ▴ 240

2

Entering edit mode

"read depth greater than 100" in both samples? What if one has read depth 100 and the other 99? Total read depth? What if a mutation appears somatic as your tumour has read depth 100 and your normal 0?

ADD REPLY • link 7.4 years ago by d-cameron ★ 2.9k

0

Entering edit mode

i have 12 samples in all. So read depth > 100 in all of them. I would filter out these sites because I dont want to trust mutations at sites with very high coverage, because they might represent variation between variable copy number repeats, i.e., the reads that map to this location in the reference are actually from duplicated sites in the sample.

ADD REPLY • link 7.4 years ago by Gene_MMP8 ▴ 240

2

Entering edit mode

Your genotyping filter assumes diploid. This condition is frequently violated in tumours. When you also consider tumour purity and sub-clonality, variant allele frequencies of 0.33/0.66 are perfectly normal even for diploid tumours.

ADD REPLY • link 7.4 years ago by d-cameron ★ 2.9k

0

Entering edit mode

7.4 years ago

d-cameron ★ 2.9k

What program generated the VCF? A somatic variant caller (which you should be using if you are doing somatic variant analysis) will typically use the SOMATIC flag to identify tumour-only mutations.

Have you looked at the VCF headers in your file? The R VariantAnnotation package will allow you to perform each of the filters you want to do each in 1 line of code.

ADD COMMENT • link 7.4 years ago by d-cameron ★ 2.9k

0

Entering edit mode

I have tried Variant Annotation package. But it takes a very long time. Each tumor and control file is 100GB in size on an average. Any suggestions?

ADD REPLY • link 7.4 years ago by Gene_MMP8 ▴ 240

score 2 · Accepted Answer · 2017-05-13

using vcffilterjs http://lindenb.github.io/jvarkit/VCFFilterJS.html

java -jar dist/vcffilterjs.jar -f script.js input.vcf

and the following script:

var sample2status={
    "Sample1":0,
    "Sample2":0,
    "Sample3":1
    };

function accept(v)
    {
    if(v.isIndel()) return false;
    if(v.hasAttribute("DP") && v.getAttributeAsInt("DP",0) <100) return false;
    for(var sampleName in sample2status)
        {
        var status = sample2status[sampleName];
        var genotype= v.getGenotype(sampleName);
        if(genotype==null || genotype.isNoCall()) continue;
        if( status == 0 && !genotype.isHomRef()) return false;
        if( status == 1 && !(genotype.isHomVar() || genotype.isHet())) return false;
        }
    return true;
    }
 accept(variant);