Question: Variant filtration on lowGQ values
4 days ago, geneart$$20 (United States) wrote:

Hi all, I have a VCF that I made by following the GATK best practices workflow, and I filtered genotypes with low GQ (< 20). However, I understand that they are not removed; instead they are tagged as "FILTER_GQ-20" in my VCF.

gatk VariantFiltration \
      -V all_jointcalls_sRecal_allPASS_PP.vcf \
      -G-filter "GQ < 20" -G-filter-name "FILTER_GQ-20" \
      -O all_jointcalls_sRecal_allPASS_PP2.vcf

I tried to remove all rows with FILTER_GQ-20 by doing a simple grep:

cat all_jointcalls_sRecal_allPASS_PP2.vcf | grep -v "FILTER_GQ-20" > all_jointcalls_sRecal_allPASS_GQ20orhiger.vcf

Then I checked how many good records remain (GQ of 20 or higher):

cat all_jointcalls_sRecal_allPASS_GQ20orhiger.vcf | wc -l
212298

This seems way low when compared to the original vcf from Genotype Posteriors:

all_jointcalls_sRecal_allPASS_PP2.vcf which has 3598528 variants.
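One possible explanation for the drop, sketched on a toy file: in a multisample VCF, grep -v works on whole lines, so it discards every site where at least one sample carries the tag, even if other samples at that site pass. The two-sample file, its GT:FT:GQ FORMAT layout, and the sample names S1 and S2 below are made up for illustration.

```shell
# Toy two-sample VCF; the GT:FT:GQ layout and sample names are assumptions.
tmp=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:FT:GQ\t0/1:PASS:99\t0/1:FILTER_GQ-20:12\n'
  printf 'chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:FT:GQ\t0/1:PASS:60\t1/1:PASS:45\n'
  printf 'chr1\t300\t.\tG\tA\t50\tPASS\t.\tGT:FT:GQ\t0/0:FILTER_GQ-20:5\t0/1:FILTER_GQ-20:8\n'
} > "$tmp"

# Rows surviving a line-level grep: any site with one tagged sample is lost.
rows_after_grep=$(grep -v '^#' "$tmp" | grep -vc 'FILTER_GQ-20')

# Sites where at least one sample still has an untagged genotype.
sites_with_good_gt=$(awk -F'\t' '!/^#/ {
  for (i = 10; i <= NF; i++) if ($i !~ /FILTER_GQ-20/) { n++; break }
} END { print n + 0 }' "$tmp")

echo "rows after grep -v: $rows_after_grep"                      # 1
echo "sites with an untagged genotype: $sites_with_good_gt"     # 2
rm -f "$tmp"
```

On this toy input the site at position 100 is dropped by grep even though S1 passes there, which is the kind of loss that grows quickly with more samples.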

So my questions are:

How do I remove the variants with FILTER_GQ-20 tags properly, in a GATK way, if a simple Unix command did not do the job right? I checked SelectVariants, but I don't think excluding by filter is right. I checked the other exclude options, but none seems right for what I need to do, hence the post!

Do I need to be worried about the low number passing the GQ filter? This is WES data.

Is it even necessary to remove them for downstream analysis like VariantAnnotator or Funcotator?

Also, on another note: is it an absolute requirement to have a PED file for annotation and Funcotator?

Thank you in advance.

genotype filtering gatk • 56 views

Please have a look at the file itself. See if something is wrong (a bad expression, variants badly filtered). Don't count the number of variants without excluding the header. Count the variants before and after filtering, etc.
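As a minimal sketch of the counting advice, assuming an uncompressed VCF (the toy file here is made up): anchoring the pattern at the start of the line separates the ## meta lines from the single #CHROM column header, so the two counts differ by exactly one.

```shell
# Toy VCF with two meta lines, one column header, and two records.
tmp=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '##source=toy\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n'
  printf 'chr1\t200\t.\tC\tT\t50\tPASS\t.\n'
} > "$tmp"

# Drops only the "##" meta lines, so the "#CHROM" line is still counted.
with_chrom_line=$(grep -vc '^##' "$tmp")
# Drops every header line, leaving variant records only.
records_only=$(grep -vc '^#' "$tmp")

echo "without ## lines: $with_chrom_line"   # 3
echo "records only:     $records_only"      # 2
rm -f "$tmp"
```

An unanchored pattern such as "##" would also discard any record that happens to contain those characters, so the anchored form is the safer habit.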

How to remove those variants with FILTER_GQ-20 tags properly, in a GATK way . I checked SelectVariants but if I do exclude filter, I dont think it is right.

huhh ?

in a GATK way

I do like gatk but bcftools is fine and faster.

Your other questions depend on what you want to do with your data.

written 4 days ago by Pierre Lindenbaum (130k)

Hi Pierre, thank you for taking the time to reply!

I did take a look at the file before and after filtering. Yes, I had counted without the header. The one-liner in my earlier post has no grep -v "##" because, when we use grep to filter a VCF, the header following ## is not retained. But the output gives the same number:

zcat all_jointcalls_sRecal_allPASS_PP2_GQ20orHigher.vcf | grep -v "##" | wc -l
212298

(well, minus 1 here, because this count includes the header line starting with #CHROM, POS, etc.)

cat all_jointcalls_sRecal_allPASS_PP2.vcf |  grep -v "##" | wc -l
3598528

I was hesitant to use the bcftools options to filter and thought GATK might have a way of doing this, hence the post. I guess I will have to try bcftools and see if that works for me. Thank you again!

written 4 days ago by geneart$$20
4 days ago, geneart$$20 (United States) wrote:

Update on my issue above:

The file I made with grep has such a low number of variants because, for some bizarre reason, it is also removing rows with GQ > 20, seemingly at random! I can't figure out why. So I went ahead with vcftools, but I still see the rows with "FILTER_GQ-20" after this command:

vcftools --remove-filtered-geno FILTER_GQ-20  --vcf all_jointcalls_sRecal_allPASS_PP2.vcf  --recode --recode-INFO-all --out all_jointcalls_sRecal_allPASS_PP2_highGQ20.vcf


vcftools reported: After filtering, kept 3598527 out of a possible 3598527 Sites. So nothing got filtered!

By the way, I also tried --remove-filtered-geno-all, which should exclude all genotypes whose filter flag is not "." (a missing value) or PASS. Nothing was filtered out.
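A possible reading of this result, hedged because it rests on how the vcftools manual describes these options: they exclude genotypes rather than sites, so the site count it reports would not be expected to change. Counting tagged genotypes and tagged sites separately makes the distinction visible. The toy file and its GT:FT:GQ layout below are assumptions for illustration.

```shell
# Toy two-sample VCF; the GT:FT:GQ layout and sample names are assumptions.
tmp=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:FT:GQ\t0/1:PASS:99\t0/1:FILTER_GQ-20:12\n'
  printf 'chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:FT:GQ\t0/1:PASS:60\t1/1:PASS:45\n'
  printf 'chr1\t300\t.\tG\tA\t50\tPASS\t.\tGT:FT:GQ\t0/0:FILTER_GQ-20:5\t0/1:FILTER_GQ-20:8\n'
} > "$tmp"

# Number of records (sites), header excluded.
total_sites=$(grep -vc '^#' "$tmp")
# Sites where at least one sample carries the tag.
tagged_sites=$(grep -v '^#' "$tmp" | grep -c 'FILTER_GQ-20')
# Individual genotypes carrying the tag (one output line per match).
tagged_genotypes=$(grep -v '^#' "$tmp" | grep -o 'FILTER_GQ-20' | wc -l | tr -d ' ')

echo "sites: $total_sites, tagged sites: $tagged_sites, tagged genotypes: $tagged_genotypes"
rm -f "$tmp"
```

Here three genotypes are tagged across two of the three sites, yet a genotype-level filter would still leave all three sites in the file.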

My question is:

Because my VCF is a multisample file, and some samples have FILTER_GQ-20 for a variant while other samples have PASS, is this what is making things not work?

This is from my VCF, from a single row:

./.:1,0:1:FILTER_GQ-20:12:0,3,40:0,12,63        ./.:2,0:2:FILTER_GQ-20:15:0,6,49:0,15,72        ./.:0,0:0:PASS:.:0,0,0:.        ./.:0,0:0:PASS:.:0,0,0:.
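To read the row above field by field, here is a small sketch. It assumes the FORMAT column for this row is GT:AD:DP:FT:GQ:PL:PP, which is not shown in the post, so the field positions are a guess.

```shell
# The four sample columns from the row above, separated by single spaces.
samples='./.:1,0:1:FILTER_GQ-20:12:0,3,40:0,12,63 ./.:2,0:2:FILTER_GQ-20:15:0,6,49:0,15,72 ./.:0,0:0:PASS:.:0,0,0:. ./.:0,0:0:PASS:.:0,0,0:.'

# Assumed FORMAT order: GT:AD:DP:FT:GQ:PL:PP, so GT is field 1, FT field 4, GQ field 5.
summary=$(printf '%s\n' "$samples" | tr ' ' '\n' |
  awk -F: '{ printf "sample%d GT=%s FT=%s GQ=%s\n", NR, $1, $4, $5 }')

printf '%s\n' "$summary"
```

Under that assumption, all four genotypes in this row are already ./., and the two PASS samples have a missing GQ, so a PASS in FT does not by itself mean a confident call.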

If so, there has to be another way to remove variants that have FILTER_GQ-20 per sample. Am I on the right track, or am I missing something here? If that is the case, how do I address it?
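One common way to handle per-sample tags is to set each filtered genotype to no-call and then drop sites where no sample has a called genotype left. GATK's SelectVariants documents a --set-filtered-gt-to-nocall option for the first step; it is worth checking the documentation for your GATK version. The awk sketch below only illustrates the idea on a made-up two-sample file with an assumed GT:FT:GQ layout, and is not a replacement for a proper VCF tool.

```shell
# Toy two-sample VCF; the GT:FT:GQ layout and sample names are assumptions.
tmp=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:FT:GQ\t0/1:PASS:99\t0/1:FILTER_GQ-20:12\n'
  printf 'chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:FT:GQ\t0/1:PASS:60\t1/1:PASS:45\n'
  printf 'chr1\t300\t.\tG\tA\t50\tPASS\t.\tGT:FT:GQ\t0/0:FILTER_GQ-20:5\t0/1:FILTER_GQ-20:8\n'
} > "$tmp"

cleaned=$(awk -F'\t' 'BEGIN { OFS = "\t" }
/^#/ { print; next }                       # pass header lines through
{
  # Find the FT subfield in the FORMAT column rather than assuming its position.
  nf = split($9, fmt, ":"); ft = 0
  for (k = 1; k <= nf; k++) if (fmt[k] == "FT") ft = k
  called = 0
  for (i = 10; i <= NF; i++) {
    n = split($i, f, ":")
    # A genotype whose FT is neither PASS nor "." becomes a no-call (GT is field 1).
    if (ft && ft <= n && f[ft] != "PASS" && f[ft] != ".") f[1] = "./."
    if (f[1] !~ /\./) called++             # fully called genotype
    s = f[1]; for (k = 2; k <= n; k++) s = s ":" f[k]
    $i = s
  }
  if (called) print                        # keep sites with a called genotype
}' "$tmp")

printf '%s\n' "$cleaned"
rm -f "$tmp"
```

On the toy input, the site at position 100 is kept with S2 set to ./., position 200 is untouched, and position 300 is dropped because both genotypes were filtered.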
