Hi, I have been trying to manually filter a vcf file by QD score, but I have not been able to do it properly. I thought of using awk and indicate that a specific field has a value greater than 8 (that is the value I want to filter it), but I have records where the QD variable does not occupy the same position.
1 2422614 . G A 3711.77 . AF=1;DP=120;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=30.93;SOR=1.773 GT:AD:ADO:DP:GQ:PL 1/1:0,120:0:120:99:0 1 2430617 . C T 2052.76 . AF=0.5;BaseQRankSum=-1.791;ClippingRankSum=0;DP=292;ExcessHet=3.0103;FS=6.978;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=7.03;ReadPosRankSum=-1.813;SOR=0.652 GT:AD:ADO:DP:GQ:PL 0/1:144,148:0:292:99:3740
Thank you very much
Hello ajuiwl ,
why do you want to use
awkand not a program that is already specilized on parsing and filtering a
If you realy need to use
awkyou will have to use something like
match()and define a regex pattern.
Some programs change the way variants are normalized, the header, require installations, require dependencies, and other issues that bash code can avoid, moreover I have more control over what I do.
I like bash as well. But when it comes to specific tasks where already widely used program exists, I strongly recommend using this. If reinventing the wheel, there is a high chance that you miss something very important. See the discussion about the
perlsolution below. The problems discussed there are valid for your solution as well.
I would suggest you take a look at bioconda. Using this you will have (nearly) no problems with installations and dependencies. (See also the first part of this tutorial a have written: Creating workflows with snakemake and conda )
That is true, I just found that there are some variants in the vcf that didn't have INFO about QD nor DP, and the commands I did didn't considered this. So I agree with you using already made software in that sense. In addition I will add that sometimes you are working in a server that requires the software to be installed by an admin and it may take some time to get it installed. And that doing it by yourself, you spend more time but you can discover things that you had not considered before, increasing your knowledge about the topic. Thank you for your time!!
Hi, try this line;
perl -ne 'print $_ if /QD=(\d+)/ && $1>8' test.txt |less -S
This answer does not consider decimal positions, in the way that 8.1 won't be kept.
Sorry, but this is easy to solve.
perl -ne 'print $_ if /QD=(\d+(\.\d+)?)/ && $1>8' test.txt |less -S
Just a couple of notes, if I'm not mistaken... the perl script will not interpret correctly tags ending in
QD, e.g. a line with
XQD=10;QD=2will be incorrectly returned. Also, negative numbers are not parsed correctly. If you filter for QD > -10, a line with
QD=-1will be missed.
OK, let's modify this further:
perl -ne 'print $_ if /;QD=(-?\d+(\.\d+)?)/ && $1>8' test.txt |less -S
Sorry to be a pain... This fails if the QD tag is the first one in the INFO field since in such case it will not be preceded by
Never mind. If so, you can add another '?' :
perl -ne 'print $_ if /;?QD=(-?\d+(\.\d+)?)/ && $1>8' test.txt |less -S
It is just a solution for simple data as supplied in the post, with more complex data, we can do some modification of this, or use other better methods as below.
A simpler yet general solution would be this one:
If you don't want to lose the headers in the filtering process, then you should go for this one:
Although I definitely would go for bcftools for VCF filtering.