Question: Filter VCF with bash commands
1
gravatar for ajuiwl
9 months ago by
ajuiwl30
ajuiwl30 wrote:

Hi, I have been trying to manually filter a vcf file by QD score, but I have not been able to do it properly. I thought of using awk and indicate that a specific field has a value greater than 8 (that is the value I want to filter it), but I have records where the QD variable does not occupy the same position.

1       2422614 .       G       A       3711.77 .       AF=1;DP=120;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=30.93;SOR=1.773      GT:AD:ADO:DP:GQ:PL      1/1:0,120:0:120:99:0

1       2430617 .       C       T       2052.76 .       AF=0.5;BaseQRankSum=-1.791;ClippingRankSum=0;DP=292;ExcessHet=3.0103;FS=6.978;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=7.03;ReadPosRankSum=-1.813;SOR=0.652      GT:AD:ADO:DP:GQ:PL  0/1:144,148:0:292:99:3740

Thank you very much

bash awk grep vcf • 657 views
ADD COMMENTlink modified 9 months ago by dariober10k • written 9 months ago by ajuiwl30
2

Hello ajuiwl ,

why do you want to use awk and not a program that is already specilized on parsing and filtering a vcf like bcftools view?

If you realy need to use awk you will have to use something like match() and define a regex pattern.

fin swimmer

ADD REPLYlink written 9 months ago by finswimmer12k

Some programs change the way variants are normalized, the header, require installations, require dependencies, and other issues that bash code can avoid, moreover I have more control over what I do.

ADD REPLYlink written 9 months ago by ajuiwl30
1

Hello again,

I like bash as well. But when it comes to specific tasks where already widely used program exists, I strongly recommend using this. If reinventing the wheel, there is a high chance that you miss something very important. See the discussion about the perl solution below. The problems discussed there are valid for your solution as well.

I would suggest you take a look at bioconda. Using this you will have (nearly) no problems with installations and dependencies. (See also the first part of this tutorial a have written: Creating workflows with snakemake and conda )

fin swimmer

ADD REPLYlink written 9 months ago by finswimmer12k
1

That is true, I just found that there are some variants in the vcf that didn't have INFO about QD nor DP, and the commands I did didn't considered this. So I agree with you using already made software in that sense. In addition I will add that sometimes you are working in a server that requires the software to be installed by an admin and it may take some time to get it installed. And that doing it by yourself, you spend more time but you can discover things that you had not considered before, increasing your knowledge about the topic. Thank you for your time!!

ADD REPLYlink written 9 months ago by ajuiwl30

Hi, try this line; perl -ne 'print $_ if /QD=(\d+)/ && $1>8' test.txt |less -S

ADD REPLYlink modified 9 months ago by finswimmer12k • written 9 months ago by afli190

This answer does not consider decimal positions, in the way that 8.1 won't be kept.

ADD REPLYlink written 9 months ago by ajuiwl30

Sorry, but this is easy to solve. perl -ne 'print $_ if /QD=(\d+(\.\d+)?)/ && $1>8' test.txt |less -S

ADD REPLYlink written 9 months ago by afli190

Just a couple of notes, if I'm not mistaken... the perl script will not interpret correctly tags ending in QD, e.g. a line with XQD=10;QD=2 will be incorrectly returned. Also, negative numbers are not parsed correctly. If you filter for QD > -10, a line with QD=-1 will be missed.

ADD REPLYlink written 9 months ago by dariober10k

OK, let's modify this further: perl -ne 'print $_ if /;QD=(-?\d+(\.\d+)?)/ && $1>8' test.txt |less -S

ADD REPLYlink modified 9 months ago • written 9 months ago by afli190

Sorry to be a pain... This fails if the QD tag is the first one in the INFO field since in such case it will not be preceded by ;.

ADD REPLYlink modified 9 months ago • written 9 months ago by dariober10k

Never mind. If so, you can add another '?' : perl -ne 'print $_ if /;?QD=(-?\d+(\.\d+)?)/ && $1>8' test.txt |less -S

It is just a solution for simple data as supplied in the post, with more complex data, we can do some modification of this, or use other better methods as below.

ADD REPLYlink modified 9 months ago • written 9 months ago by afli190
2
gravatar for ajuiwl
9 months ago by
ajuiwl30
ajuiwl30 wrote:

Thank you for your answers, I found the way of doing it with:

grep -Po 'QD=.*' $FILE | cut -d";" -f1 | sed -r 's/[QD\=]//g' | paste -d'\t' $FILE  - | awk 'BEGIN{OFS="\t"}{if ($11>8) {$11=""; print}}'
ADD COMMENTlink modified 9 months ago • written 9 months ago by ajuiwl30
1

Hello ajuiwl,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

code_formatting

Thank you!

ADD REPLYlink written 9 months ago by finswimmer12k
2
gravatar for dariober
9 months ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

If you want a pure awk solution, what about this?

awk -v FS="\t" '{
    qd=$8; 
    sub(/(.*;QD=)|(QD=)/, "", qd); 
    sub(/;.*/, "", qd); 
    qd= qd+0; 
    if($1 ~ "^#" || qd > 8) {
        print $0
    }
}' $FILE

However, I would go for bcftools view as suggested by @finswimmer.

ADD COMMENTlink written 9 months ago by dariober10k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1457 users visited in the last hour