Question

Filter VCF with bash commands

1

Entering edit mode

6.7 years ago

ajuiwl ▴ 30

Hi, I have been trying to manually filter a vcf file by QD score, but I have not been able to do it properly. I thought of using awk and indicate that a specific field has a value greater than 8 (that is the value I want to filter it), but I have records where the QD variable does not occupy the same position.

1       2422614 .       G       A       3711.77 .       AF=1;DP=120;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=30.93;SOR=1.773      GT:AD:ADO:DP:GQ:PL      1/1:0,120:0:120:99:0

1       2430617 .       C       T       2052.76 .       AF=0.5;BaseQRankSum=-1.791;ClippingRankSum=0;DP=292;ExcessHet=3.0103;FS=6.978;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=7.03;ReadPosRankSum=-1.813;SOR=0.652      GT:AD:ADO:DP:GQ:PL  0/1:144,148:0:292:99:3740

Thank you very much

vcf awk grep bash • 6.0k views

ADD COMMENT • link updated 6.7 years ago by dariober 15k • written 6.7 years ago by ajuiwl ▴ 30

2

Entering edit mode

Hello ajuiwl ,

why do you want to use awk and not a program that is already specilized on parsing and filtering a vcf like bcftools view?

If you realy need to use awk you will have to use something like match() and define a regex pattern.

fin swimmer

ADD REPLY • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

Some programs change the way variants are normalized, the header, require installations, require dependencies, and other issues that bash code can avoid, moreover I have more control over what I do.

ADD REPLY • link 6.7 years ago by ajuiwl ▴ 30

1

Entering edit mode

Hello again,

I like bash as well. But when it comes to specific tasks where already widely used program exists, I strongly recommend using this. If reinventing the wheel, there is a high chance that you miss something very important. See the discussion about the perl solution below. The problems discussed there are valid for your solution as well.

I would suggest you take a look at bioconda. Using this you will have (nearly) no problems with installations and dependencies. (See also the first part of this tutorial a have written: Creating workflows with snakemake and conda )

fin swimmer

ADD REPLY • link 6.7 years ago by finswimmer 16k

1

Entering edit mode

That is true, I just found that there are some variants in the vcf that didn't have INFO about QD nor DP, and the commands I did didn't considered this. So I agree with you using already made software in that sense. In addition I will add that sometimes you are working in a server that requires the software to be installed by an admin and it may take some time to get it installed. And that doing it by yourself, you spend more time but you can discover things that you had not considered before, increasing your knowledge about the topic. Thank you for your time!!

ADD REPLY • link 6.7 years ago by ajuiwl ▴ 30

0

Entering edit mode

Hi, try this line; perl -ne 'print $_ if /QD=(\d+)/ && $1>8' test.txt |less -S

ADD REPLY • link updated 6.7 years ago by finswimmer 16k • written 6.7 years ago by afli ▴ 190

0

Entering edit mode

This answer does not consider decimal positions, in the way that 8.1 won't be kept.

ADD REPLY • link 6.7 years ago by ajuiwl ▴ 30

0

Entering edit mode

Sorry, but this is easy to solve. perl -ne 'print $_ if /QD=(\d+(\.\d+)?)/ && $1>8' test.txt |less -S

ADD REPLY • link 6.7 years ago by afli ▴ 190

0

Entering edit mode

Just a couple of notes, if I'm not mistaken... the perl script will not interpret correctly tags ending in QD, e.g. a line with XQD=10;QD=2 will be incorrectly returned. Also, negative numbers are not parsed correctly. If you filter for QD > -10, a line with QD=-1 will be missed.

ADD REPLY • link 6.7 years ago by dariober 15k

0

Entering edit mode

OK, let's modify this further: perl -ne 'print $_ if /;QD=(-?\d+(\.\d+)?)/ && $1>8' test.txt |less -S

ADD REPLY • link 6.7 years ago by afli ▴ 190

0

Entering edit mode

Sorry to be a pain... This fails if the QD tag is the first one in the INFO field since in such case it will not be preceded by ;.

ADD REPLY • link 6.7 years ago by dariober 15k

0

Entering edit mode

Never mind. If so, you can add another '?' : perl -ne 'print $_ if /;?QD=(-?\d+(\.\d+)?)/ && $1>8' test.txt |less -S

It is just a solution for simple data as supplied in the post, with more complex data, we can do some modification of this, or use other better methods as below.

ADD REPLY • link 6.7 years ago by afli ▴ 190

0

Entering edit mode

A simpler yet general solution would be this one:

perl -ne 'print if /QD=([-\d.]+)/ and $1>8' file.vcf

If you don't want to lose the headers in the filtering process, then you should go for this one:

perl -ne 'print if /^#/ or (/QD=([-\d.]+)/ and $1>8)' file.vcf

Although I definitely would go for bcftools for VCF filtering.

ADD REPLY • link 4.6 years ago by Jorge Amigo 14k

2

Entering edit mode

6.7 years ago

dariober 15k

If you want a pure awk solution, what about this?

awk -v FS="\t" '{
    qd=$8; 
    sub(/(.*;QD=)|(QD=)/, "", qd); 
    sub(/;.*/, "", qd); 
    qd= qd+0; 
    if($1 ~ "^#" || qd > 8) {
        print $0
    }
}' $FILE

However, I would go for bcftools view as suggested by @finswimmer.

ADD COMMENT • link 6.7 years ago by dariober 15k

score 2 · Accepted Answer · 2018-11-06

2

Entering edit mode

6.7 years ago

ajuiwl ▴ 30

Thank you for your answers, I found the way of doing it with:

grep -Po 'QD=.*' $FILE | cut -d";" -f1 | sed -r 's/[QD\=]//g' | paste -d'\t' $FILE  - | awk 'BEGIN{OFS="\t"}{if ($11>8) {$11=""; print}}'

ADD COMMENT • link 6.7 years ago by ajuiwl ▴ 30

1

Entering edit mode

Hello ajuiwl,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

code_formatting

Thank you!

ADD REPLY • link 6.7 years ago by finswimmer 16k