Remove variants from VCF by INFO tag
8.4 years ago
bioroma.spb ▴ 50

Hello everyone,

I have a whole folder of VCFs generated by GATK CombineVariants. I want to remove variants (entire rows) whose INFO column contains the string ":R" or ":F" (but not ":FR"). What is the best way to do this?

vcf GATK awk • 2.7k views
8.4 years ago

You could use a simple AWK command (assuming the INFO column is the 8th column):

`awk '$8!~/:R/ && $8!~/:F[^R]*$/' FILE.vcf > FILE_updated.vcf`

This removes every line whose INFO field contains :R or :F (lines with :FR are kept).
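A slightly more defensive variant is sketched here, under a few assumptions that are not in the original answer: the file is tab-delimited as in standard VCF, header lines (starting with `#`) should always be kept, and ":F" should be rejected whenever it is not immediately followed by "R", wherever it sits in the field. The function name `filter_vcf` is hypothetical.

```shell
# Hypothetical helper: drop records whose INFO field (column 8) contains ":R",
# or ":F" that is not immediately followed by "R".
filter_vcf() {
    awk -F'\t' '
        /^#/ { print; next }                 # pass header lines through unchanged
        $8 !~ /:R/ && $8 !~ /:F([^R]|$)/     # print only records that pass both tests
    ' "$1"
}
```

It could be called as `filter_vcf FILE.vcf > FILE_updated.vcf`. Splitting on tabs keeps header lines that contain spaces from being miscounted as having an eighth field.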

If you want to do it for all your VCF files:

`for file in *.vcf; do awk '$8!~/:R/ && $8!~/:F[^R]*$/' "$file" > "${file%.vcf}_updated.vcf"; done`

This iterates over all the .vcf files in your current directory and writes a filtered copy of each one with the suffix _updated.vcf.


Thank you! Problem solved.


UPD: I've run into another problem: the command you wrote leaves rows with :F at the end of the column. Do you have any idea why?


I've updated the AWK command to take that into account. awk interprets '[^R]' as "any single character that isn't R". So if ':F' is at the end of the field, the pattern does not match and the line is not excluded, because the regex expects a character that isn't there. I have fixed this by writing '[^R]*$' instead. The asterisk stands for "0 or more" and the dollar sign stands for the end of the field. The pattern therefore matches ':F' at the end of the field, or ':F' followed only by characters other than R up to the end of the field, and those lines are removed. Note that it does not match if an R appears anywhere between ':F' and the end of the field.
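To see how these patterns behave, here is a small sketch that tests a regex against a sample INFO string. The helper name `matches` and the sample strings are made up for illustration, and the third pattern, `:F([^R]|$)`, is an alternative that is not part of the original command.

```shell
# Hypothetical helper: report whether a sample string matches an awk regex.
matches() {
    # $1 = regex (used as a dynamic regex), $2 = sample INFO text
    printf '%s\n' "$2" | awk -v re="$1" '{ print (($0 ~ re) ? "match" : "no match") }'
}

matches ':F[^R]'     'x:F'               # ":F" at the very end of the field
matches ':F[^R]*$'   'x:F'               # same input, updated pattern
matches ':F[^R]*$'   'x:FR'              # ":FR" should never be flagged
matches ':F[^R]*$'   'x:F;MQRankSum=1'   # an "R" occurs later in the field
matches ':F([^R]|$)' 'x:F;MQRankSum=1'   # alternative: only the next character matters
```

The alternative pattern checks only the character right after ":F" (or the end of the field), so it does not depend on what follows later in the INFO column.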


Thank you again! Now everything works well.
