Hello everyone,
I have a whole folder of VCFs generated by GATK CombineVariants. I want to remove variants (entire rows) containing the strings ":R" or ":F" (but not ":FR") in the INFO column. What is the best way to do this?
You could use a simple AWK command (assuming the INFO column is the 8th column):
`awk '$8!~/:R/ && $8!~/:F[^R]*$/' FILE.vcf > FILE_updated.vcf`
This removes every line whose INFO field contains :R or :F (but keeps :FR).
If you want to do it for all your VCF files:
`for file in *.vcf; do awk '$8!~/:R/ && $8!~/:F[^R]*$/' "$file" > "${file%.vcf}_updated.vcf"; done`
Iterates through all the .vcf files in your current directory.
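For anyone who wants to re-run the loop safely, here is a minimal sketch wrapped in a function. The function name `filter_dir`, the demo file names and the demo records are made up for illustration. It uses the same awk pattern as above, quotes the file names so paths with spaces work, and skips files that already end in `_updated.vcf`.

```shell
# Hypothetical helper: filter every .vcf in a directory into NAME_updated.vcf.
filter_dir() {
  dir=$1
  for file in "$dir"/*.vcf; do
    [ -e "$file" ] || continue                     # no .vcf files: the glob stays literal
    case $file in *_updated.vcf) continue ;; esac  # skip outputs of earlier runs
    awk '$8!~/:R/ && $8!~/:F[^R]*$/' "$file" > "${file%.vcf}_updated.vcf"
  done
}

# Demo on a throwaway directory with made-up records (not real data).
demo=$(mktemp -d)
printf 'chr1\t100\t.\tA\tG\t50\tPASS\tset=x:FR\n' > "$demo/a.vcf"
printf 'chr1\t200\t.\tC\tT\t50\tPASS\tset=x:F\n'  > "$demo/b b.vcf"
filter_dir "$demo"
ls "$demo"
```

Running it a second time on the same directory does not create `*_updated_updated.vcf` files, because of the `case` check.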
UPD: I've encountered another problem: the command you wrote leaves rows with :F at the end of the column. Do you have any idea why?
I've updated the AWK command to take that into account. awk interprets '[^R]' as "any single character that isn't R". So if ':F' is at the end of the field, the earlier pattern did not match it, because it expected one more character that isn't there. I fixed this by writing '[^R]*$' instead. The asterisk stands for "0 or more" and the dollar sign stands for the end of the field. The pattern therefore matches ':F' at the end of the field, or ':F' followed only by characters other than 'R' up to the end of the field, while ':FR' is left alone. Note that a ':F' that has an 'R' anywhere after it in the field does not match either.
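Because '[^R]*$' has to run all the way to the end of the field, a ':F' followed later by any 'R' slips through. A possible stricter variant, as a sketch only: it assumes tab-delimited input and that header lines (starting with '#') should be kept as they are. The function name `filter_vcf`, the `TAG=R1` key and all records below are made up for illustration. The pattern ':F([^R]|$)' means "':F' followed by one character that isn't R, or by the end of the field".

```shell
# Build a tiny tab-delimited VCF-like file with made-up records (illustrative only).
tmp=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\tset=x:FR\n'        # keep: :FR
  printf 'chr1\t200\t.\tC\tT\t50\tPASS\tset=x:F\n'         # drop: :F at end of field
  printf 'chr1\t300\t.\tG\tA\t50\tPASS\tset=x:R\n'         # drop: :R
  printf 'chr1\t400\t.\tT\tC\t50\tPASS\tset=x:F;TAG=R1\n'  # drop: :F with an R later on
  printf 'chr1\t500\t.\tT\tC\t50\tPASS\tDP=10\n'           # keep: neither tag
} > "$tmp/demo.vcf"

# Hypothetical helper: keep header lines, drop records whose INFO (column 8)
# contains :R, or :F that is not immediately followed by R.
filter_vcf() {
  awk -F'\t' '/^#/ || ($8 !~ /:R/ && $8 !~ /:F([^R]|$)/)' "$1"
}

# Positions of the records that survive the filter.
kept=$(filter_vcf "$tmp/demo.vcf" | grep -v '^#' | cut -f2 | paste -sd, -)
echo "$kept"   # prints 100,500
```

With the '[^R]*$' pattern, the record at position 400 would be kept, since the 'R' in `TAG=R1` stops the match before the end of the field.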