How do I remove polymorphisms that failed filtering from a vcf file?
3
1
Entering edit mode
4.9 years ago

I have a set of vcf files that were filtered using GATK hard filtering. I filtered the snps and then the indels seperately and then merged the two and got a vcf with a list of polymorphisms in which the polymorphisms that had failed the filters weere marked as such (of course) Now I would like to make a vcf that lacks the snps and Indels that failed the filtering. What command should I run.

next-gen SNP genome • 3.7k views
2
Entering edit mode
awk '/^#/||\$7=="PASS"' in.vcf > out.vcf


Very basic programming skills may make a huge difference. Learn programming.

0
Entering edit mode

What command should I run

A better approach would be "What logic/process/algorithm would be ideal here?"

0
Entering edit mode

yes again you are correct; thanks!

2
Entering edit mode
4.9 years ago
Ram 32k

If you're referring to the PASS flag, you can use anything from vcftools to plain awk to plainer grep. If there's more to the PASS criteria than just the flag, you're going to need to elaborate on that.

Also, do you mean variants when you say polymorphisms? I know people may use the terms interchangeably, but they do not mean the same thing. Variants are the most generic descriptor; they refer to all loci where multiple alleles are found, whereas polymorphisms assume the functional impact of the variant to result in a polymorphic phenotype that is not usually pathogenic.

0
Entering edit mode

thanks; yes I meant variants

1
Entering edit mode
4.9 years ago
ivivek_ngs ★ 5.1k

There are different ways to handle the filtering of the variants from the GATK vcf file once you SelectVariants for snp and indels . Having said that snp does not mean the true sense of term. They are actually point variants. You can use a multiple number of hard filtering strategies based on the distribution of DP,SB, QUAL ,QD, FS ,MQ , MQRankSum, ReadPosRankSum , DP scores, you can estimate the distribution of these scores and then use hard filtering on them to asses the high quality variants from your data , either you can take all of them in consideration or some of them to filter out high quality variants. The GATK handle which you use is -VariantFiltration and -filterEpxression and the ones in the new vcf shows PASS are the ones that goes downstream rest are filtered with your filtering strategy. So select them with awk or grep or when vcftools as @Ram said. These variants can farther be annotated to associate them with functionality or structural impact scores with ANNOVAR, VAAST,VEP or CRAVAT. I hope this answers you queries.

0
Entering edit mode

Thanks for the comments:# I used the -VariantFiltration walker as you suggested and followed the guidelines on the site for the indels: --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" --filterName "my_snp_filter" and for the snps --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "my_snp_filter"

it looks OK; I lost a few: mainly snps from the MQ filter. Its a 15Mb haploid genome that I am working on. I have looked the variants in the bam files using igv and it all looks sound; the reads are almost always 100% consistent.

0
Entering edit mode

It is good to know that they are working now for you, am sure now you can just streamline to the variants with PASS filters and proceed with annotation. As a matter of fact I would ask you to kindly upvote and accept the answers that seemed to be beneficial for you so that others can use it for future reference.

1
Entering edit mode
4.9 years ago

Ram has already pointed out the solution. this is just to expand that answer a little bit and to provide you with a few examples since you are interested in the particular commands needed.

after a GATK's filtering process, the FILTER column gets filled with labels indicating whether the variant fulfilled all the requirements (the label would be PASS) or not (the label would be any other). removing variants that didn't reach the hard filters' thresholds is the same as filtering PASS only variants, so you can use a generic tool for parsing text (grep, sed, awk,...) or a tool that deals with vcf files natively (vcftools or bcftools for instance).

an example of the first ones would be the following:

grep ^# file.vcf > file.filtered.vcf
grep PASS file.vcf >> file.filtered.vcf


an example of the second ones would be the following:

bcftools view -Oz -f .,PASS file.vcf.gz > file.filtered.vcf.gz


note that bcftools requires the vcf file to be previously bgzip compressed and tabix indexed. if filtering by PASS label is all you need you may probably prefer to use the simple yet fast text parsing options, but have in mind that bcftools is very fast too (faster than vcftools indeed) plus it allows to build your requirements for the filtered file very easily, even if those requirements are complex.

0
Entering edit mode

Thanks Jorge, that was helpful, I wanted to use vcftools but when I looked here https://vcftools.github.io/perl_module.html I found the documentation for the different tools hard to understand. I wanted something like samtools view or bcftools view as you suggested but it was not really obvious which was the right one. In the end I ended up grepping the files but I lost the header which is now causing problems so I would be grateful for some idea as to where some good examples of vcftools can be found. The GATK web site describing their walkers seems perfect. Thanks again William

0
Entering edit mode

you're probably not using the concatenation ">>" on the second grep, so the headers get overwritten. Heng's awk solution is pretty straightforward and very recommendable.