I have a set of vcf files that were filtered using GATK hard filtering. I filtered the snps and the indels separately, then merged the two and got a vcf with a list of polymorphisms in which those that had failed the filters were marked as such (of course). Now I would like to make a vcf that lacks the snps and indels that failed the filtering. What command should I run?
If you're referring to the PASS flag, you can use anything from vcftools to plain awk to plainer grep. If there's more to the PASS criteria than just the flag, you're going to need to elaborate on that.
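As one possible sketch with plain awk, assuming an uncompressed, tab-delimited vcf in which FILTER is the seventh column (as the VCF format defines it): keep every header line plus the records whose FILTER field is exactly PASS. The file names, the lowQD label and the tiny example records are made up for illustration.

```shell
# Build a tiny example VCF (hypothetical records, for illustration only).
printf '##fileformat=VCFv4.2\n' > example.vcf
printf '##FILTER=<ID=PASS,Description="All filters passed">\n' >> example.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> example.vcf
printf 'chr1\t100\t.\tA\tG\t50\tPASS\tQD=20.1\n' >> example.vcf
printf 'chr1\t200\t.\tC\tT\t12\tlowQD\tQD=1.2\n' >> example.vcf
printf 'chr1\t300\t.\tG\tGA\t60\tPASS\tQD=25.0\n' >> example.vcf

# Keep header lines (starting with '#') and records whose 7th column is exactly PASS.
awk -F '\t' '/^#/ || $7 == "PASS"' example.vcf > example.pass.vcf
```

Matching on the column itself, rather than on the word PASS anywhere in the line, avoids picking up header lines or annotations that happen to contain the same string.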
Also, do you mean variants when you say polymorphisms? I know people may use the terms interchangeably, but they do not mean the same thing.
Variants are the most generic descriptor; they refer to all loci where multiple alleles are found, whereas polymorphisms assume that the functional impact of the variant results in a polymorphic phenotype that is not usually pathogenic.
There are different ways to handle the filtering of the variants from the GATK vcf file once you have your snps and indels. Having said that, snp is not meant here in the true sense of the term; they are actually point variants. You can use a number of hard filtering strategies based on the distributions of the DP, SB, QUAL, QD, FS, MQ, MQRankSum and ReadPosRankSum scores: estimate the distribution of each score and then apply hard filters on them to select the high quality variants from your data. You can take all of them into consideration or only some of them. The GATK option to use is -filterExpression; the variants marked PASS in the new vcf are the ones that go downstream, and the rest are filtered out by your filtering strategy. So select them with grep or vcftools, as @Ram said. These variants can further be annotated to associate them with functional or structural impact scores using ANNOVAR, VAAST, VEP or CRAVAT. I hope this answers your queries.
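To look at the distribution of one annotation before choosing a threshold, a rough awk sketch can pull its values out of the INFO column. It assumes an uncompressed vcf with INFO in column 8 and a QD=<number> entry in each record; the records and file names below are invented for illustration.

```shell
# Tiny example VCF (hypothetical records, for illustration only).
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > qd_example.vcf
printf 'chr1\t100\t.\tA\tG\t50\t.\tDP=30;QD=20.5;FS=1.0\n' >> qd_example.vcf
printf 'chr1\t200\t.\tC\tT\t12\t.\tDP=8;QD=1.5;FS=70.2\n' >> qd_example.vcf
printf 'chr1\t300\t.\tG\tA\t60\t.\tDP=41;QD=26.0;FS=0.0\n' >> qd_example.vcf

# For each record, split INFO (column 8) on ';' and print the QD value,
# one per line, then sort the values numerically.
awk -F '\t' '!/^#/ {
    n = split($8, kv, ";")
    for (i = 1; i <= n; i++) {
        if (kv[i] ~ /^QD=/) print substr(kv[i], 4)
    }
}' qd_example.vcf | sort -n > qd_values.txt
```

The sorted values in qd_values.txt can then be plotted or summarised to decide on a cutoff; the same pattern should work for the other annotations by changing the key name and the offset passed to substr.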
Ram has already pointed out the solution. This is just to expand that answer a little bit and to provide you with a few examples, since you are interested in the particular commands needed.
After GATK's filtering process, the FILTER column gets filled with labels indicating whether the variant fulfilled all the requirements (the label would be PASS) or not (the label would be any other). Removing variants that didn't reach the hard filters' thresholds is the same as keeping PASS-only variants, so you can use a generic tool for parsing text (awk, ...) or a tool that deals with vcf files natively (bcftools, for instance).
An example of the first ones would be the following:
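# copy the header lines
grep '^#' file.vcf > file.filtered.vcf
# append the records labelled PASS (header lines are skipped so they are not duplicated)
grep -v '^#' file.vcf | grep -w PASS >> file.filtered.vcf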
An example of the second ones would be the following:
bcftools view -Oz -f .,PASS file.vcf.gz > file.filtered.vcf.gz
As written, the bcftools command takes a vcf file that has been bgzip compressed and tabix indexed beforehand. If filtering by PASS label is all you need, you may prefer to use the simple yet fast text parsing options, but keep in mind that bcftools is very fast too (faster than vcftools, indeed), plus it lets you build your requirements for the filtered file very easily, even if those requirements are complex.
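Whichever route you take, a quick sanity check is to tally the FILTER labels before and after subsetting. The following is only a sketch with awk, assuming an uncompressed vcf with FILTER in column 7; the file names, records and filter labels (lowQD, highFS) are hypothetical.

```shell
# Tiny example VCF (hypothetical records and filter names, for illustration only).
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > tally_example.vcf
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n' >> tally_example.vcf
printf 'chr1\t200\t.\tC\tT\t12\tlowQD\t.\n' >> tally_example.vcf
printf 'chr1\t300\t.\tG\tGA\t60\tPASS\t.\n' >> tally_example.vcf
printf 'chr1\t400\t.\tT\tC\t9\tlowQD;highFS\t.\n' >> tally_example.vcf

# Count how many records carry each FILTER value (column 7), skipping header lines.
# Output format: <count><TAB><FILTER value>
awk -F '\t' '!/^#/ { count[$7]++ } END { for (f in count) print count[f] "\t" f }' \
    tally_example.vcf | sort -k2 > filter_counts.txt
```

Running the same tally on the filtered file should leave only PASS (and ".", if unfiltered records were kept), which makes it easy to confirm that nothing else slipped through.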