Question: How do I remove polymorphisms that failed filtering from a vcf file?
gravatar for william.brown
3.0 years ago by
william.brown10 wrote:

I have a set of vcf files that were filtered using GATK hard filtering. I filtered the snps and then the indels seperately and then merged the two and got a vcf with a list of polymorphisms in which the polymorphisms that had failed the filters weere marked as such (of course) Now I would like to make a vcf that lacks the snps and Indels that failed the filtering. What command should I run.

snp next-gen genome • 2.3k views
ADD COMMENTlink modified 3.0 years ago by Jorge Amigo11k • written 3.0 years ago by william.brown10
awk '/^#/||$7=="PASS"' in.vcf > out.vcf

Very basic programming skills may make a huge difference. Learn programming.

ADD REPLYlink written 3.0 years ago by lh331k

What command should I run

A better approach would be "What logic/process/algorithm would be ideal here?"

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by RamRS21k

yes again you are correct; thanks!

ADD REPLYlink written 3.0 years ago by william.brown10
gravatar for RamRS
3.0 years ago by
Houston, TX
RamRS21k wrote:

If you're referring to the PASS flag, you can use anything from vcftools to plain awk to plainer grep. If there's more to the PASS criteria than just the flag, you're going to need to elaborate on that.

Also, do you mean variants when you say polymorphisms? I know people may use the terms interchangeably, but they do not mean the same thing. Variants are the most generic descriptor; they refer to all loci where multiple alleles are found, whereas polymorphisms assume the functional impact of the variant to result in a polymorphic phenotype that is not usually pathogenic.

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by RamRS21k

thanks; yes I meant variants

ADD REPLYlink written 3.0 years ago by william.brown10
gravatar for ivivek_ngs
3.0 years ago by
Seattle,WA, USA
ivivek_ngs4.8k wrote:

There are different ways to handle the filtering of the variants from the GATK vcf file once you SelectVariants for snp and indels . Having said that snp does not mean the true sense of term. They are actually point variants. You can use a multiple number of hard filtering strategies based on the distribution of DP,SB, QUAL ,QD, FS ,MQ , MQRankSum, ReadPosRankSum , DP scores, you can estimate the distribution of these scores and then use hard filtering on them to asses the high quality variants from your data , either you can take all of them in consideration or some of them to filter out high quality variants. The GATK handle which you use is -VariantFiltration and -filterEpxression and the ones in the new vcf shows PASS are the ones that goes downstream rest are filtered with your filtering strategy. So select them with awk or grep or when vcftools as @Ram said. These variants can farther be annotated to associate them with functionality or structural impact scores with ANNOVAR, VAAST,VEP or CRAVAT. I hope this answers you queries.

ADD COMMENTlink written 3.0 years ago by ivivek_ngs4.8k

Thanks for the comments:# I used the -VariantFiltration walker as you suggested and followed the guidelines on the site for the indels: --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" --filterName "my_snp_filter" and for the snps --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "my_snp_filter"

it looks OK; I lost a few: mainly snps from the MQ filter. Its a 15Mb haploid genome that I am working on. I have looked the variants in the bam files using igv and it all looks sound; the reads are almost always 100% consistent.

ADD REPLYlink written 3.0 years ago by william.brown10

It is good to know that they are working now for you, am sure now you can just streamline to the variants with PASS filters and proceed with annotation. As a matter of fact I would ask you to kindly upvote and accept the answers that seemed to be beneficial for you so that others can use it for future reference.

ADD REPLYlink written 3.0 years ago by ivivek_ngs4.8k
gravatar for Jorge Amigo
3.0 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

Ram has already pointed out the solution. this is just to expand that answer a little bit and to provide you with a few examples since you are interested in the particular commands needed.

after a GATK's filtering process, the FILTER column gets filled with labels indicating whether the variant fulfilled all the requirements (the label would be PASS) or not (the label would be any other). removing variants that didn't reach the hard filters' thresholds is the same as filtering PASS only variants, so you can use a generic tool for parsing text (grep, sed, awk,...) or a tool that deals with vcf files natively (vcftools or bcftools for instance).

an example of the first ones would be the following:

grep ^# file.vcf > file.filtered.vcf
grep PASS file.vcf >> file.filtered.vcf

an example of the second ones would be the following:

bcftools view -Oz -f .,PASS file.vcf.gz > file.filtered.vcf.gz

note that bcftools requires the vcf file to be previously bgzip compressed and tabix indexed. if filtering by PASS label is all you need you may probably prefer to use the simple yet fast text parsing options, but have in mind that bcftools is very fast too (faster than vcftools indeed) plus it allows to build your requirements for the filtered file very easily, even if those requirements are complex.

ADD COMMENTlink written 3.0 years ago by Jorge Amigo11k

Thanks Jorge, that was helpful, I wanted to use vcftools but when I looked here I found the documentation for the different tools hard to understand. I wanted something like samtools view or bcftools view as you suggested but it was not really obvious which was the right one. In the end I ended up grepping the files but I lost the header which is now causing problems so I would be grateful for some idea as to where some good examples of vcftools can be found. The GATK web site describing their walkers seems perfect. Thanks again William

ADD REPLYlink written 3.0 years ago by william.brown10

you're probably not using the concatenation ">>" on the second grep, so the headers get overwritten. Heng's awk solution is pretty straightforward and very recommendable.

ADD REPLYlink written 3.0 years ago by Jorge Amigo11k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 822 users visited in the last hour