Filter genotype in multi-sample VCF file
9.5 years ago
guillaume.rbt ★ 1.0k

Hello,

I've done multi-sample SNP calling with samtools mpileup, and my results are in a VCF file.

I would like to keep only the SNPs detected in all of my samples, i.e. sites where no sample has GT = 0/0 (homozygous for the reference allele).

Does anyone know a tool that could do it?

Thanks!

vcffilter -g "! (GT = 0/0 )"  Input.vcf

Um, what is the vcffilter binary a part of?

EDIT: Never mind, I found it on GitHub: it is part of vcflib.

9.5 years ago

I usually choose vcftools or bcftools to deal with VCF files, but in your case a simple grep would do:

grep -v "0[/|]0" <in.vcf >out.vcf

Indeed, a grep works just fine, good idea! Thank you all.


VCF files usually call for dedicated VCF software, but sometimes the simplest solution is the right one (used carefully), and a simple bash one-liner can be a great help. In this particular case, grep should even be faster than any dedicated VCF-handling tool.


I agree with the approach, but it would be prudent to watch out for false matches. We are, after all, just matching string expressions and not ensuring that the 0[/|]0 matches the genotype data specifically.


In a properly formatted VCF file, the "0[/|]0" pattern should only match reference homozygotes in columns 10 and onward. It is not expected to appear in the header or in columns 1-9, which is why grep -v "0[/|]0" is so convenient if you want to extract all the variant sites where every sample varies: it not only filters out exactly what you don't want, it also outputs a properly formatted VCF file. If you want to be stricter, you could use a slightly more complex pattern such as "0[/|]0:", provided you are sure that the genotype format is always "GT:" plus something else, which may not always be the case, since GT may be the only subfield present in the sample columns.
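
For anyone who still worries about false matches, here is a minimal sketch that looks only at the GT subfield of the sample columns (column 10 onward) and passes header lines through untouched. It assumes diploid calls and that GT is the first FORMAT subfield, which the VCF specification requires whenever GT is present. The file names demo.vcf and all_variant.vcf and the sample names S1 and S2 are made up for illustration; the ##reference header deliberately contains the string "0/0" to show a line that a plain grep -v would drop.

```shell
# Hypothetical two-sample demo file; printf writes real tab separators.
{
  printf '##fileformat=VCFv4.2\n'
  printf '##reference=file:///data/run0/0_ref.fa\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t1/1:9\n'
  printf 'chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t0/0:15\t0/1:7\n'
  printf 'chr1\t300\t.\tG\tA\t50\tPASS\t.\tGT:DP\t1|1:8\t0|0:11\n'
} > demo.vcf

# Print every header line as-is. For data lines, skip the record as soon as
# any sample's GT (the text before the first ':') is 0/0 or 0|0.
awk -F'\t' '
  /^#/ { print; next }
  {
    for (i = 10; i <= NF; i++) {
      split($i, f, ":")
      if (f[1] == "0/0" || f[1] == "0|0") next
    }
    print
  }
' demo.vcf > all_variant.vcf
```

Note that missing genotypes (./.) are not treated as reference homozygotes here, so sites with uncalled samples are kept, just as they are with the grep approach.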

9.5 years ago
Ram 43k

vcftools should help you out.

Here are the genotype filtering options in vcftools.


Is there any way to keep SNPs that are present in at least n samples?
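
One possible approach, sketched with awk: count, for each site, the samples whose GT contains at least one non-reference allele, and keep the site when that count reaches n. This assumes "present" means the sample carries a non-reference, non-missing allele, and that GT is the first FORMAT subfield. The threshold variable min_samples, the file names demo_n.vcf and min_carriers.vcf, and the sample names are all hypothetical.

```shell
# Hypothetical three-sample demo file; printf writes real tab separators.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t1/1:9\t0/0:10\n'
  printf 'chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t0/0:15\t0/1:7\t./.:0\n'
  printf 'chr1\t300\t.\tG\tA,C\t50\tPASS\t.\tGT:DP\t1|1:8\t0|1:11\t1|2:6\n'
} > demo_n.vcf

min_samples=2   # illustrative value of n

# Header lines pass through. For data lines, a sample counts as a carrier
# when its GT contains any allele index with a digit 1-9 (so 0/0 and ./.
# do not count, while 0/1, 1/1 and 1|2 do).
awk -F'\t' -v n="$min_samples" '
  /^#/ { print; next }
  {
    carriers = 0
    for (i = 10; i <= NF; i++) {
      split($i, f, ":")
      if (f[1] ~ /[1-9]/) carriers++
    }
    if (carriers >= n) print
  }
' demo_n.vcf > min_carriers.vcf
```

With min_samples=2 the sites at positions 100 and 300 are kept and the site at position 200, which has a single carrier, is dropped.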

9.5 years ago
iraun 6.2k

If you are not very familiar with programming (this problem can be solved with awk/perl scripting), I suggest the SnpSift tool. In my opinion it is a user-friendly, very intuitive tool, with great examples on its web page that can guide you. You can filter VCF files according to your needs.
