Question: Filter genotype in multi-sample VCF file
3.5 years ago by
guillaume.rbt430
France
guillaume.rbt430 wrote:

Hello,

I've done multi-sample SNP calling with samtools mpileup, and my results are in a VCF file.

I would like to keep only the SNPs detected in all of my samples, i.e. sites where no sample has GT = 0/0 (homozygous for the reference allele).

Does anyone know a tool that could do it?

 

Thanks!

snp • 5.8k views
modified 3.5 years ago by Jorge Amigo10k • written 3.5 years ago by guillaume.rbt430
vcffilter -g "! (GT = 0/0 )"  Input.vcf
modified 3.2 years ago • written 3.5 years ago by Rm7.6k

Um, what is the vcffilter binary a part of?

EDIT: Never mind, I found it on GitHub: it is part of vcflib.

modified 3.5 years ago • written 3.5 years ago by Ram15k
3.5 years ago by
Jorge Amigo10k
Santiago de Compostela, Spain
Jorge Amigo10k wrote:

I usually choose vcftools or bcftools to deal with VCF files, but in your case a simple grep would do:

grep -v "0[/|]0" <in.vcf >out.vcf
modified 3.5 years ago • written 3.5 years ago by Jorge Amigo10k

Indeed, a grep works just fine, good idea! Thank you all.

written 3.5 years ago by guillaume.rbt430

Usually VCF files call for VCF software, although sometimes you just have to (carefully) go for the simplest solution, and simple bash one-liners can be of great help. In this particular case, grep should even be faster than any dedicated VCF-handling software.

written 3.5 years ago by Jorge Amigo10k

I agree with the approach, but it would be prudent to watch out for false matches. We are, after all, just matching string expressions and not ensuring that the 0[/|]0 matches the genotype data specifically.

written 3.5 years ago by Ram15k

In a properly formatted VCF file, the "0[/|]0" pattern will only match reference homozygotes in columns 10 and onward. There's no way that pattern can appear in the headers or in columns 1-9, which is why grep -v "0[/|]0" is so convenient if you want to extract all the variant sites where every sample varies: it not only filters out exactly what you don't want, it also outputs a properly formatted VCF file. If you want to be stricter, you could use a slightly more complex pattern such as "0[/|]0:", provided you are sure that the genotype format is always "GT:" plus something else, which may not always be the case. Only the GT field is mandatory in the VCF format.

modified 3.5 years ago • written 3.5 years ago by Jorge Amigo10k
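
For anyone who wants the field-aware check discussed in this sub-thread rather than a plain string match, here is a minimal awk sketch. It is not from any of the tools mentioned here; the function name drop_hom_ref_sites is made up for illustration. It assumes a tab-delimited VCF with FORMAT in column 9 and samples from column 10 on, looks up where GT sits in FORMAT, and drops a record only when some sample's GT is exactly 0/0 or 0|0. Header lines are passed through untouched, so a header that happens to contain "0/0" (for example inside a recorded file path) is not lost. Like the grep, it treats missing genotypes (./.) as acceptable.

```shell
# Hypothetical helper: keep header lines plus records where no sample is 0/0 or 0|0.
# Reads the files given as arguments, or stdin when none are given.
drop_hom_ref_sites() {
  awk -F'\t' '
    /^#/ { print; next }                 # pass header lines through unchanged
    {
      # find the position of GT inside the FORMAT column (column 9)
      n = split($9, fmt, ":")
      gt = 0
      for (i = 1; i <= n; i++) if (fmt[i] == "GT") gt = i
      if (gt == 0) { print; next }       # no GT field: leave the record alone

      for (s = 10; s <= NF; s++) {
        split($s, f, ":")
        # one hom-ref sample is enough to drop the whole site
        if (f[gt] == "0/0" || f[gt] == "0|0") next
      }
      print
    }
  ' "$@"
}
```

Usage would look like: drop_hom_ref_sites in.vcf > out.vcf. It will be slower than grep on large files, so this is only worth it when false matches are a real concern.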
3.5 years ago by
Ram15k
New York
Ram15k wrote:

vcftools should help you out.

Here are the genotype filtering options in vcftools.

modified 3.5 years ago • written 3.5 years ago by Ram15k

Is there any way to keep SNPs that are present in at least n samples?

written 3.3 years ago by geek_y8.5k
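
Nobody answered this in the thread, but one possible approach is an awk sketch along the following lines. The function name keep_min_variant_samples and its first argument (the minimum number of samples) are invented for illustration. It assumes a tab-delimited VCF with FORMAT in column 9, and counts a sample as carrying the SNP when its GT contains at least one non-zero allele index, so 0/0 and missing genotypes (./.) do not count.

```shell
# Hypothetical helper: keep records where at least N samples carry a non-reference allele.
# Usage: keep_min_variant_samples N [file.vcf ...]   (reads stdin when no file is given)
keep_min_variant_samples() {
  min_samples=$1
  shift
  awk -F'\t' -v min="$min_samples" '
    /^#/ { print; next }                 # pass header lines through unchanged
    {
      # find the position of GT inside the FORMAT column (column 9)
      n = split($9, fmt, ":")
      gt = 0
      for (i = 1; i <= n; i++) if (fmt[i] == "GT") gt = i
      if (gt == 0) next                  # cannot count carriers without a GT field

      carriers = 0
      for (s = 10; s <= NF; s++) {
        split($s, f, ":")
        # any digit 1-9 in GT means at least one non-reference allele
        if (f[gt] ~ /[1-9]/) carriers++
      }
      if (carriers >= min) print
    }
  ' "$@"
}
```

For example, keep_min_variant_samples 3 in.vcf > out.vcf would keep sites where three or more samples are heterozygous or homozygous for an alternate allele.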
3.5 years ago by
iraun3.3k
Norway
iraun3.3k wrote:

If you are not very familiar with programming (this problem can be solved with awk/perl scripting), I suggest the SnpSift tool (http://snpeff.sourceforge.net/SnpSift.html#filter). In my opinion it is a user-friendly, very intuitive tool, with great examples on its web page to guide you. You can filter VCF files according to your needs.

written 3.5 years ago by iraun3.3k