Question: Filter VCF on Number of genotyped individuals
1
gravatar for mek362
2.7 years ago by
mek36210
mek36210 wrote:

Hello all,

I would like to know how to filter a VCF by number of individuals with genotype data (i.e. without missing data).

First, some background:

I have a large VCF file containing SNP data for 95 individuals called from bcftools mpileup output. BAMs passed to mpileup contained data for alignments of RNAseq reads to a de novo transcriptome. Thus, the VCF lists SNPs within each contig defined in the reference de novo transcriptome.

The 95 individuals were collected from 6 different populations. For population genetic analysis, I would like to produce a VCF that only contains SNPs at which there are data for at least 75% of individuals in each population (11 or 12).

I have a decent familiarity with bcftools, and know how to filter missing data, but I am not sure how to approach this. As expected, filtering on the N_SAMPLES variable alone doesn't seem to modify the VCF. Perhaps my best course of action would be to first split the VCF into population-specific files?

Any help with this would be greatly appreciated! Please let me know if I've failed to include any useful information.

transcriptome filtering snp vcf • 2.1k views
ADD COMMENTlink modified 2.7 years ago by Brice Sarver2.6k • written 2.7 years ago by mek36210
4
gravatar for Brice Sarver
2.7 years ago by
Brice Sarver2.6k
United States
Brice Sarver2.6k wrote:

VCFtools will do this:

vcftools --recode --vcf your_input.vcf --max-missing (proportion missing above which to filter: [0-1]) --out your_output.vcf

Edit: if you want to use this command per population, you're right: split first.

ADD COMMENTlink modified 2.6 years ago • written 2.7 years ago by Brice Sarver2.6k

Thanks for the answer, Brice. This worked well -- I split the master .vcf by population, filtered using vcftools --recode, and merged filtered population .vcf's using bcftools isec. It may also be useful to note that the --max-missing option excludes sites based on proportion of missing data, where a 0 indicates no filter and a 1 only passes sites with no missing data (i.e. 0.75 filters out sites missing over 25% of data).

ADD REPLYlink written 2.6 years ago by mek36210

Yep, you're correct. I assumed people would peek at the manual for specifics, but I've gone ahead and made the clarification in my post in case others take the call verbatim. Thanks!

ADD REPLYlink written 2.6 years ago by Brice Sarver2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1840 users visited in the last hour