Question: vcftools: unwanted filtering
9 months ago by
Molly_K40
United States
Molly_K40 wrote:

I am using vcftools to break down a large VCF file into smaller per-chromosome files using:

for i in `seq 1 22`; do vcftools --gzvcf ~/path_to_large.vcf.gz --chr "$i" --out ~/path_to_small_vcf --recode; done

This is the message I got after running this command (using chr22 as an example):

VCFtools - 0.1.15 (C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
    --gzvcf /path_to_large_vaf/large.vcf.gz
    --chr 22
    --out /path_to_small_vcr/
    --recode

Using zlib version: 1.2.8
After filtering, kept 1000 out of 1000 Individuals
Outputting VCF file...
After filtering, kept 72353 out of a possible 2825214 Sites
Run Time = 987.00 seconds

I got results split by chromosome, but I noticed that a huge number of variants were filtered out of the original 2825214 sites (only 72353 remained). I did not specify any filtering criteria in the command, so what could be causing this filtering?

A little more about the VCF file used:

##fileformat=VCFv4.2
##source=PLINKv1.90

snp vcftools • 438 views
modified 9 months ago • written 9 months ago by Molly_K40

Most programs will have default values for various program options (these would generally be listed in in-line help or in manuals). Perhaps one of the values is causing the filtering of the data here.

written 9 months ago by genomax59k

@genomax, thank you for the suggestion. I did read the vcftools manual, but there isn't anything describing default filtering values. I thought I was just breaking the large file down into smaller files, so it shouldn't be doing any filtering. I read through the options carefully but didn't see anything. http://vcftools.sourceforge.net/man_latest.html

written 9 months ago by Molly_K40

I only mentioned that since people tend to overlook that fact at times. Sounds like this observation still needs a logical explanation. Is there a log file (other than the message above) to check through to see why those SNPs were filtered?

modified 9 months ago • written 9 months ago by genomax59k

Hi Molly_K,

No need to delete a question after you received (helpful) answers!

Cheers,
Wouter

written 9 months ago by WouterDeCoster35k

I realized it was a misunderstanding on my end, and not even a good question, so I deleted it. But if someone has the same doubt when interpreting the results, this post may be helpful :P

written 9 months ago by Molly_K40

That basically is the idea (for not deleting posts once they have received comments/answers).

modified 9 months ago • written 9 months ago by genomax59k
9 months ago by
Molly_K40
United States
Molly_K40 wrote:

I just realized that the larger number is the total number of sites in the input file. The sites that were "filtered out" are simply the SNPs that are not on the specified chromosome, so they are not relevant, lol. Thanks so much for thinking with me. I checked the chr22 output file against the original: if I just use awk to pull chr22 out of the original, the row counts are the same, ha. I will do it for a few other chromosomes too.

written 9 months ago by Molly_K40
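
The check described in the answer above can be sketched in a few lines of shell. This is only an illustration and not part of the original thread: `count_chr` is a hypothetical helper name, and the demo file is a tiny made-up VCF that exists just to make the example runnable. It counts the data records whose first column (CHROM) matches the requested chromosome, which should equal the "kept" number vcftools reports when `--chr` is the only restriction.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a vcftools feature): count non-header records
# on one chromosome. Usage: count_chr <file.vcf | file.vcf.gz> <chromosome>
count_chr() {
    local vcf=$1 chrom=$2
    # Decompress only when the file name ends in .gz; otherwise read as-is.
    case $vcf in
        *.gz) gzip -cd "$vcf" ;;
        *)    cat "$vcf" ;;
    esac | awk -F'\t' -v c="$chrom" '
        # Skip header lines; compare CHROM as a string so "X" or "chr22" work.
        !/^#/ && ($1 "") == (c "") { n++ }
        END { print n + 0 }'
}

# Made-up three-record VCF, for demonstration only.
demo=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '21\t100\trs1\tA\tG\t.\t.\t.\n'
    printf '22\t200\trs2\tC\tT\t.\t.\t.\n'
    printf '22\t300\trs3\tG\tA\t.\t.\t.\n'
} > "$demo"

total=$(grep -vc '^#' "$demo")   # every site in the file ("possible" sites)
kept=$(count_chr "$demo" 22)     # sites on chromosome 22 only
echo "chr22: $kept out of a possible $total sites"
```

On the real data, the equivalent call would be `count_chr ~/path_to_large.vcf.gz 22`, and the result would be expected to match the 72353 sites reported in the log above. That figure is taken from the post and is not re-verified here.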
Powered by Biostar version 2.3.0