How To Perform A Fst (Fixation Index) Calculation Over A Vcf File Using A Sliding Window?
2
2
Entering edit mode
8.5 years ago
Rubal7 ▴ 800

Hello Everyone,

I have 2 vcf files, each one contains snps from samples from a single different population. I would like to run fst in windows along the genome to compare the populations. I am familiar with using vcf tools and therefore would like to run their weir-fst option. However currently this compares entire vcf files. What would be the best was to run this on smaller windows. I have basic programming skills. I could imagine running it in a loop specifying different predetermined regions. Preferably I would like the output in one file with region coordinates and fst score, for plotting. If there are straightforward ways to do this for VCF format data not using vcf tools I would also be interested.

Rubal

vcf vcftools • 11k views
0
Entering edit mode

You can use the --snps option to select only a subset of SNPs. Thus, you can generate a list of all the SNPs in the two files, split them in sliding windows, and then call vcftools multiple times to calculate Fst for each window. It is a bit of overhead, but it should work..

10
Entering edit mode
8.5 years ago

In principle, it would be better to use another tool to do this. For example, have a look at BioPerl and R. Vcftools is designed to parse and filter data, and the options to calculate Fst are still experimental.

Nevertheless, I like challenges :-) so here is a command that will use the GNU/parallel tool to calculate Fst using vcftools, calling it separately for each window of 7 SNPs:

grep -v '^#' ALG11.vcf | gawk '{print $3}' | parallel -N 7 -j1 "echo {} > param_{#}; vcftools --vcf ALG11.vcf --out ALG11_fst_{#} --hapmap-fst-file second_vcf_file.vcf --snps param_{#}  Explanation: Step1: grep -v '^#' ALG11.vcf | gawk '{print$3}'


This will extract all the SNP id in the file. Replace "ALG11.vcf" with the name of your file.

Step 2:

| parallel -N 7 -j1


This will call the GNU/parallel tool, which here is used to calculate Fst in parallel. Note that this must be the GNU/parallel tool, and not the standard parallel tool that comes installed with coreutils. If you have trouble installing it, I can rewrite the code for xargs, or for the other parallel tool, but I recommend you to install GNU/parallel.

The -N7 option is used to split into windows of 7 SNPs. If you prefer another window size, just use another number here.

Step 3:

"echo {} > param_{#}; vcftools --vcf ALG11.vcf --out ALG11_fst_{#} --hapmap-fst-file second_vcf_file.vcf  --snps param_{#}


This is the argument to GNU/parallel. Basically, it saves the SNPs to a param_ file, and then call vcftools to calculate Fst. You may want to add an option to remove all the params file afterwards.

0
Entering edit mode

UPDATE: my answer is out of date. VCFtools now contains an option for sliding windows Fst. See Adam's answer for an example.

9
Entering edit mode
8.5 years ago

There is actually an option for doing this in the SVN version of vcftools, which has not yet made it through to the website. Specifically, the commands:

• --fst-window-size <integer>
• --fst-window-step <integer>

can be used to specify window and step sizes with either the --weir-fst-pop or --hapmap-fst-pop options.

1
Entering edit mode

Hi, For Fst calculation , what is good window-size and widow-step we should choose normally, is that in bps or Kbps unit?

Thanks!

0
Entering edit mode

This solution is much better than mine. It reads the file only once, and allows for overlapping wingows. It should be marked as the correct one :-)

0
Entering edit mode

this is great, thanks for pointing it out!