Question: How To Perform A Fst (Fixation Index) Calculation Over A Vcf File Using A Sliding Window?
2
gravatar for Rubal7
6.5 years ago by
Rubal7770
Rubal7770 wrote:

Hello Everyone,

I have 2 vcf files, each one contains snps from samples from a single different population. I would like to run fst in windows along the genome to compare the populations. I am familiar with using vcf tools and therefore would like to run their weir-fst option. However currently this compares entire vcf files. What would be the best was to run this on smaller windows. I have basic programming skills. I could imagine running it in a loop specifying different predetermined regions. Preferably I would like the output in one file with region coordinates and fst score, for plotting. If there are straightforward ways to do this for VCF format data not using vcf tools I would also be interested.

Thanks in Advance!

Rubal

vcf vcftools • 9.5k views
ADD COMMENTlink modified 5.7 years ago by Biostar ♦♦ 20 • written 6.5 years ago by Rubal7770

You can use the --snps option to select only a subset of SNPs. Thus, you can generate a list of all the SNPs in the two files, split them in sliding windows, and then call vcftools multiple times to calculate Fst for each window. It is a bit of overhead, but it should work..

ADD REPLYlink written 6.5 years ago by Giovanni M Dall'Olio26k
10
gravatar for Giovanni M Dall'Olio
6.5 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

In principle, it would be better to use another tool to do this. For example, have a look at BioPerl and R. Vcftools is designed to parse and filter data, and the options to calculate Fst are still experimental.

Nevertheless, I like challenges :-) so here is a command that will use the GNU/parallel tool to calculate Fst using vcftools, calling it separately for each window of 7 SNPs:

grep -v '^#' ALG11.vcf | gawk '{print $3}'  | parallel -N 7 -j1 "echo {} > param_{#}; vcftools --vcf ALG11.vcf --out ALG11_fst_{#} --hapmap-fst-file second_vcf_file.vcf  --snps param_{#}

Explanation:

Step1:

grep -v '^#' ALG11.vcf | gawk '{print $3}'

This will extract all the SNP id in the file. Replace "ALG11.vcf" with the name of your file.

Step 2:

| parallel -N 7 -j1

This will call the GNU/parallel tool, which here is used to calculate Fst in parallel. Note that this must be the GNU/parallel tool, and not the standard parallel tool that comes installed with coreutils. If you have trouble installing it, I can rewrite the code for xargs, or for the other parallel tool, but I recommend you to install GNU/parallel.

The -N7 option is used to split into windows of 7 SNPs. If you prefer another window size, just use another number here.

Step 3:

"echo {} > param_{#}; vcftools --vcf ALG11.vcf --out ALG11_fst_{#} --hapmap-fst-file second_vcf_file.vcf  --snps param_{#}

This is the argument to GNU/parallel. Basically, it saves the SNPs to a param_ file, and then call vcftools to calculate Fst. You may want to add an option to remove all the params file afterwards.

ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by Giovanni M Dall'Olio26k

UPDATE: my answer is out of date. VCFtools now contains an option for sliding windows Fst. See Adam's answer for an example.

ADD REPLYlink written 3.8 years ago by Giovanni M Dall'Olio26k
9
gravatar for Adam
6.5 years ago by
Adam990
United States
Adam990 wrote:

There is actually an option for doing this in the SVN version of vcftools, which has not yet made it through to the website. Specifically, the commands:

  • --fst-window-size <integer>
  • --fst-window-step <integer>

can be used to specify window and step sizes with either the --weir-fst-pop or --hapmap-fst-pop options.

ADD COMMENTlink written 6.5 years ago by Adam990
1

Hi, For Fst calculation , what is good window-size and widow-step we should choose normally, is that in bps or Kbps unit?

Thanks!

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by SOHAIL300

This solution is much better than mine. It reads the file only once, and allows for overlapping wingows. It should be marked as the correct one :-)

ADD REPLYlink written 6.5 years ago by Giovanni M Dall'Olio26k

this is great, thanks for pointing it out!

ADD REPLYlink written 6.5 years ago by Rubal7770
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1293 users visited in the last hour