I’m trying to make an Fst table from my ddRAD data that is currently in VCF format. I’ve done this once before but I’m sure there are better, less convoluted ways to show Fst values. Is there a package (in R) or program that is recommended to make a table showing Fst. I used hierfstat in R once but it seemed overly complicated. I have multiple populations, around 8 populations! Thank you.
For ddRAD data in VCF format with multiple populations (e.g., your 8 pops), the dartR package in R is a great choice—it's specifically designed for reduced-representation sequencing like ddRAD/DArTseq, handles VCF import nicely, and makes pairwise Fst calculations straightforward without the hassle you saw in hierfstat. It outputs a tidy table of pairwise Fst values (plus bootstraps for CIs if needed) and integrates well with filtering steps if you want to QC your SNPs first.
Quick Setup
Install dartR (and dependencies) if you haven't:
install.packages("dartR")
# Or from GitHub for the latest: devtools::install_github("owensgl/dartR.base")
Workflow
Import your VCF into a genlight object (dartR's format). You'll need a population map file (simple tab-delimited text: first column sample IDs matching your VCF, second column pop names—e.g., "Pop1", "Pop2", etc. for your 8 groups).
Example popmap.txt:
Sample1 Pop1
Sample2 Pop1
Sample3 Pop2
...
Code:
library(dartR)
# Read VCF (assumes gzipped; adjust path as needed)
gl <- gl.read.vcf(vcf.file = "your_data.vcf.gz",
pop.file = "popmap.txt", # your pop assignments
n.cores = 4) # parallelize if big dataset
# Optional: Quick summary to check import
gl.report(gl)
(Optional but recommended) Filter SNPs/individuals for quality—e.g., call rate >80%, minor allele freq >5%. This avoids biases in Fst.
gl_filt <- gl.filter.callrate(gl, threshold = 0.8)
gl_filt <- gl.filter.maf(gl_filt, threshold = 0.05)
gl_filt <- gl.filter.monomorphs(gl_filt) # drop monomorphic loci
Calculate pairwise Fst. This uses Weir & Cockerham's estimator and can include bootstraps (e.g., 100 reps for 95% CIs).
# Pairwise Fst table (no bootstrap for speed)
fst_table <- gl.fst.pop(gl_filt, nboot = 0)
print(fst_table) # Outputs a matrix-like table of pairwise Fst
# With bootstraps for CIs (slower but better)
fst_boot <- gl.fst.pop(gl_filt, nboot = 100)
print(fst_boot) # Includes mean Fst, lower/upper CIs per pair
The output is a clean data frame/matrix you can export as CSV (write.csv(fst_table, "fst_pairs.csv")) or plot as a heatmap (e.g., via gl.fst.plot(fst_boot) in dartR, or pheatmap).
If your VCF is huge, this runs efficiently with parallelization. For more options (e.g., overall Fst across all pops), check gl.fst.overall(). Docs and vignettes are solid: ?gl.fst.pop or https://cran.r-project.org/package=dartR.
That worked, thank you! I’m confused about the output of the p-values. I have two that are above zero and less than .05 but the other values all show 0. Do you know why this is?
Thanks for the follow-up - that's a common 'gotcha' in dartR.
The p-values are derived from bootstrapping - they're the proportion of bootstrap replicates. A p-value of exactly 0 means none of your nboot replicates hit that threshold; so the observed is really highly statistically significant - strong population structure. It's not an error or lack of significance; R just rounds tiny values to 0 for display.
If you increase nboot to 1000+, you'll see more granular p-values's (e.g., 0.001), but it'll take longer to run.
We are working at the limits of machine level code here :)
That worked, thank you! I’m confused about the output of the p-values. I have two that are above zero and less than .05 but the other values all show 0. Do you know why this is?
Fantastic!
Thanks for the follow-up - that's a common 'gotcha' in dartR.
The p-values are derived from bootstrapping - they're the proportion of bootstrap replicates. A p-value of exactly 0 means none of your
nbootreplicates hit that threshold; so the observed is really highly statistically significant - strong population structure. It's not an error or lack of significance; R just rounds tiny values to 0 for display.If you increase
nbootto 1000+, you'll see more granular p-values's (e.g., 0.001), but it'll take longer to run.We are working at the limits of machine level code here :)