Calculating pi and phi from VCF files on R
1
0
Entering edit mode
4.7 years ago
maho4822 • 0

Hello all,

I'm trying to calculate the pi and phi values for 3 different VCF files. I have attempted to download the packages VCFtools, DNAsp, and PopGenome for this. However, I've run into several problems. Neither VCFtools nor DNAsp are compatible with my version of R (version 4.0.3), and when I attempt to read in the vcf file using the readData command in PopGenome, I am met with a "cannot find path !" error. As of right now, my code line looks like this:

GENOME.class <- readData("VCF copies", format="VCF", gffpath="GFF")

I would like to be able to calculate pi and phi of these files on PopGenome. If I can't, though, which versions of R are compatible with VCFtools so that I might try to calculate these statistics there?

Thank you all very much. I'm pretty inexperienced with coding, so any help is greatly appreciated.

SNP VCF R phi pi • 1.9k views
ADD COMMENT
0
Entering edit mode

Have you found a solution to this yet? I am having the same issue!

ADD REPLY
0
Entering edit mode

"cannot find path

ADD REPLY
0
Entering edit mode

A bit late here, but for anyone reading this, this error likely comes from having a space in your filename.

ADD REPLY
0
Entering edit mode
14 hours ago

Hi maho4822,

The "cannot find path" error in PopGenome's readData is almost certainly due to the space in your folder name ("VCF copies"). Rename the folder to something without spaces, e.g. "VCF_files", and ensure your VCF files are uncompressed and inside it. If you have a GFF folder, match the filenames exactly (e.g. "file1.vcf" pairs with "file1.gff").

PopGenome works fine with R 4.0.3 (and later versions), so you should be able to proceed there. Install it via BiocManager::install("PopGenome") if needed. Here's simple code to read your data and calculate pi (nucleotide diversity) for the whole genome—assuming all three VCFs are in the folder and represent one population for now (adjust set.populations if splitting samples into groups):

library(PopGenome)

# Read data (omit gffpath if no GFF files)
GENOME.class <- readData("VCF_files", format = "VCF", gffpath = "GFF")

# Set population(s)—replace with your sample names from get.individuals(GENOME.class)
GENOME.class <- set.populations(GENOME.class, list(c("sample1", "sample2")), diploid = TRUE)

# Calculate diversity stats
GENOME.class <- diversity.stats(GENOME.class)

# View pi
get.diversity(GENOME.class)
# Or directly: GENOME.class@nuc.diversity.within

For windowed pi (e.g. 10kb non-overlapping), add this before diversity.stats:

GENOME.class <- sliding.window.transform(GENOME.class, width = 10000, jump = 10000, type = 2)
GENOME.class <- diversity.stats(GENOME.class)
pi_values <- GENOME.class@nuc.diversity.within / 5000  # Normalise by approx. window size in bp
head(pi_values)

Regarding phi: if you mean a differentiation statistic like PhiST (haplotype-based FST), PopGenome doesn't compute it directly—try the hierfstat or pegas packages for FST alternatives, or use command-line VCFtools (vcftools --vcf yourfile.vcf --weir-fst-pop pop1.txt --weir-fst-pop pop2.txt --out out) if your VCFs are per population. If it's something else, could you clarify?

VCFtools and DnaSP aren't R packages (VCFtools is Perl-based CLI; DnaSP is standalone software), so compatibility isn't an issue there—update to R 4.4+ anyway for better support.

Kevin

ADD COMMENT
0
Entering edit mode

Hi Kevin.

I think use of PopGenome should be -strongly- discouraged because it has been discontinued and removed from CRAN. If you need it it can still be installed on some systems (Linux x86 and Intel Mac) but not on others e.g. Silicon Macs. You won't have a chance to build it with a modern compiler, e.g. clang either (or if you do, please post a patch).

The only way to reliably install it via bioconda into a dedicated environment, like so (and exactly so):

conda create -c bioconda -c conda-forge -n popgenome2 r-popgenome=2.7.2

This will give you r-base 3.6.3 and popgenome 2.7.2. This is the most stable combination I have found. Do not try to install any other R-packages e.g. ggplot2 into that environment to leave versions unchanged.

Your method won't work on a recent R giving the error message:

Warning message:
package ‘PopGenome’ is not available for Bioconductor version '3.22'

Anyway, PopGenome afaik wasn't a Bioconductor package. It would be better to rewrite the whole package in a good way because it has some other problems but its functionality was quite useful. Still I think it should not be used by anyone anymore because it may create irreproducible analyses.

ADD REPLY
0
Entering edit mode

Another comment regarding window size:

  • I am using the exact window (10000 in this case) size to normalize. I am not sure why you would normalize by half the window size.
  • I'd use overlapping windows for smoother sampling.
ADD REPLY

Login before adding your answer.

Traffic: 3968 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6