Question

calculate nucleotide diversity from whole-genome-sequence data for individual genes

1

Entering edit mode

5 months ago

J ▴ 10

I am trying to calculate per gene nucleotide diversity (pi) for whole-genome-sequence data. I basically have whole genome resequenced data for many hundred individuals with ~1.2 million SNPs and a well annotated species with 36k genes. I was wondering if there is a method that would calculate per gene nucleotide diversity for whole genome sequencing data, ideally from a VCF file and in command line?

So far, I tried calculating pi with vcftools - - window-pi as well as - - site-pi but the window approach is not useful as my genes do not regularly distribute along windows and gaps. For the - -site-pi there is no explanation on how it is calculated and more problematically it does calculate more positions than exist. Another option was DnaSP6, but here I would need to produce vcf files for each gene and as piping or merging is not possible, all files would need to be uploaded manually.

genomics nucleotide_diversity vcf • 1.2k views

ADD COMMENT • link updated 3 months ago by sapuizait ▴ 10 • written 5 months ago by J ▴ 10

2

Entering edit mode

I would do the following - make a single file per gene that contains the start and stop positions of the exons --> supply per gene exon coordinates as bed file and ask vcftools to calculate the pi statistic (per site) --> take the average of per site pi statistic to get an overall "per-gene" --> loop this over all genes

If site-specific pi does not work for you and is not what you want then you could try calculating it yourself by applying the classic 2pq formula (p is the frequency of ancestral and q is the frequency of derived) and normalize it by the total gene length.

ADD REPLY • link 5 months ago by manaswwm ▴ 510

0

Entering edit mode

loops , loops, loops : https://ryanstutorials.net/bash-scripting-tutorial/bash-loops.php

ADD REPLY • link 5 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks, but I still need a method that actually calculates per gene nucleotide diversity. And as mentioned DnaSP6 does not give the option of batch-running vcf files, which might have been a solution otherwise. Secondly, DnaSP6 is a Windows-based program that makes piping into it a hazle.

ADD REPLY • link 5 months ago by J ▴ 10

0

Entering edit mode

You could also try the R package PopGenome, I've only ever used it in windowed analyses, but I'm sure you could extract values from specific coordinates.

Alternatively, you could write a loop that subsets the VCF for the locus of interest, run VCFtools, emit the values, and repeat across all loci.

ADD REPLY • link 5 months ago by dthorbur ★ 2.0k

0

Entering edit mode

This is not really answering the question - one can loop say VCFtools, but it would still be windows/single site based. DnaSP also does not support command line commands.

ADD REPLY • link 5 months ago by KL_STKLBK • 0

0

Entering edit mode

Did you ever find an easy way to do this? Would be very interested to know if its feasible - looking to implement sth like this using bacterial genomes

ADD REPLY • link 3 months ago by sapuizait ▴ 10