Linkage information from unphased VCF files
7.6 years ago
Rubal ▴ 340

Hello Everyone,

I am looking for the best way to get linkage information from unphased whole-genome population data. I have a VCF file with multiple individuals from different populations. The data is unphased, but I would like to detect regions with an excess of linkage disequilibrium as a signal of positive selection. I have not phased the data because I have only a limited number of individuals per population of a non-model species, and I therefore worry that phasing will be very inaccurate.

What do people think would be the best way to detect regions with high levels of linkage disequilibrium? I was thinking that something like the VCFtools --geno-r2 option might be suitable.

Best regards,
Rubal

next-gen vcf genome • 4.5k views
3
7.6 years ago

The --geno-r2 option in vcftools should be enough for your needs; however, you cannot calculate haplotype-based linkage disequilibrium if your data is not phased, and --geno-r2 instead reports the squared correlation between genotypes. If you were studying human individuals, I would suggest imputing more genotypes by merging your data with the 1000 Genomes data, but since you say you are working with a non-model organism, this is not an option. Is there a closely related model organism that you could use to impute data?
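
For reference, --geno-r2 is documented as the squared correlation coefficient between genotypes coded 0, 1 and 2 (the number of non-reference alleles per individual), which is why it needs no phase information. Below is a minimal Python sketch of that statistic for a single pair of sites. It is not vcftools' own code: the function name `geno_r2` and the use of `None` for a missing genotype are assumptions made for illustration.

```python
def geno_r2(site_a, site_b):
    """Squared Pearson correlation between genotype dosages at two sites.

    Each argument is a list with one entry per individual: 0, 1 or 2
    non-reference alleles, or None for a missing call (an assumption of
    this sketch). Individuals missing at either site are skipped.
    """
    pairs = [(a, b) for a, b in zip(site_a, site_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")

    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n

    # Sums of squares and cross-products; the 1/n factors cancel in r^2.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)

    if var_a == 0 or var_b == 0:
        return float("nan")  # a monomorphic site has no defined correlation
    return (cov * cov) / (var_a * var_b)


if __name__ == "__main__":
    # Toy genotypes for five individuals at two sites (made-up values).
    print(geno_r2([0, 1, 2, 1, 0], [0, 1, 2, 2, 0]))
```

With only a handful of individuals per population the estimate for any single pair of sites will be noisy, so averaging r² over many pairs within a region is probably more informative than looking at individual pairs.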

3
7.6 years ago

Giovanni M Dall'Olio is correct: it is advisable to phase the variants. On the off chance that you are interested in haplotype decay, I have a tool that takes an unphased VCF file, phases it, and then calculates XP-EHH. I'm also working on a version for LD.

https://github.com/jewmanchue/vcflib/wiki/Haplotype-Decay

0

That sounds like a promising tool; I will give it a go. The documentation mentions that it will give slightly different results each time because of the stochastic search. Would you recommend running multiple iterations? Also, is there an option for specifying window sizes, or would you do post-hoc averaging of scores across sites within windows? Thanks very much.

0

Running it several times will allow you to generate a confidence interval around the XP-EHH score. Window size is determined by the number of SNPs required for EHH to decay to 0.05 and isn't specified by the user.
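
As a rough illustration of the repeated-runs idea, here is a small Python sketch that summarises one site's score across several runs with a mean and an empirical percentile interval. It is not part of the tool: the function name `percentile_interval`, the 2.5/97.5 defaults, and the example scores are all hypothetical, and how you collect the per-run scores is up to you.

```python
from statistics import mean


def percentile_interval(scores, lower=2.5, upper=97.5):
    """Return (mean, lower bound, upper bound) for one site's scores.

    `scores` holds the score for the same site from independent runs.
    Bounds are empirical percentiles, linearly interpolated between
    the two closest ranks.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one score")

    def pct(p):
        k = (n - 1) * p / 100          # fractional rank of percentile p
        lo = int(k)
        hi = min(lo + 1, n - 1)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)

    return mean(ordered), pct(lower), pct(upper)


if __name__ == "__main__":
    # Made-up XP-EHH scores for a single site from five runs.
    runs = [2.10, 2.35, 1.98, 2.22, 2.41]
    print(percentile_interval(runs))
```

With only a few runs the interval is crude, so it is better read as a spread of the stochastic search than as a formal confidence interval.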