Question

Is LD pruning required for a data set from a clonal population to prepare the data (VCF) for ADMIXTURE and/or PCA?

0

Entering edit mode

2.1 years ago

p.m ▴ 20

Dear all, as part of a study we have done full genome sequencing of 74 Toxoplasma gondii isolates (haploid, protozoan zoonotic parasite, genome size about 70 mb) using Illumina technology. The isolates/strains are very closely related to each other (clonal population). In this study, we would like to analyze microdiversity within this clonal population in more detail. For this we used ADMIXTURE and PCA. For these analyses we have performed LD pruning of the SNPs applying PLINK1.9 to prepare the data and have found that we loose very large number of SNPs.

My question: is it absolutely necessary to use LD pruning for the SNP datasets (VCFs) of a clonal population before doing ADMIXTURE and/or PCA?

Thanks for your help

Pavlo

SNP VCF pruning clonal population LD • 1.1k views

ADD COMMENT • link 2.1 years ago by p.m ▴ 20

score 1 · Answer 1 · 2022-04-07

1

Entering edit mode

2.1 years ago

4galaxy77 2.8k

From the ADMIXTURE manual

We tend to believe this is a good idea, since our model does not explicitly take LD into consideration, and since enormous data sets take more time to analyze. It is impossible to “remove” all LD, especially in recently-admixed populations, which have a high degree of “admixture LD”. Two approaches to mitigating the effects of LD are to include markers that are separated from each other by a certain genetic distance, or to thin the markers according the observed sample correlation coefficients. The easiest way is the latter, using the --indep-pairwise option of PLINK. For example, if we start with a file rawData.bed, we could use the following commands to prune according to a correlation threshold and store the pruned dataset in prunedData.bed

Similarly, if you have many SNPs in long-range LD, these will possibly screw up some of your lower PC dimensions which will give you false population structure.

So yes, it is a very good idea to prune. How many SNPs you need depends entirely on how genetically differentiated your populations are. How many are you left after pruning? Perhaps try a less stringent threshold if you are concerned you don't have enough power.

ADD COMMENT • link 2.1 years ago by 4galaxy77 2.8k

0

Entering edit mode

Dear 4galaxy77,

thank you for your very fast help. If I apply the filter for LD pruning 50 5 0.2 I still have 23373 SNPs of 85076. So I try to test different cutoffs as you rocomended.

is there, in general, a rule of thumb as to which cutoffs can be "legally" used/tested? I have tested following combinations

win_size=(8 10 20 40 50)    
vifs=(0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8)
shifts=(2 3 4 5)    
for win in ${win_size[*]};do    
for vif in  ${vifs[*]};do    
for shift in  ${shifts[*]};do    
plink --noweb --file myplink --chr-set -14  -indep-pairwise ${win} ${shift} ${vif} -out ~/SNP_${win}_${shift}_${vif}    
done    
done
done

The next question which arises, when I tested different Cutoffs. Is there any way, or any parameter, which can be used to make a decision which cutoff is stringent enough?

At the moment I try to test all those cutoffs in ADMIXTURE analysis as follows:

win_size=(8 10 20 40 50)    
vifs=(0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8)
shifts=(2 3 4 5)
for win in ${win_size[*]};do
for vif in  ${vifs[*]};do
for shift in  ${shifts[*]};do
for K in 1 2 3 4 5 6 7 8; do admixture --cv -j7 --haploid="*" ${win}_${shift}_${vif}_pruneddata.ped $K | tee log${K}_${win}_${shift}_${vif}.out; done
done
done
done

...and extract the CVE values to check which "K" in the respective Cutoff scenario have the lowest CVE value in overall.

win_size=(8 10 20 40 50)    
vifs=(0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8)
shifts=(2 3 4 5)
for win in ${win_size[*]};do
for vif in  ${vifs[*]};do
for shift in  ${shifts[*]};do
print ${win}_${shift}_${vif}
grep -h CV log*_${win}_${shift}_${vif}.out  > ${win}_${shift}_${vif}.txt
awk '{print $0, FILENAME}' ${win}_${shift}_${vif}.txt >> cv_diff_cutoffs.txt
done
done
done

Finally I get such results:

      library(ggplot2)
      library(dplyr)    
      cv2=read.table("cv_diff_cutoffs.txt",header=TRUE)
      cv2 %>% ggplot(aes(gsub("\\(","",gsub("\\):","",K)), CV)) +
      geom_point(color = "steelblue") + 
      labs(y = "CVR", x = "Number of groups tested by Admixture") + 
      facet_wrap(~ gsub(".txt","",Name), ncol=7)

CVEs from different LD prunning cutoffs

What do you think is it the right way to decide which cutoff should be used for LD pruning?

Thank you for your help Kind regards Pavlo

ADD REPLY • link 2.1 years ago by p.m ▴ 20