Hi everyone!
Currently I'm trying to use LDhelmet (https://sourceforge.net/projects/ldhelmet/) to get the genome wide recombination landscape for a non-model fish.
As input I am using phased resequencing data for several populations of 12 individuals each. Our reference genome consists of 24 linkage groups (LGs), ranging from about 13 to 31 Mb. So, to run LDhelmet on the individual populations, I apparently need to chop the LGs into smaller chunks to reduce the memory required.
Earlier, I tried to run LDhelmet on the complete LGs, which resulted in this amusing error:
"Safety check: You probably don't want to analyze such a large number of SNPs at once. The amount of memory required will be at least 20 GB. If you do, you'll have to remove this safety check from the code. Recommendation: Use shorter partitions.."
Alas, I couldn't figure out where this mysterious safety check was hiding in the code. So, although I'd happily provide more than 20 GB (our cluster can handle up to 250 GB), I am now indeed trying to use shorter partitions.
Yet, when testing this approach on the example data provided with the LDhelmet software, I stumbled upon an oddity: comparing the results of the LDhelmet run on the whole dataset to those of the runs on the subsets, the estimates from the individual runs appear to be scaled differently.
So, right now I'm wondering if:
1) this behavior is expected - shouldn't the estimated recombination rate be similar for the partitioned vs unpartitioned data?
2) partitioning my data will lead to different results compared to a hypothetical run on the complete LGs?
3) the recombination rate of different LGs can be compared? It seems to me that if the individual partitions scale differently, so will the individual LGs.
Can anyone experienced with LDhelmet make sense of this behavior?
Below is the code used to create the partitions and the comparison:
cd path/to/LDhelmet_v1.9/example_scripts
rm mut_mat.txt
(We don't have ancestral alleles for our non-model species (as we do not have an outgroup), so I deleted the mutation matrix in order to use the stationary distribution of P, as described in Chan et al. (2012): http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003090#s2)
Hence, I also removed the mutation matrix from two of the example commands:
from max_lk.bash:
../ldhelmet max_lk --num_threads 24 -l output/output.lk -p output/output.pade -m mut_mat.txt -s input.fasta
to:
../ldhelmet max_lk --num_threads 24 -l output/output.lk -p output/output.pade -s input.fasta
and from rjmcmc.bash:
time ../ldhelmet rjmcmc --num_threads 24 -l output/output.lk -p output/output.pade -m mut_mat.txt -s input.fasta -b 50 --burn_in 10000 -n 100000 -o output/output.post
to:
time ../ldhelmet rjmcmc --num_threads 24 -l output/output.lk -p output/output.pade -s input.fasta -b 50 --burn_in 10000 -n 100000 -o output/output.post
Then, I first ran LDhelmet on the whole example data:
bash run_all.bash
mv output/ output_whole/
mv input.fasta input_whole.fasta
Then, I partitioned the input file and reran LDhelmet on the partitions:
first partition:
awk 'NR%2 != 0 {print;} NR%2 == 0 {print substr($1,1,12000)}' input_whole.fasta > input.fasta
bash run_all.bash
mv output/ output_1/
second partition (the rest of the sequence, starting at base 12001 so it does not overlap the first partition; with no length argument, substr takes the remainder):
awk 'NR%2 != 0 {print;} NR%2 == 0 {print substr($1,12001)}' input_whole.fasta > input.fasta
bash run_all.bash
mv output/ output_2/
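For the real data, the two manual awk calls above can be folded into a small helper that cuts a one-line-per-sequence FASTA into fixed-width, non-overlapping windows. This is only a sketch: the function name partition_fasta and the input_partN.fasta naming are my own, and it assumes all sequences in the file have the same length (as a phased LDhelmet input alignment should).

```shell
# partition_fasta <fasta> <window_bp>:
# writes input_part1.fasta, input_part2.fasta, ... of <window_bp> bases each
# (the last partition holds whatever remains).
partition_fasta() {
    fa=$1
    win=$2
    # total sequence length, taken from the first sequence line
    len=$(awk 'NR==2 {print length($1); exit}' "$fa")
    start=1
    part=1
    while [ "$start" -le "$len" ]; do
        awk -v s="$start" -v w="$win" \
            'NR%2==1 {print} NR%2==0 {print substr($1, s, w)}' \
            "$fa" > "input_part${part}.fasta"
        start=$((start + win))
        part=$((part + 1))
    done
}
```

Usage would then be e.g. partition_fasta input_whole.fasta 12000, followed by the run_all.bash / mv steps for each partition file in turn.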
Finally, I visualized the results using R:
R
library(ggplot2)
# LDhelmet's text output has two comment lines, then a header line that
# itself starts with '#'; read.csv therefore sees one spurious leading
# column, so shift the column names left by one and drop the last column.
read_ldhelmet <- function(path) {
  d <- read.csv(path, sep = ' ', skip = 2)
  names(d)[1:6] <- names(d)[2:7]
  d[, 1:6]
}
data_whole <- read_ldhelmet('output_whole/output.txt')
data1 <- read_ldhelmet('output_1/output.txt')
data2 <- read_ldhelmet('output_2/output.txt')
# shift partition-2 coordinates back onto the whole-sequence axis
offset <- max(data1$right_snp)
ggplot() +
  geom_point(data = data1, aes(x = (left_snp + right_snp) / 2, y = mean), col = 'red', size = 2) +
  geom_point(data = data2, aes(x = (left_snp + right_snp) / 2 + offset, y = mean), col = 'orange', size = 2) +
  geom_point(data = data_whole, aes(x = (left_snp + right_snp) / 2, y = mean), size = .8)
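To put a rough number on the scaling difference, one can also average the per-interval posterior mean rho of each run directly from the text output. A minimal sketch (the helper name avg_rho is my own; it assumes the layout read by the R code above, i.e. three '#' header lines and the posterior mean in column 3):

```shell
# avg_rho <output.txt>: average of column 3 (posterior mean rho/bp)
# over all data rows, skipping the three '#' header lines.
avg_rho() {
    awk 'NR > 3 { sum += $3; n++ } END { if (n) printf "%.6f\n", sum / n }' "$1"
}
```

Comparing e.g. avg_rho output_whole/output.txt against avg_rho output_1/output.txt and avg_rho output_2/output.txt should make any systematic scaling between the partitioned and unpartitioned runs visible at a glance.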