Question

LDhelmet results dependent on input file fragmentation?

0

Entering edit mode

6.5 years ago

khench • 0

Hi everyone!

Currently I'm trying to use LDhelmet (https://sourceforge.net/projects/ldhelmet/) to get the genome wide recombination landscape for a non-model fish.

As input I am using phased resequencing data for several populations containing 12 individuals each. Our reference genome consists of 24 linkage groups, ranging from about 13 to 31 Mb. So, to be able to run LDhelmet for the individual populations, apparently I need to chop the LGs into smaller chunks, in order to reduce the memory needed.

Before, I've tried to run LDhelmet on the complete LGs wich resulted in the funny error:

"Safety check: You probably don't want to analyze such a large number of SNPs at once. The amount of memory required will be at least 20 GB. If you do, you'll have to remove this safety check from the code. Recommendation: Use shorter partitions.."

Alas, I couldn't figure out where this mysterious safety check was hiding. So, although I am happy to provide more than the 20GB (our cluster can handle up to 250GB), I am now trying to use shorter partitions indeed.

Yet, when trying this approach on the example data provided with the LDhelmet software, I stumbled upon an oddity. When comparing the results of the LDhelmet run of the whole dataset to those of the runs on the subsets, the individual runs seem to be scaled differently:

Comparison of the results of the LDhelmet example: black dots are the results from the complete data set, red and orange are the results from the partitioned data

So, right now I'm wondering if:

1) this behavior is expected - shouldn't the estimated recombination rate be similar for the partitioned vs unpartitioned data?

2) partitioning my data will lead to different results compared to a hypothetical run on the complete LGs?

3) the recombination rate of different LGs can be compared? It seems to me that if the individual partitions scale differently, so will the individual LGs.

Can anyone experienced with LDhelmet make sense of this behavior?

Below is the code used to create the partitions & and the comparison:

cp path/to/LDhelmet_v1.9/example_scripts
rm mut_mat.txt

(We don't have ancestral alleles for our non-model species (as we do not have an out group), so I deleted the mutation matrix to use the stationary distribution of P as described in Chan et al. (2012): http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003090#s2)

Hence, I also removed the mutation matrix from two of the example commands:

from max_lk.bash

../ldhelmet max_lk --num_threads 24 -l output/output.lk -p output/output.pade -m mut_mat.txt -s input.fasta

to:

../ldhelmet max_lk --num_threads 24 -l output/output.lk -p output/output.pade -s input.fasta

and from: rjmcmc.bash:

time ../ldhelmet rjmcmc --num_threads 24 -l output/output.lk -p output/output.pade -m mut_mat.txt -s input.fasta -b 50 --burn_in 10000 -n 100000 -o output/output.post

to:

time ../ldhelmet rjmcmc --num_threads 24 -l output/output.lk -p output/output.pade -s input.fasta -b 50 --burn_in 10000 -n 100000 -o output/output.post

Than, I first ran LDhelmet on the whole example data:

bash run_all.bash
mv output/ output_whole/
mv input.fasta input_whole.fasta

Than, I partitioned the input file and reran LDhelmet on the partitions:

first partition:

awk 'NR%2 != 0 {print;} NR%2 == 0 {print substr($1,1,12000)}' input_whole.fasta > input.fasta
bash run_all.bash
mv output/ output_1/

second partition:

awk 'NR%2 != 0 {print;} NR%2 == 0 {print substr($1,12000,25001)}' input_whole.fasta > input.fasta
bash run_all.bash
mv output/ output_2/

Finally, I visualized the results using R:

R
library(ggplot2)
data_whole <- read.csv('output_whole/output.txt',sep=' ',skip = 2)
names(data_whole)[1:6]<-names(data_whole)[2:7];data_whole <- data_whole[,1:6]
data1 <- read.csv('output_1/output.txt',sep=' ',skip = 2)
names(data1)[1:6]<-names(data1)[2:7];data1 <- data1[,1:6]
data2 <- read.csv('output_2/output.txt',sep=' ',skip = 2)
names(data2)[1:6]<-names(data2)[2:7];data2 <- data2[,1:6]

ggplot()+
geom_point(data = data1,aes(x=(left_snp+right_snp)/2,y=mean),col='red',size=2)+
geom_point(data = data2,aes(x=(left_snp+max(data1$right_snp)+right_snp+max(data1$right_snp))/2,y=mean),col='orange',size=2)+
geom_point(data=data_whole,aes(x=(left_snp+right_snp)/2,y=mean),size=.8)

snp genome LDhelmet software error • 2.3k views

ADD COMMENT • link 6.5 years ago by khench • 0

score 0 · Answer 1 · 2017-10-11

0

Entering edit mode

6.5 years ago

khench • 0

hmm - apparently the figure was not added to the question:

link: https://ibb.co/jb1SYw

ADD COMMENT • link 6.5 years ago by khench • 0