Question: LDhelmet results dependent on input file fragmentation?
0
gravatar for khench
18 months ago by
khench0
khench0 wrote:

Hi everyone!

Currently I'm trying to use LDhelmet (https://sourceforge.net/projects/ldhelmet/) to get the genome wide recombination landscape for a non-model fish.

As input I am using phased resequencing data for several populations containing 12 individuals each. Our reference genome consists of 24 linkage groups, ranging from about 13 to 31 Mb. So, to be able to run LDhelmet for the individual populations, apparently I need to chop the LGs into smaller chunks, in order to reduce the memory needed.

Before, I've tried to run LDhelmet on the complete LGs wich resulted in the funny error:

"Safety check: You probably don't want to analyze such a large number of SNPs at once. The amount of memory required will be at least 20 GB. If you do, you'll have to remove this safety check from the code. Recommendation: Use shorter partitions.."

Alas, I couldn't figure out where this mysterious safety check was hiding. So, although I am happy to provide more than the 20GB (our cluster can handle up to 250GB), I am now trying to use shorter partitions indeed.

Yet, when trying this approach on the example data provided with the LDhelmet software, I stumbled upon an oddity. When comparing the results of the LDhelmet run of the whole dataset to those of the runs on the subsets, the individual runs seem to be scaled differently:

Comparison of the results of the LDhelmet example: black dots are the results from the complete data set, red and orange are the results from the partitioned data

So, right now I'm wondering if:

1) this behavior is expected - shouldn't the estimated recombination rate be similar for the partitioned vs unpartitioned data?

2) partitioning my data will lead to different results compared to a hypothetical run on the complete LGs?

3) the recombination rate of different LGs can be compared? It seems to me that if the individual partitions scale differently, so will the individual LGs.

Can anyone experienced with LDhelmet make sense of this behavior?


Below is the code used to create the partitions & and the comparison:

cp path/to/LDhelmet_v1.9/example_scripts
rm mut_mat.txt

(We don't have ancestral alleles for our non-model species (as we do not have an out group), so I deleted the mutation matrix to use the stationary distribution of P as described in Chan et al. (2012): http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003090#s2)

Hence, I also removed the mutation matrix from two of the example commands:

from max_lk.bash

../ldhelmet max_lk --num_threads 24 -l output/output.lk -p output/output.pade -m mut_mat.txt -s input.fasta

to:

../ldhelmet max_lk --num_threads 24 -l output/output.lk -p output/output.pade -s input.fasta

and from: rjmcmc.bash:

time ../ldhelmet rjmcmc --num_threads 24 -l output/output.lk -p output/output.pade -m mut_mat.txt -s input.fasta -b 50 --burn_in 10000 -n 100000 -o output/output.post

to:

time ../ldhelmet rjmcmc --num_threads 24 -l output/output.lk -p output/output.pade -s input.fasta -b 50 --burn_in 10000 -n 100000 -o output/output.post

Than, I first ran LDhelmet on the whole example data:

bash run_all.bash
mv output/ output_whole/
mv input.fasta input_whole.fasta

Than, I partitioned the input file and reran LDhelmet on the partitions:

first partition:

awk 'NR%2 != 0 {print;} NR%2 == 0 {print substr($1,1,12000)}' input_whole.fasta > input.fasta
bash run_all.bash
mv output/ output_1/

second partition:

awk 'NR%2 != 0 {print;} NR%2 == 0 {print substr($1,12000,25001)}' input_whole.fasta > input.fasta
bash run_all.bash
mv output/ output_2/

Finally, I visualized the results using R:

R
library(ggplot2)
data_whole <- read.csv('output_whole/output.txt',sep=' ',skip = 2)
names(data_whole)[1:6]<-names(data_whole)[2:7];data_whole <- data_whole[,1:6]
data1 <- read.csv('output_1/output.txt',sep=' ',skip = 2)
names(data1)[1:6]<-names(data1)[2:7];data1 <- data1[,1:6]
data2 <- read.csv('output_2/output.txt',sep=' ',skip = 2)
names(data2)[1:6]<-names(data2)[2:7];data2 <- data2[,1:6]

ggplot()+
geom_point(data = data1,aes(x=(left_snp+right_snp)/2,y=mean),col='red',size=2)+
geom_point(data = data2,aes(x=(left_snp+max(data1$right_snp)+right_snp+max(data1$right_snp))/2,y=mean),col='orange',size=2)+
geom_point(data=data_whole,aes(x=(left_snp+right_snp)/2,y=mean),size=.8)
ADD COMMENTlink modified 18 months ago • written 18 months ago by khench0
0
gravatar for khench
18 months ago by
khench0
khench0 wrote:

hmm - apparently the figure was not added to the question:

link: https://ibb.co/jb1SYw

ADD COMMENTlink written 18 months ago by khench0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1509 users visited in the last hour