Question: Allelic Distance calculation using PED format data
0
aritra9060 wrote:

Hi,

I need to find the pairwise allelic distance between individuals from a PLINK (PED/MAP) file. I have written a Python code for it and it's taking a long time to compute the similarity distance matrix. My input file is huge (~25 GB) with 5000 individuals and ~2.6 million SNPs. I couldn't find any software package to compute the allelic distance and hence I decided to write my own code. This should be pretty straight forward in Matlab, but, because of the sheer volume of the data, I decided to go with Python (Though, I'm very new to Python and learning my trade around it). I am stating the problem here:

I have a file as such:

```FAM001  1  0 0  1  2  A A  G G  A C
FAM002  2  0 0  1  2  A A  A G  0 0 ```

And the allelic distance metric is :

For each SNP count how many alleles differ between two samples at the same locus. If the first sample is AA and the second is AG, then the allelic distance is 1, if the first is AA and the second is GG, then it is two, if they are both the same, it is zero, etc. After finding this for all the locus which is about (~1.3 mil) Sum up over all SNPs and this should be the allelic distance. If there are missing entries in either sample, ignore that SNP. To normalize, we can divide the sum by the number of SNPS taken to compute this, to maintain generality.

My code is this :

import numpy as np
import difflib

lines = np.genfromtxt('data_ped.txt')
s = (len(lines),len(lines))
mat = np.zeros(s)

for i in np.arange(len(lines)):
val = np.split(lines[i],2)
j = i+1
for j in (np.arange(len(lines)-1)):
c_val  = []
newval = np.split(lines[j],2)
for k in np.arange(len(val)):
matching = difflib.SequenceMatcher(None,a=val[k], b=newval[k])
match = matching.find_longest_match(0,len(matching.a),0,len(matching.b))
count = len(matching.a[match.a:match.a+match.size])
c_val.append(count)

(mat[i])[j] = sum(c_val)/len(val)

np.savetxt('opma.txt',mat)

This is taking awfully long to compute and I am not too sure whether this works as well. It'd be great if someone can interpret the problem and help me with this. Or, if anyone has any knowledge about a software that I can use to compute this, that'd be great as well.

Thanks!

snp plink allele python genome • 2.0k views
modified 5.3 years ago by chrchang5237.3k • written 5.3 years ago by aritra9060

Hello aritra90!

It appears that your post has been cross-posted to another site: http://stackoverflow.com/questions/31767307

This is typically not recommended as it runs the risk of annoying people in both communities.

Thanks for bringing my attention to this. I wasn't sure where to post this question as I haven't used the forums extensively before. Sorry about that. I'll remove the other post.

3
chrchang5237.3k wrote:

`plink --file [.ped/.map prefix] --distance 1-ibs flat-missing`

(I suggest converting to PLINK binary format first, then using --bfile; otherwise PLINK will create binary-format temporary files anyway and throw them away at the end of the computation.)  See https://www.cog-genomics.org/plink2/distance#distance for more options.

I checked this earlier, but was skeptical. I see you use Hamming Distance, which is kind of what I'm looking at as well. You think this will do the exact same thing, what I intend for ? Thanks for the link, btw :)

PLINK 1.9 is insanely faster than the original PLINK, its still running the IBS distance and it has been about 14 hours. But, this took approx 7-8 mins to compute the distance matrix. Thank you for the link again.

Yes, I am the developer.

1.9 actually started out as a pure distance matrix calculator, so that part of the program is especially well-optimized.