Question: Allelic Distance calculation using PED format data
0
gravatar for aritra90
3.7 years ago by
aritra9040
United States
aritra9040 wrote:

Hi, 

I need to find the pairwise allelic distance between individuals from a PLINK (PED/MAP) file. I have written a Python code for it and it's taking a long time to compute the similarity distance matrix. My input file is huge (~25 GB) with 5000 individuals and ~2.6 million SNPs. I couldn't find any software package to compute the allelic distance and hence I decided to write my own code. This should be pretty straight forward in Matlab, but, because of the sheer volume of the data, I decided to go with Python (Though, I'm very new to Python and learning my trade around it). I am stating the problem here: 

I have a file as such: 

FAM001  1  0 0  1  2  A A  G G  A C 
FAM002  2  0 0  1  2  A A  A G  0 0 

And the allelic distance metric is : 

For each SNP count how many alleles differ between two samples at the same locus. If the first sample is AA and the second is AG, then the allelic distance is 1, if the first is AA and the second is GG, then it is two, if they are both the same, it is zero, etc. After finding this for all the locus which is about (~1.3 mil) Sum up over all SNPs and this should be the allelic distance. If there are missing entries in either sample, ignore that SNP. To normalize, we can divide the sum by the number of SNPS taken to compute this, to maintain generality. 

 

My code is this : 

      import numpy as np
      import difflib

      lines = np.genfromtxt('data_ped.txt')
      s = (len(lines),len(lines))
      mat = np.zeros(s)

      for i in np.arange(len(lines)):
           val = np.split(lines[i],2)
            j = i+1
            for j in (np.arange(len(lines)-1)):
                c_val  = []
                newval = np.split(lines[j],2)
                for k in np.arange(len(val)):
                        matching = difflib.SequenceMatcher(None,a=val[k], b=newval[k])
                        match = matching.find_longest_match(0,len(matching.a),0,len(matching.b))
                        count = len(matching.a[match.a:match.a+match.size])
                        c_val.append(count)

                (mat[i])[j] = sum(c_val)/len(val)

np.savetxt('opma.txt',mat)

 

This is taking awfully long to compute and I am not too sure whether this works as well. It'd be great if someone can interpret the problem and help me with this. Or, if anyone has any knowledge about a software that I can use to compute this, that'd be great as well. 

Thanks! 

snp plink allele python genome • 1.6k views
ADD COMMENTlink modified 3.7 years ago by chrchang5234.9k • written 3.7 years ago by aritra9040

Hello aritra90!

It appears that your post has been cross-posted to another site: http://stackoverflow.com/questions/31767307

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 3.7 years ago by Pierre Lindenbaum119k

Thanks for bringing my attention to this. I wasn't sure where to post this question as I haven't used the forums extensively before. Sorry about that. I'll remove the other post. 

ADD REPLYlink written 3.7 years ago by aritra9040
3
gravatar for chrchang523
3.7 years ago by
chrchang5234.9k
United States
chrchang5234.9k wrote:

With PLINK 1.9:

plink --file [.ped/.map prefix] --distance 1-ibs flat-missing

(I suggest converting to PLINK binary format first, then using --bfile; otherwise PLINK will create binary-format temporary files anyway and throw them away at the end of the computation.)  See https://www.cog-genomics.org/plink2/distance#distance for more options.

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by chrchang5234.9k

I checked this earlier, but was skeptical. I see you use Hamming Distance, which is kind of what I'm looking at as well. You think this will do the exact same thing, what I intend for ? Thanks for the link, btw :)

P.S: Are you the developer of PLINK 1.9 ? (Just guessing, seeing your username) 

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by aritra9040

PLINK 1.9 is insanely faster than the original PLINK, its still running the IBS distance and it has been about 14 hours. But, this took approx 7-8 mins to compute the distance matrix. Thank you for the link again. 

ADD REPLYlink written 3.7 years ago by aritra9040

Yes, I am the developer.

1.9 actually started out as a pure distance matrix calculator, so that part of the program is especially well-optimized.

ADD REPLYlink written 3.7 years ago by chrchang5234.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1733 users visited in the last hour