Question: Compare consecutive columns of a file and return the number of non-matching elements
aritra90 wrote:

P.S.: I need to use Python for this.

I have a text file which looks like this:

# sampleID   HGDP00511  HGDP00511  HGDP00512  HGDP00512  HGDP00513  HGDP00513
M rs4124251    0    0    A    G    0    A
M rs6650104    0    A    C    T    0    0
M rs12184279   0    0    G    A    T    0

I want to compare consecutive columns and return the number of non-matching elements, and I want to do this in Python. I did this earlier with Bash and AWK (shell scripting), but it is very slow, as I have a huge amount of data to process. I believe Python would be a faster solution. However, I am very new to Python, and so far I only have something like this:

for line in open("phased.txt"):
    columns = line.split("\t")

    for i in range(len(columns)-1):
        a = columns[i+3]
        b = columns[i+4]
        for j in range(len(a)):
            if a[j] != b[j]:
                print j

which is obviously not working. As I am very new to Python, I don't really know what changes to make to get this working. (This code is completely wrong; I guess I could use difflib, etc., but I have never coded proficiently in Python before, so I am hesitant about how to proceed.)

I want to compare each column (starting from the third) against every other column in the file and return the number of non-matching elements. I have 828 columns in total, so I would need 828*828 outputs. (You can think of an n*n matrix where the (i,j)-th element is the number of non-matching elements between columns i and j.) My desired output for the snippet above would be:

3 4: 1
3 5: 3
3 6: 3
...
4 6: 3
... etc.

Any help on this would be appreciated. Thanks.

Tags: beagle, haplotype, python

Nicola Casiraghi commented:

The 0 entries are treated as NA values and therefore not included in the non-matching counts, right? And would an R approach instead of Python be an option?
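If the zeros do turn out to be missing data, here is a minimal sketch of a pairwise count that skips them (the helper name and the na_value parameter are illustrative, not from the original post):

# Count mismatches between two genotype columns, treating '0' as NA
# and skipping any position where either column has the NA value.
def count_mismatches(col_a, col_b, na_value="0"):
    return sum(a != b
               for a, b in zip(col_a, col_b)
               if a != na_value and b != na_value)

# For columns 3 and 5 of the example above (['0','0','0'] vs ['A','C','G'])
# this returns 0 instead of 3, because every position contains a '0'.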
george.ry wrote:

If I understand what you're after correctly, then:

with open('test.tsv') as f:
    # Header line: first two fields are labels, the rest are sample IDs.
    line = f.readline().strip().split()
    num_samples = len(line)-2
    # One list per sample column, filled with that sample's genotypes.
    samples = [[] for i in range(num_samples)]
    for line in f:
        line = line.strip().split()
        for s, sample in zip(line[2:], samples):
            sample.append(s)

# Compare every pair of sample columns and count mismatching positions;
# i+3 and j+3 convert sample indices back to 1-based file column numbers.
for i, sample in enumerate(samples[:-1]):
    for j in range(i+1, num_samples):
        print(i+3, j+3, sum(a != b for a, b in zip(sample, samples[j])))

If you're using Python 2, then put these at the very top of the file beforehand (the __future__ import has to come before any other import):

from __future__ import print_function
from itertools import izip as zip
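For the full 828 x 828 matrix on very large files, a NumPy-based sketch along the same lines may be faster (this is an addition to the answer above, assuming the whitespace-separated layout shown in the question):

import numpy as np

# Read only the genotype columns (everything after the first two fields),
# skipping the header line; layout assumed from the example in the question.
with open("phased.txt") as f:
    next(f)
    data = np.array([line.split()[2:] for line in f])

n_cols = data.shape[1]
mismatches = np.zeros((n_cols, n_cols), dtype=int)
for i in range(n_cols):
    # Rows where column i differs from every other column, counted at once.
    mismatches[:, i] = (data != data[:, i:i+1]).sum(axis=0)

# mismatches[i, j] is the number of non-matching rows between file
# columns i+3 and j+3 (the first two columns are the marker and rsID).
print(mismatches)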

aritra90 replied:

George, I can't thank you enough. You were spot on! This has increased my interest in Python :)
Aerval wrote:

I would do something like this:

rows = []
with open("phased.txt") as f:
    for line in f:
        rows.append(line.strip().split("\t"))

for i, rowi in enumerate(rows[1:]): # skipping the first row because it is the column description
    for j, rowj in enumerate(rows[1:]):
        matches = 0
        for n in range(len(rowi)-2): # skipping the first two columns
            if rowi[n+2] == rowj[n+2]:
                matches += 1
        print i, j, matches


Note that I am not sure whether this is much faster than Bash (especially because it is just printing to the terminal rather than writing to a file).
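If printing to the terminal is the bottleneck, here is a minimal sketch of the same loop that buffers the counts into a file instead (the output filename is an assumption):

with open("phased.txt") as f:
    rows = [line.strip().split("\t") for line in f]

with open("matches.txt", "w") as out:  # output filename is an assumption
    for i, rowi in enumerate(rows[1:]):
        for j, rowj in enumerate(rows[1:]):
            # Count matching entries, skipping the first two columns.
            matches = sum(a == b for a, b in zip(rowi[2:], rowj[2:]))
            out.write("%d %d %d\n" % (i, j, matches))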


aritra90 replied:

Thanks for the help, much appreciated!