P.S: I need to use Python for this.
I have a text file which looks like this:
    # sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513
    M rs4124251 0 0 A G 0 A
    M rs6650104 0 A C T 0 0
    M rs12184279 0 0 G A T 0
I want to compare consecutive columns and return the number of matching elements, and I want to do this in Python. Earlier, I did it with Bash and AWK (shell scripting), but it's very slow because I have a huge amount of data to process. I believe Python would be a faster solution. However, I am very new to Python, and so far I only have something like this:
    for line in open("phased.txt"):
        columns = line.split("\t")
        for i in range(len(columns)-1):
            a = columns[i+3]
            b = columns[i+4]
            for j in range(len(a)):
                if a[j] != b[j]:
                    print j
which is obviously not working. As I am very new to Python, I don't really know what changes to make to get it to work. (This code is completely wrong, and I guess I could use difflib, etc., but I have never coded proficiently in Python before, so I am hesitant to proceed.)
I want to compare each column (starting from the third) to every other column in the file and return the number of non-matching elements. I have 828 columns in total, hence I would need 828*828 outputs. (You can think of an n*n matrix where the (i,j)th element is the number of non-matching elements between columns i and j.) My desired output for the above snippet would be:
    3 4: 1
    3 5: 3
    3 6: 3
    ......
    4 6: 3
    ..etc
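To make the comparison concrete, here is a minimal sketch of the pairwise count, not a tuned solution. It assumes that the fields are separated by whitespace, that lines starting with `#` are headers, and that the first two fields of each row (the `M` and the rs ID) are not data. The function name `mismatch_counts` and the parameter `first_data_col` are made up for this illustration.

```python
from itertools import combinations


def mismatch_counts(lines, first_data_col=2):
    """Count differing entries for every pair of data columns.

    lines: iterable of whitespace-separated text rows.
    first_data_col: 0-based index of the first data column (assumed to be 2).
    Returns a dict mapping 1-based column pairs (i, j) with i < j to counts.
    """
    rows = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        rows.append(line.split()[first_data_col:])

    # Transpose the rows so that each tuple holds one column's values.
    columns = list(zip(*rows))

    counts = {}
    for (i, a), (j, b) in combinations(enumerate(columns), 2):
        # Shift back to the 1-based numbering of the full file.
        key = (i + first_data_col + 1, j + first_data_col + 1)
        counts[key] = sum(x != y for x, y in zip(a, b))
    return counts


sample = """\
# sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513
M rs4124251 0 0 A G 0 A
M rs6650104 0 A C T 0 0
M rs12184279 0 0 G A T 0
"""

result = mismatch_counts(sample.splitlines())
for (i, j), n in sorted(result.items()):
    print("%d %d: %d" % (i, j, n))
```

For a real file, the same function could be called as `mismatch_counts(open("phased.txt"))`. Only pairs with i < j are computed, since the count is symmetric and a column never differs from itself. With hundreds of columns and many rows this pure-Python loop may still be slow, in which case an array library such as NumPy would be worth trying.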
Any help on this would be appreciated. Thanks.