Sorting the haplotypes by similarities of SNPs in R
5.1 years ago
genogeno

I have a data set that I want to sort in R in the following way. I hope I can explain it clearly.

1- Sort by the elements seen in the main column (focal SNP). This will give us two chunks, one chunk with all As and one chunk with all Gs.
2- Then, for the first chunk, move to the -1 column and sort by the elements seen there (there are two elements, C/T). This will break the first chunk into two smaller chunks: one with A at the main column and C at the -1 column, and one with A at the main column and T at the -1 column.
3- For the second chunk, move to the -1 column and do the same. I will end up with two smaller chunks: one with G at the main column and C at the -1 column, and one with G at the main column and T at the -1 column.
4- Move to the +1 column and do the same. At each step, I will end up partitioning each of the existing chunks into two new chunks.

In my data, the column names are positions (bp) and the rows are haplotypes.

I do not want to break the row pattern. I want to sort the rows (swap the arrangement of the rows), but I won't re-arrange the columns. How can I do that?

An idea: when I did this sorting by hand, the result had the shape of a normal distribution. That is why I gave every column a weight obtained from the normal distribution function. I then computed a weighted covariance matrix (number of rows x number of rows) from the dissimilarity coefficients between rows and the weights. Finally, I ranked the data using the eigenvectors of the correlation matrix, which includes a penalty for missing data. However, I could not reproduce the result that I reached by hand. My data set is very large, so I am sharing only a small part of it.

-7  -6  -5  -4  -3  -2  -1  Main    1   2   3   4
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   T   C   G   C   T   C   G   G   G   T   G
A   C   C   A   C   C   T   A   G   A   T   G
G   C   T   G   C   T   T   G   G   G   T   G
A   C   C   A   C   C   T   G   G   A   T   G
G   C   T   G   C   T   T   G   G   G   T   G
A   C   C   A   C   C   T   G   G   A   T   G
A   C   C   A   C   C   T   G   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   G   G   G   T   G
A   C   C   A   T   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   G   C   T   T   G   A   G   C   T
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G

This will order the dataframe by "Main" and "-1" (minus1). You probably should not use numbers as headers.

dat[order(dat$Main,dat$minus1),]

where dat is your full data frame
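
On the point about numeric headers: a small sketch of what R does with them on import. read.csv() and read.table() pass column names through make.names() by default (check.names = TRUE), so "-1" does not survive as a name. The "minus"/"plus" renaming below is only one possible convention; "plus" is an assumption, chosen to match the "minus1" used above.

```r
# A subset of the headers from the posted table, for illustration.
raw_names <- c("-7", "-1", "Main", "1", "4")

# What read.csv()/read.table() produce with the default check.names = TRUE.
syntactic <- make.names(raw_names)
print(syntactic)
# "X.7"  "X.1"  "Main" "X1"   "X4"

# One possible hand-made alternative (hypothetical naming convention).
tidy_names <- ifelse(
  grepl("^-", raw_names), sub("^-", "minus", raw_names),
  ifelse(grepl("^[0-9]", raw_names), paste0("plus", raw_names), raw_names)
)
print(tidy_names)
# "minus7" "minus1" "Main"   "plus1"  "plus4"
```

With the default import, the -1 column would therefore be addressed as dat$X.1 rather than dat$minus1 unless the columns are renamed first.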

Thank you! Unfortunately, it doesn't give what I want. I guess it is more complicated than that.

See if this works. test.txt contains the text from the original post as tab-separated values:

df <- read.csv("test.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")
df[order(df[, "Main"], df[, "X.1"], df[, "X1"]), ]


or

df <- read.csv("test.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")
library(dplyr)
dplyr::arrange(df, Main, X.1, X1)


output:

> df[order(df[,"Main"], df[,"X.1"], df[,"X1"]),]
   X.7 X.6 X.5 X.4 X.3 X.2 X.1 Main X1 X2 X3 X4
1    A   C   C   A   C   C   T    A  G  A  T  G
2    A   C   C   A   C   C   T    A  G  A  T  G
3    A   C   C   A   C   C   T    A  G  A  T  G
5    A   C   C   A   C   C   T    A  G  A  T  G
11   A   C   C   A   C   C   T    A  G  A  T  G
12   A   C   C   A   C   C   T    A  G  A  T  G
13   A   C   C   A   C   C   T    A  G  A  T  G
14   A   C   C   A   C   C   T    A  G  A  T  G
15   A   C   C   A   C   C   T    A  G  A  T  G
16   A   C   C   A   C   C   T    A  G  A  T  G
17   A   C   C   A   C   C   T    A  G  A  T  G
18   A   C   C   A   C   C   T    A  G  A  T  G
19   A   C   C   A   C   C   T    A  G  A  T  G
20   A   C   C   A   C   C   T    A  G  A  T  G
21   A   C   C   A   C   C   T    A  G  A  T  G
23   A   C   C   A   T   C   T    A  G  A  T  G
24   A   C   C   A   C   C   T    A  G  A  T  G
25   A   C   C   A   C   C   T    A  G  A  T  G
27   A   C   C   A   C   C   T    A  G  A  T  G
28   A   C   C   A   C   C   T    A  G  A  T  G
29   A   C   C   A   C   C   T    A  G  A  T  G
30   A   C   C   A   C   C   T    A  G  A  T  G
31   A   C   C   A   C   C   T    A  G  A  T  G
32   A   C   C   A   C   C   T    A  G  A  T  G
33   A   C   C   A   C   C   T    A  G  A  T  G
34   A   C   C   A   C   C   T    A  G  A  T  G
4    A   T   C   G   C   T   C    G  G  G  T  G
26   A   C   C   G   C   T   T    G  A  G  C  T
6    G   C   T   G   C   T   T    G  G  G  T  G
7    A   C   C   A   C   C   T    G  G  A  T  G
8    G   C   T   G   C   T   T    G  G  G  T  G
9    A   C   C   A   C   C   T    G  G  A  T  G
10   A   C   C   A   C   C   T    G  G  A  T  G
22   A   C   C   A   C   C   T    G  G  G  T  G
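
If the intended procedure is a nested partition (Main first, then -1, then +1, and so on outwards), it amounts to a lexicographic sort with the columns taken in that sequence. The sketch below extends the three-column call to every column. It assumes the alternation continues as -2, +2, -3, +3, ... (the question only spells out Main, -1, +1), and it uses a small made-up table rather than the posted data. col_offset() is a hypothetical helper written for this example.

```r
# Small stand-in table; column names follow read.csv()'s default renaming.
df <- data.frame(
  X.2  = c("C", "T", "C", "T", "C"),
  X.1  = c("T", "C", "T", "T", "T"),
  Main = c("A", "G", "G", "G", "A"),
  X1   = c("G", "G", "G", "A", "G"),
  X2   = c("A", "G", "A", "G", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: recover the signed offset from a column name.
col_offset <- function(nm) {
  nm <- sub("^X\\.", "-", nm)  # "X.2" -> "-2"
  nm <- sub("^X", "", nm)      # "X1"  -> "1"
  nm[nm == "Main"] <- "0"      # focal SNP sits at offset 0
  as.numeric(nm)
}

offset <- col_offset(names(df))

# Sort keys: closest to the focal SNP first, left neighbour before right.
key_order <- order(abs(offset), offset)
names(df)[key_order]
# "Main" "X.1"  "X1"   "X.2"  "X2"

# Lexicographic row sort on those keys; the columns stay where they are.
sorted <- df[do.call(order, unname(as.list(df[key_order]))), ]
print(sorted)
```

Only the rows are permuted here; the column order of the data frame is left untouched, and identical haplotypes stay next to each other.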