Sorting the haplotypes by similarities of SNPs in R
5.1 years ago
genogeno • 0

I have a data set and I want to sort it in the following way in R. I hope I can explain clearly.

1- Sort by the elements in the main column (the focal SNP). This gives two chunks, one with all As and one with all Gs.
2- For the first chunk, move to the -1 column and sort by the elements seen there (there are two elements, C/T). This breaks the first chunk into two smaller chunks: one with A in the main column and C in the -1 column, and one with A in the main column and T in the -1 column.
3- For the second chunk, move to the -1 column and do the same. This gives two smaller chunks: one with G in the main column and C in the -1 column, and one with G in the main column and T in the -1 column.
4- Move to the +1 column and do the same. At each step, each existing chunk is partitioned into up to two new chunks.
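
The steps above amount to a lexicographic sort: the focal column is the first key, and the flanking columns follow as tie-breakers. A minimal sketch in base R, assuming the pattern continues outward as -1, +1, -2, +2 and so on (the steps only spell it out up to +1). `sort_by_focal` and `hap` are hypothetical names, and the six rows are rows 1, 4, 6, 7, 26 and 22 of the sample data restricted to columns -2 to 2:

```r
# Sort rows lexicographically: focal column first, then flanking columns
# in the order -1, +1, -2, +2, ... (assumed continuation of the steps above).
sort_by_focal <- function(hap, focal = "Main") {
  f <- match(focal, colnames(hap))            # index of the focal column
  stopifnot(!is.na(f))
  offsets <- seq_len(ncol(hap)) - f           # signed distance from the focal column
  # Nearest columns first; at equal distance the left (negative) side comes first.
  key_order <- order(abs(offsets), offsets)
  keys <- unname(as.list(hap[, key_order, drop = FALSE]))
  hap[do.call(order, keys), , drop = FALSE]   # rows move, columns stay in place
}

hap <- data.frame(
  `-2` = c("C", "T", "T", "C", "T", "C"),
  `-1` = c("T", "C", "T", "T", "T", "T"),
  Main = c("A", "G", "G", "G", "G", "G"),
  `1`  = c("G", "G", "G", "G", "A", "G"),
  `2`  = c("A", "G", "G", "A", "G", "G"),
  row.names = c("1", "4", "6", "7", "26", "22"),
  check.names = FALSE, stringsAsFactors = FALSE
)

sorted <- sort_by_focal(hap)
rownames(sorted)   # row labels in their new order
```

Because `order()` keeps tied rows in their original relative order, identical haplotypes stay together and in input order.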

Actually, the column names are positions (bp) in my data and the rows are haplotypes.

I do not want to break up the rows. I want to reorder the rows only (swap their arrangement), without rearranging the columns. How can I do that?

An idea: when I did this sorting by hand, the result had the shape of a normal distribution. That is why I gave every column a weight obtained from the normal distribution function. I then computed a weighted covariance matrix (number of rows x number of rows) from the dissimilarity coefficient between rows and these weights, and ranked the data using the eigenvectors of the correlation matrix, which includes a penalty for missing data. However, I could not reproduce the result I got by hand. My data set is very large, so I am sharing only a small part of it.
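
The weighting idea above could be sketched roughly as follows. This is only one possible reading, not the exact computation described. It uses a weighted mismatch (Hamming-style) distance with normal-density column weights, it skips missing calls instead of penalising them, and it swaps in classical multidimensional scaling (`cmdscale`, which works from the eigenvectors of the doubly centred squared-distance matrix) for the eigenvector ranking. `weighted_mismatch`, `m` and the `sd` value are assumptions made for illustration. The four rows are rows 1, 7, 4 and 26 of the sample data restricted to columns -2 to 2:

```r
# Weighted mismatch distance between haplotypes (rows of a character matrix).
# Assumption: column weights follow a normal density centred on the focal
# column; `sd` (measured in columns) is a free parameter, not a value from the post.
weighted_mismatch <- function(m, focal, sd = 2) {
  w <- dnorm(seq_len(ncol(m)) - focal, mean = 0, sd = sd)
  n <- nrow(m)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      ok <- !is.na(m[i, ]) & !is.na(m[j, ])   # ignore columns with a missing call
      mismatch <- m[i, ok] != m[j, ok]
      d[i, j] <- sum(w[ok] * mismatch) / sum(w[ok])
    }
  }
  d
}

m <- rbind(
  c("C", "T", "A", "G", "A"),
  c("C", "T", "G", "G", "A"),
  c("T", "C", "G", "G", "G"),
  c("T", "T", "G", "A", "G")
)

d <- weighted_mismatch(m, focal = 3)          # column 3 is the focal SNP here
# One-dimensional classical MDS; rows are then ordered along that axis.
ord <- order(cmdscale(as.dist(d), k = 1)[, 1])
m[ord, ]
```

A distance-based ordering like this groups similar haplotypes but does not have to reproduce the strict chunk-by-chunk partition, which may be why the two results differ. The double loop is quadratic in the number of rows, so it is only practical for small subsets.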

-7  -6  -5  -4  -3  -2  -1  Main    1   2   3   4
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   T   C   G   C   T   C   G   G   G   T   G
A   C   C   A   C   C   T   A   G   A   T   G
G   C   T   G   C   T   T   G   G   G   T   G
A   C   C   A   C   C   T   G   G   A   T   G
G   C   T   G   C   T   T   G   G   G   T   G
A   C   C   A   C   C   T   G   G   A   T   G
A   C   C   A   C   C   T   G   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   G   G   G   T   G
A   C   C   A   T   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   G   C   T   T   G   A   G   C   T
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G

SNP clustering haplotype • 1.5k views

This will order the data frame by "Main" and "-1" (renamed here to minus1). You probably should not use numbers as headers.

dat[order(dat$Main, dat$minus1), ]

where dat is your full data frame
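
If you would rather keep the position headers exactly as they are, base R can read them unchanged, and the columns can then be addressed by quoted name. A minimal sketch, assuming tab-separated input like the table in the post. The four rows are a hand-picked subset of its -1, Main and 1 columns, and `txt`, `dat` and `sorted` are illustrative names. `colClasses = "character"` is there because a column containing only T would otherwise be read as logical TRUE:

```r
# Tab-separated text with the numeric headers kept as they are.
txt <- "-1\tMain\t1\nT\tG\tG\nC\tG\tG\nT\tA\tG\nT\tG\tA"

dat <- read.table(text = txt, header = TRUE, sep = "\t",
                  check.names = FALSE,        # do not rename "-1" to "X.1"
                  colClasses = "character")   # keep "T" as text, not TRUE

colnames(dat)

# Non-syntactic names need [[ ]] (or backticks) instead of plain $ access.
sorted <- dat[order(dat$Main, dat[["-1"]], dat[["1"]]), ]
sorted
```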


Thank you! Unfortunately, it doesn't give what I want. I guess it is more complicated than that.


See if this works. Here test.txt contains the data from the original post as tab-separated values.

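# Read the tab-separated data; R renames the headers "-1" and "1" to "X.1" and "X1"
df <- read.csv("test.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")
# Sort rows by the focal column, then the -1 column, then the +1 column
df[order(df[, "Main"], df[, "X.1"], df[, "X1"]), ]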


or

df <- read.csv("test.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")
library(dplyr)
# Same sort with dplyr: Main first, then X.1, then X1
dplyr::arrange(df, Main, X.1, X1)


output:

> df[order(df[, "Main"], df[, "X.1"], df[, "X1"]), ]
   X.7 X.6 X.5 X.4 X.3 X.2 X.1 Main X1 X2 X3 X4
1    A   C   C   A   C   C   T    A  G  A  T  G
2    A   C   C   A   C   C   T    A  G  A  T  G
3    A   C   C   A   C   C   T    A  G  A  T  G
5    A   C   C   A   C   C   T    A  G  A  T  G
11   A   C   C   A   C   C   T    A  G  A  T  G
12   A   C   C   A   C   C   T    A  G  A  T  G
13   A   C   C   A   C   C   T    A  G  A  T  G
14   A   C   C   A   C   C   T    A  G  A  T  G
15   A   C   C   A   C   C   T    A  G  A  T  G
16   A   C   C   A   C   C   T    A  G  A  T  G
17   A   C   C   A   C   C   T    A  G  A  T  G
18   A   C   C   A   C   C   T    A  G  A  T  G
19   A   C   C   A   C   C   T    A  G  A  T  G
20   A   C   C   A   C   C   T    A  G  A  T  G
21   A   C   C   A   C   C   T    A  G  A  T  G
23   A   C   C   A   T   C   T    A  G  A  T  G
24   A   C   C   A   C   C   T    A  G  A  T  G
25   A   C   C   A   C   C   T    A  G  A  T  G
27   A   C   C   A   C   C   T    A  G  A  T  G
28   A   C   C   A   C   C   T    A  G  A  T  G
29   A   C   C   A   C   C   T    A  G  A  T  G
30   A   C   C   A   C   C   T    A  G  A  T  G
31   A   C   C   A   C   C   T    A  G  A  T  G
32   A   C   C   A   C   C   T    A  G  A  T  G
33   A   C   C   A   C   C   T    A  G  A  T  G
34   A   C   C   A   C   C   T    A  G  A  T  G
4    A   T   C   G   C   T   C    G  G  G  T  G
26   A   C   C   G   C   T   T    G  A  G  C  T
6    G   C   T   G   C   T   T    G  G  G  T  G
7    A   C   C   A   C   C   T    G  G  A  T  G
8    G   C   T   G   C   T   T    G  G  G  T  G
9    A   C   C   A   C   C   T    G  G  A  T  G
10   A   C   C   A   C   C   T    G  G  A  T  G
22   A   C   C   A   C   C   T    G  G  G  T  G
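
To see which chunks the three-key sort above produces, the run lengths of the combined key can be tabulated. A small sketch, with the Main, X.1 and X1 columns of the sorted output re-typed by hand from the listing above (`keys`, `label` and `chunks` are illustrative names):

```r
# Key columns of the sorted output, in the order the rows appear in the listing.
keys <- data.frame(
  Main = c(rep("A", 26), rep("G", 8)),
  X.1  = c(rep("T", 26), "C", rep("T", 7)),
  X1   = c(rep("G", 26), "G", "A", rep("G", 6)),
  stringsAsFactors = FALSE
)

# Each run of identical labels is one chunk of the partition.
label  <- paste(keys$Main, keys$X.1, keys$X1, sep = "/")
chunks <- rle(label)
data.frame(chunk = chunks$values, rows = chunks$lengths)
#   chunk rows
# 1 A/T/G   26
# 2 G/C/G    1
# 3 G/T/A    1
# 4 G/T/G    6
```

In this sample every haplotype carrying A at the focal SNP has T at -1 and G at +1, so the first chunk is not split by these two columns. Columns further out would be needed as additional keys to separate it.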
