Question: Sorting the haplotypes by similarities of SNPs in R
genogeno wrote (14 months ago):

I have a data set and I want to sort it in the following way in R. I hope I can explain clearly.

1- Sort by the elements seen in the main column (focal SNP). This gives two chunks: one with all As and one with all Gs.
2- For the first chunk, move to the -1 column and sort by the elements seen there (there are two, C/T). This breaks the first chunk into two smaller chunks: one with A in the main column and C in the -1 column, and one with A in the main column and T in the -1 column.
3- For the second chunk, move to the -1 column and do the same. This gives two smaller chunks: one with G in the main column and C in the -1 column, and one with G in the main column and T in the -1 column.
4- Move to the +1 column and do the same. At each step, each existing chunk is partitioned into two new chunks.

In my actual data, the column names are positions (bp) and the rows are haplotypes.

I do not want to break up the rows. I want to reorder whole rows (change the arrangement of the rows) without rearranging the columns. How can I do that?
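Steps 1-4 describe a nested (lexicographic) sort: rows are ordered by the focal column, ties are broken by the -1 column, remaining ties by the +1 column, and so on. A minimal sketch in base R follows, assuming the pattern continues outward as Main, -1, +1, -2, +2, ... (the post only spells out the first three keys). The toy data frame `hap` and the names `offset`, `key_order`, and `row_order` are illustrative and not part of the original data.

```r
# Toy haplotypes (rows) by SNP offset relative to the focal SNP (columns).
# check.names = FALSE keeps headers such as "-1" exactly as written.
hap <- data.frame(
  "-2"   = c("C", "T", "C", "T", "C"),
  "-1"   = c("T", "C", "T", "T", "T"),
  "Main" = c("A", "G", "G", "G", "A"),
  "1"    = c("G", "G", "G", "A", "G"),
  "2"    = c("A", "G", "A", "G", "A"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Signed offset of each column from the focal SNP ("Main" counts as 0).
offset <- as.integer(sub("Main", "0", names(hap)))

# Sort keys in the order Main, -1, +1, -2, +2, ...: smallest absolute
# offset first, negative side before positive side (assumed continuation).
key_order <- order(abs(offset), offset)

# Lexicographic row order over those keys; the columns are not rearranged.
row_order <- do.call(order, unname(as.list(hap[key_order])))
sorted <- hap[row_order, ]
print(sorted)
```

Only the row order changes; `sorted` keeps the columns in their original left-to-right positions, so each haplotype stays intact.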

An idea: when I did this sorting by hand, I got a shape resembling a normal distribution. That is why I gave every column a weight obtained from the normal distribution function. I then computed a weighted covariance matrix (number of rows x number of rows) from the dissimilarity coefficient between rows and these weights. Finally, I ranked the data using the eigenvectors of the correlation matrix, which includes a penalty for missing data. However, I could not reproduce the result I obtained by hand. My data set is very large, so I am sharing only a small part of it.

-7  -6  -5  -4  -3  -2  -1  Main    1   2   3   4
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   T   C   G   C   T   C   G   G   G   T   G
A   C   C   A   C   C   T   A   G   A   T   G
G   C   T   G   C   T   T   G   G   G   T   G
A   C   C   A   C   C   T   G   G   A   T   G
G   C   T   G   C   T   T   G   G   G   T   G
A   C   C   A   C   C   T   G   G   A   T   G
A   C   C   A   C   C   T   G   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   G   G   G   T   G
A   C   C   A   T   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   G   C   T   T   G   A   G   C   T
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
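The weighting idea described before the table could look roughly like the sketch below. This is not the poster's code, which was not shared: the weight width (`sd = 1`), the weighted mismatch dissimilarity, the use of `cmdscale()` for the eigenvector step, and the toy matrix `hap` are all assumptions. Missing calls are simply skipped here, which is simpler than the missing-data penalty mentioned in the post.

```r
# Toy haplotype matrix; columns are offsets -2..2 around the focal SNP (0).
hap <- rbind(
  c("C", "T", "A", "G", "A"),
  c("T", "C", "G", "G", "G"),
  c("C", "T", "G", "G", "A"),
  c("C", "T", "A", "G", "A")
)
offset <- -2:2

# Column weights from a normal density centred on the focal SNP.
# sd = 1 is an arbitrary, assumed choice.
w <- dnorm(offset, mean = 0, sd = 1)
w <- w / sum(w)

# Weighted mismatch dissimilarity between every pair of rows.
# Positions with a missing call (NA) in either row are skipped.
n <- nrow(hap)
d <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    ok <- !is.na(hap[i, ]) & !is.na(hap[j, ])
    d[i, j] <- sum(w[ok] * (hap[i, ok] != hap[j, ok])) / sum(w[ok])
  }
}

# Rank rows along the leading axis of a classical MDS of d
# (an eigen-decomposition of the double-centred squared distances).
mds <- cmdscale(as.dist(d), k = 1)
row_order <- order(mds[, 1])
print(hap[row_order, ])
```

An ordering of this kind places similar haplotypes near each other but does not guarantee the strict chunk-within-chunk partition of steps 1-4, which may be one reason it did not match the hand-sorted result.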

This will order the data frame by "Main" and then by "-1" (renamed minus1 here). You probably should not use numbers as column headers.

dat[order(dat$Main, dat$minus1), ]

where dat is your full data frame.
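To illustrate the point about numeric headers: by default, R's table readers rewrite names such as -1 and 1 into syntactic ones (X.1 and X1). Below is a small sketch of two ways to deal with that, assuming tab-separated input. The inline text `txt` and the names minus1/plus1 are made up for the example. `colClasses = "character"` is added so that a column containing only T is not read as logical TRUE.

```r
# Hypothetical three-column excerpt, tab-separated, with numeric headers.
txt <- "-1\tMain\t1\nT\tG\tG\nT\tA\tG\nC\tG\tG\nT\tA\tG\n"

# Default (check.names = TRUE): "-1" becomes "X.1" and "1" becomes "X1".
mangled <- read.delim(text = txt, colClasses = "character")
print(names(mangled))

# Option 1: keep the original headers and index them with [[ ]].
raw <- read.delim(text = txt, colClasses = "character", check.names = FALSE)
sorted_raw <- raw[order(raw$Main, raw[["-1"]]), ]

# Option 2: rename to syntactic names so that $ works directly.
dat <- raw
names(dat) <- c("minus1", "Main", "plus1")
sorted_dat <- dat[order(dat$Main, dat$minus1), ]
print(sorted_dat)
```

Both options give the same row order; which one to use is mostly a matter of how the columns will be referred to later.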

Written 14 months ago by christopher medway

Thank you! Unfortunately, it doesn't give what I want. I guess it is more complicated than that.

Written 13 months ago by genogeno

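See if this works. test.txt contains the table from the original post as tab-separated values.

# read.csv makes the headers syntactic: -1 becomes X.1 and 1 becomes X1
df <- read.csv("test.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")
# sort rows by the focal SNP, then by the -1 column, then by the +1 column
df[order(df[, "Main"], df[, "X.1"], df[, "X1"]), ]

or

df <- read.csv("test.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")
library(dplyr)
# same ordering with dplyr
dplyr::arrange(df, Main, X.1, X1)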

output:

   > df [order(df[,"Main"], df[,"X.1"],df[,"X1"]),]
       X.7 X.6 X.5 X.4 X.3 X.2 X.1 Main X1 X2 X3 X4
    1    A   C   C   A   C   C   T    A  G  A  T  G
    2    A   C   C   A   C   C   T    A  G  A  T  G
    3    A   C   C   A   C   C   T    A  G  A  T  G
    5    A   C   C   A   C   C   T    A  G  A  T  G
    11   A   C   C   A   C   C   T    A  G  A  T  G
    12   A   C   C   A   C   C   T    A  G  A  T  G
    13   A   C   C   A   C   C   T    A  G  A  T  G
    14   A   C   C   A   C   C   T    A  G  A  T  G
    15   A   C   C   A   C   C   T    A  G  A  T  G
    16   A   C   C   A   C   C   T    A  G  A  T  G
    17   A   C   C   A   C   C   T    A  G  A  T  G
    18   A   C   C   A   C   C   T    A  G  A  T  G
    19   A   C   C   A   C   C   T    A  G  A  T  G
    20   A   C   C   A   C   C   T    A  G  A  T  G
    21   A   C   C   A   C   C   T    A  G  A  T  G
    23   A   C   C   A   T   C   T    A  G  A  T  G
    24   A   C   C   A   C   C   T    A  G  A  T  G
    25   A   C   C   A   C   C   T    A  G  A  T  G
    27   A   C   C   A   C   C   T    A  G  A  T  G
    28   A   C   C   A   C   C   T    A  G  A  T  G
    29   A   C   C   A   C   C   T    A  G  A  T  G
    30   A   C   C   A   C   C   T    A  G  A  T  G
    31   A   C   C   A   C   C   T    A  G  A  T  G
    32   A   C   C   A   C   C   T    A  G  A  T  G
    33   A   C   C   A   C   C   T    A  G  A  T  G
    34   A   C   C   A   C   C   T    A  G  A  T  G
    4    A   T   C   G   C   T   C    G  G  G  T  G
    26   A   C   C   G   C   T   T    G  A  G  C  T
    6    G   C   T   G   C   T   T    G  G  G  T  G
    7    A   C   C   A   C   C   T    G  G  A  T  G
    8    G   C   T   G   C   T   T    G  G  G  T  G
    9    A   C   C   A   C   C   T    G  G  A  T  G
    10   A   C   C   A   C   C   T    G  G  A  T  G
    22   A   C   C   A   C   C   T    G  G  G  T  G
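In the output above the rows fall into four chunks: 26 rows with A/T/G at Main/-1/+1, then single rows with G/C/G and G/T/A, then 6 rows with G/T/G. A quick way to check such chunk sizes after sorting is sketched below, on a small made-up data frame that reuses the X.1/Main/X1 column names. The names `key`, `chunks`, and `chunk_sizes` are illustrative.

```r
# Small stand-in for the sorted table; the values are illustrative only.
df <- data.frame(
  X.1  = c("T", "C", "T", "T", "T", "T"),
  Main = c("A", "G", "G", "G", "A", "G"),
  X1   = c("G", "G", "A", "G", "G", "G"),
  stringsAsFactors = FALSE
)
sorted <- df[order(df$Main, df$X.1, df$X1), ]

# Label every row with its Main/-1/+1 combination; after sorting, equal
# labels are adjacent, so run-length encoding yields one run per chunk.
key <- paste(sorted$Main, sorted$X.1, sorted$X1, sep = "/")
chunks <- rle(key)
chunk_sizes <- data.frame(chunk = chunks$values, rows = chunks$lengths)
print(chunk_sizes)
```

If a label appeared in more than one run, the rows would not be fully partitioned by those three columns.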
Written 12 months ago by cpad0112