Question

Standardization of DNA data

0

8.1 years ago

manay ▴ 10

Hi, when my data set is coded as 0 and 1, I can standardize it easily as follows (each row is a chromosome and each column is a SNP):

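# data: numeric 0/1 matrix; rows are chromosomes, columns are SNPs
p <- colMeans(data, na.rm = TRUE)   # frequency of the allele coded 1, per SNP
nchr <- nrow(data)
nsnp <- ncol(data)
mat <- matrix(NA_real_, nrow = nchr, ncol = nsnp)
for (i in seq_len(nsnp)) {
  # center by the allele frequency and scale by the binomial standard deviation
  mat[, i] <- (data[, i] - p[i]) / sqrt(p[i] * (1 - p[i]))
}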

However, I now have a data set coded as A, T, G and C. Is it possible to standardize this data in the same way?

Here is a small part of my data:

NA06989_A   A   A   G   G   C
NA06989_B   A   A   G   G   C
NA10850_A   G   A   G   G   C
NA10850_B   G   G   A   G   C
NA06984_A   G   G   A   G   C
NA06984_B   A   A   G   G   C
NA11917_A   G   A   A   T   C
NA11917_B   A   A   G   G   C
NA12282_A   A   A   G   G   C
NA12282_B   G   G   A   G   C

R SNP gene genome • 1.3k views

score 1 · Answer 1 · 2017-06-20

1

8.1 years ago

Fabio Marroni ★ 3.0k

The "0" and "1" SNP genotypes are just arbitrary labels for the two alleles. So yes, you can standardize the data you showed by (arbitrarily) recoding one allele at each SNP as 0 and the other as 1. However, I am not sure that standardizing genotype data this way is a good idea.
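To make the recoding concrete, here is a minimal sketch in R using the ten rows posted in the question (sample IDs omitted). The helper `recode01` is a hypothetical name of my own, not a function from any package, and it assumes every SNP is biallelic: the first allele seen in a column becomes 0 and the other becomes 1. The standardization formula is the one from the question; monomorphic SNPs (the last column here) have zero variance, so I set them to NA rather than divide by zero.

```r
# Toy haplotypes copied from the question: rows are chromosomes, columns are SNPs.
hap <- matrix(c(
  "A", "A", "G", "G", "C",
  "A", "A", "G", "G", "C",
  "G", "A", "G", "G", "C",
  "G", "G", "A", "G", "C",
  "G", "G", "A", "G", "C",
  "A", "A", "G", "G", "C",
  "G", "A", "A", "T", "C",
  "A", "A", "G", "G", "C",
  "A", "A", "G", "G", "C",
  "G", "G", "A", "G", "C"
), ncol = 5, byrow = TRUE)

# Hypothetical helper: the first allele observed in a column is coded 0,
# the other allele is coded 1. Assumes at most two alleles per SNP.
recode01 <- function(x) {
  alleles <- unique(x[!is.na(x)])
  if (length(alleles) > 2) stop("more than two alleles at this SNP")
  as.numeric(x != alleles[1])
}

# Recode every column (SNP) to 0/1.
geno <- apply(hap, 2, recode01)

# Standardize as in the question: (x - p) / sqrt(p * (1 - p)).
p <- colMeans(geno, na.rm = TRUE)
sdev <- sqrt(p * (1 - p))
sdev[sdev == 0] <- NA   # monomorphic SNPs cannot be standardized
std <- sweep(sweep(geno, 2, p, "-"), 2, sdev, "/")
```

Which allele gets the 0 only affects the sign of the standardized column: swapping the labels turns p into 1 - p and flips every value, but leaves the scale unchanged.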
