R not reading numbers from file
5.7 years ago
nkinney06 ▴ 90

I am trying to read the following file into R variables:

5060803636482931868     83.3366666      0.0     0.0     0.0
15695800775901642752    0.0     81.0061726043   38.1837661841   0.0
12047011437325700351    0.0     38.1837661841   22.2036033112   7.07106781187
2610937148294873212     0.0     0.0     7.07106781187   30.1330383466


The first column holds unique keys and the rest is a 4x4 matrix.

I tried reading it with the following:

fileContents <- as.matrix(read.table('./distanceMatrix.txt', header=FALSE, sep = "\t",strip.white=TRUE))
nameKey <- fileContents[,1]
distMatrix <- fileContents[,-1]


I get this result:

> nameKey

[1]  5060803636482931712 15695800775901642752 12047011437325701120  2610937148294873088

> distMatrix
V2       V3        V4        V5
[1,] 83.33667  0.00000  0.000000  0.000000
[2,]  0.00000 81.00617 38.183766  0.000000
[3,]  0.00000 38.18377 22.203603  7.071068
[4,]  0.00000  0.00000  7.071068 30.133038


Notice how the keys don't match the file. I need to be sure everything is read in correctly and that I can write it back out correctly. What am I doing wrong?

1

Why not:

fileContents <- as.matrix( read.table( './distanceMatrix.txt', header=FALSE,
sep = "\t", strip.white=TRUE,
row.names = 1 ) )

1

How is this a bioinformatics question?

1
5.7 years ago
h.mon 34k

A couple of suggestions:

1) read everything as character and later convert to number

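# Read every column as character so the large keys are not parsed as numbers
fileContents <- as.matrix(read.table('distanceMatrix.txt', header=FALSE,
row.names = 1, sep = "\t", strip.white=TRUE,
colClasses = "character" ) )
# The keys are kept verbatim as row names
nameKey <- rownames(fileContents)
# Convert the remaining values to numeric and restore the matrix shape
distMatrix <- as.numeric(fileContents )
dim(distMatrix) <- dim(fileContents)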


I don't know how the file is being created, but maybe:

2) you can prepend a character to the first element of every row before reading the file into R - this can be accomplished in place with sed, without creating a copy of the file.

3) split the file into one file with row names, and other with the matrix numbers.
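
For what it's worth, here is a minimal, self-contained sketch of suggestion 1 that also checks the round trip back to disk. The temporary files and the variable names `raw`, `out` and `keys_back` are made up for illustration; the data is the sample from the question. It reads every field as text, converts only the matrix part to double, and writes the keys back out as row names, so they are never treated as numbers.

```r
# Write the sample data from the question to a temporary tab-separated file.
lines <- c(
  "5060803636482931868\t83.3366666\t0.0\t0.0\t0.0",
  "15695800775901642752\t0.0\t81.0061726043\t38.1837661841\t0.0",
  "12047011437325700351\t0.0\t38.1837661841\t22.2036033112\t7.07106781187",
  "2610937148294873212\t0.0\t0.0\t7.07106781187\t30.1330383466"
)
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Read every field as text so the keys are never parsed as numbers.
raw <- read.table(path, header = FALSE, sep = "\t", strip.white = TRUE,
                  colClasses = "character")

# First column: keys, kept as character strings.
nameKey <- raw[[1]]

# Remaining columns: convert the character matrix to double in place,
# which keeps its dimensions.
distMatrix <- as.matrix(raw[-1])
storage.mode(distMatrix) <- "double"
rownames(distMatrix) <- nameKey

# Write the matrix back out; the keys go out as row names (plain text).
out <- tempfile(fileext = ".txt")
write.table(distMatrix, out, sep = "\t", quote = FALSE, col.names = FALSE)

# Recover the first field of each written line to compare with the input keys.
keys_back <- sub("\t.*$", "", readLines(out))
identical(keys_back, nameKey)
```

Because the keys only ever exist as strings here, the approach does not depend on how many digits they have.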

2
5.7 years ago

To expand a bit on what h.mon correctly wrote, your issue is that you're not treating row names as row names, but rather converting them to numbers. Since they're HUGE numbers, they're presumably getting stored as floats or doubles, which means you're not going to get the exact value back. Of course, you don't need that as a value, just a name, so treat them accordingly (i.e., do what h.mon showed).
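
A small sketch of the precision problem, in case it helps. It assumes R's usual double-precision numerics (53 bits of mantissa); `key_text` and `key_num` are just illustrative names, and the key is the first one from the question.

```r
# Above 2^53, consecutive integers can no longer all be represented as doubles.
limit <- 2^53
limit + 1 == limit      # TRUE: adding 1 is lost to rounding

# The first key from the question has 19 digits, far more than a double can hold exactly.
key_text <- "5060803636482931868"
key_num <- as.numeric(key_text)

# Printing the stored double with no decimals shows it differs from the text in the file.
sprintf("%.0f", key_num)
identical(sprintf("%.0f", key_num), key_text)   # FALSE
```

This is why keeping the keys as character strings (or factors) sidesteps the problem entirely.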

0
5.7 years ago
nkinney06 ▴ 90

That makes sense, but I appear to have the same problem when I run:

similarityMatrix <- as.matrix( read.table( './testMatrix.txt', header=FALSE, sep = "\t", strip.white=TRUE, row.names = 1 ) )


I get:

> similarityMatrix
V2       V3        V4        V5
5060803636482931712  83.33667  0.00000  0.000000  0.000000
15695800775901642752  0.00000 81.00617 38.183766  0.000000
12047011437325701120  0.00000 38.18377 22.203603  7.071068
2610937148294873088   0.00000  0.00000  7.071068 30.133038


and the matrix (in particular the row names) should be:

5060803636482931868 83.3366666  0.0 0.0 0.0
15695800775901642752    0.0 81.0061726043   38.1837661841   0.0
12047011437325700351    0.0 38.1837661841   22.2036033112   7.07106781187
2610937148294873212 0.0 0.0 7.0710678118    30.1330383466


Is it possible to read the file twice, first as alphanumeric for column one only?

0

Add colClasses=c("factor", NA, NA, NA, NA) to the options.

0

I'll try, but in reality my matrix will be rather large, and the number of NAs would have to be assigned dynamically.

0

The readr package will likely help; it's better at not changing column names by default.

0

I guess you run the R code only after the file has been created, so you can probably get the number of columns beforehand and do:

colClasses = c("character", rep("numeric",4) )


or

colClasses=c("factor", rep(NA, 4) )


Or you could use scan or readLines to read just one line, get the number of columns, and then use that to set rep(NA, columns).
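
A rough sketch of that last idea, assuming a tab-separated file with no header. The temporary file and the names `first`, `nCols`, `classes` and `tab` are only for illustration. It reads the first line, counts its fields, and builds colClasses from that count, so nothing has to be hard-coded for a larger matrix.

```r
# Two rows of the sample data from the question, written to a temporary file.
lines <- c(
  "5060803636482931868\t83.3366666\t0.0\t0.0\t0.0",
  "15695800775901642752\t0.0\t81.0061726043\t38.1837661841\t0.0"
)
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Read only the first line and count its tab-separated fields.
first <- readLines(path, n = 1)
nCols <- length(strsplit(first, "\t", fixed = TRUE)[[1]])

# First column as character (the keys), every other column as numeric.
classes <- c("character", rep("numeric", nCols - 1))

tab <- read.table(path, header = FALSE, sep = "\t", strip.white = TRUE,
                  colClasses = classes)
tab$V1   # keys, still exactly as written in the file
```

utils::count.fields(path, sep = "\t") is another way to get the field count if you would rather not split the line yourself.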