Question: how to avoid R automatically converting strings to numbers
moushengxu wrote 23 months ago:

Suppose I have a tab-delimited file like the following:

chr1 234 3.24
chr1 345 2.11
chr2 123 8.99
...
chrX 879 0.24
...

Then in R, I use "read.table" to read the file into a variable "d". The head of "d" looks normal:

chr1 234 3.24
chr1 345 2.11
chr2 123 8.99
...

But when I use "cbind(d[,1], d[,2], d[,3])" and assign it to another variable, say, "b", then "b" looks like

1 234 3.24
1 345 2.11
2 123 8.99
...
23 879 0.24 # "chrX" is automatically converted to "23"
...

That is odd. It looks like "cbind" treats characters as factors and uses the factor numbers (e.g. 1, 2, ..., 23) to replace the strings (chr1, chr2, ..., chrX).
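
The behaviour described above can be reproduced with a small sketch. It assumes the file was read with stringsAsFactors = TRUE, which was the read.table default before R 4.0.0. The inline text stands in for the real file and is only illustrative.

```r
# Inline stand-in for the tab-delimited file (illustrative data only).
txt <- "chr1\t234\t3.24\nchr1\t345\t2.11\nchr2\t123\t8.99\nchrX\t879\t0.24"

# stringsAsFactors = TRUE mimics the pre-4.0.0 default of read.table.
d <- read.table(text = txt, sep = "\t", stringsAsFactors = TRUE)

# cbind() on plain vectors builds a matrix; the factor column is
# replaced by its integer codes, so the chromosome names are lost.
b <- cbind(d[, 1], d[, 2], d[, 3])

is.factor(d[, 1])  # TRUE
is.numeric(b)      # TRUE: every cell is now a number
levels(d[, 1])     # the labels that the codes in b[, 1] refer to
```

On R 4.0.0 or later, the same read.table call without stringsAsFactors = TRUE returns a character column instead, and cbind then produces an all-character matrix.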

How can I avoid this?

I know this might not be the best forum to ask the question, but since you guys are so great, I believe some of you have the answer!

Tags: R, software error

Try using this site to answer your question:

http://rseek.org/

written 23 months ago by Medhat

"since you guys are so great"

You don't want to do the necessary "re"search on the web?

modified 23 months ago • written 23 months ago by genomax

I certainly did but found no answers. Weird.

written 23 months ago by moushengxu

So the problem is that, while reading the table, strings are read As Factors? Is that True?

written 23 months ago by WouterDeCoster

The problem is that "cbind" automatically converts strings to factor numbers, e.g. "chr1" => "1", "chrM" => "23", "chrX" => "24", "chrY" => "25".

modified 23 months ago • written 23 months ago by moushengxu

No, the problem is in read.table.

written 23 months ago by WouterDeCoster

"strings are read As Factors? Is that True?"

;-)

written 23 months ago by WouterDeCoster

ddiez (Japan) wrote 23 months ago:

Although the point in the comments about the stringsAsFactors option is TRUE :-), the real problem in your specific case is that cbind is coercing your data into a matrix. In R, a matrix can, by definition, only hold a single data type: all logical, all integer, all numeric or all character. A factor cannot be stored as such, so cbind keeps only its integer codes. See the following code examples:

# stringsAsFactors = TRUE
# Wrong: cbind drops the factor class and keeps only the integer codes.
d <- data.frame(
  chr = c("A", "B"),
  start = c(1, 2),
  stringsAsFactors = TRUE
)
cbind(d$chr, d$start)
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2

# stringsAsFactors = FALSE
# Wrong: the numbers are coerced to character to match the strings.
d <- data.frame(
  chr = c("A", "B"),
  start = c(1, 2),
  stringsAsFactors = FALSE
)
cbind(d$chr, d$start)
#>      [,1] [,2]
#> [1,] "A"  "1"
#> [2,] "B"  "2"

So, if you use cbind on the individual columns, you are out of luck no matter how you set stringsAsFactors originally or whether you use readr or any other tool to read your data, because a matrix can only have one type of data and you have two. The solution is to use a data.frame, which can handle different data types:

data.frame(chr2 = d$chr, start2 = d$start)
#>   chr2 start2
#> 1    A      1
#> 2    B      2

Don't forget to set stringsAsFactors as desired.
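
As a quick check, here is a minimal sketch showing that a data.frame keeps each column's own type. The name d2 is only illustrative, and stringsAsFactors = FALSE is passed explicitly so the result does not depend on the R version.

```r
d <- data.frame(
  chr = c("A", "B"),
  start = c(1, 2),
  stringsAsFactors = FALSE
)

# Build the new object with data.frame() instead of cbind().
d2 <- data.frame(chr2 = d$chr, start2 = d$start, stringsAsFactors = FALSE)

# Inspect the class of every column: strings stay character,
# numbers stay numeric.
sapply(d2, class)
```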

EDIT:

Note that cbind is doing this because you are passing two vectors. If you pass them as data.frames, cbind treats them as such and this problem is avoided:

cbind(d[, "chr", drop = FALSE], d[, "start", drop = FALSE])
#>   chr start
#> 1   A     1
#> 2   B     2

Of course, this solution is a lot more verbose.


Good catch, I stopped reading after seeing the stringsAsFactors issue.

written 23 months ago by Devon Ryan

Thanks. I almost gave up myself because there were already a lot of good comments. I love working with R, but these nuances can be really frustrating.

written 23 months ago by ddiez

This is the best answer!

"cbind" causes a lot of problems, and using "data.frame" the way you mentioned resolved all the troublesome issues.

written 23 months ago by moushengxu

Devon Ryan (Freiburg, Germany) wrote 23 months ago:

This is a benefit of using the readr package rather than base R when reading tables: the stringsAsFactors behaviour (this is what you were looking for) is handled in a more coherent way, because readr does not convert strings to factors unless you ask it to.


Actually, I just checked: "read.table" takes "stringsAsFactors" and it worked!

Thanks!

written 23 months ago by moushengxu

Remember to "accept" the answers that solved your problem (use the check mark next to each answer). You can choose more than one.

written 23 months ago by genomax

Alex Reynolds (Seattle, WA USA) wrote 23 months ago:

The correct way -- and by correct, I mean correct: to specify with complete precision -- to solve this is to use colClasses with read.table, which coerces columns into type classes, like character, numeric, factor, etc.

For instance, in your case:

read.table(someFile, ..., colClasses=c("character", "numeric", "numeric"))

See ?read.table for more information.
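
A minimal, self-contained sketch of the colClasses approach follows. The inline text stands in for someFile and is illustrative only; with a real file you would pass the file name instead of text =.

```r
# Inline stand-in for the tab-delimited file (illustrative data only).
txt <- "chr1\t234\t3.24\nchr2\t123\t8.99\nchrX\t879\t0.24"

# One class per column, in column order.
d <- read.table(
  text = txt,
  sep = "\t",
  colClasses = c("character", "numeric", "numeric")
)

# Confirm that each column has the requested class.
sapply(d, class)
```

Because the first column is read as character rather than factor, there are no integer codes for later steps to fall back on.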


OK, this works.

However, if the columns of "someFile" vary from file to file, it would be impossible to predefine which column is "character" or "numeric". Is there a way to force "cbind"?

Thanks.

written 23 months ago by moushengxu

You can pass stringsAsFactors = FALSE to read.table so no need to specify all the column classes.

written 23 months ago by ddiez

The argument as.is = TRUE also does the trick.
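
A short sketch comparing the two options, for the case where the column layout is not known in advance. The inline text is an illustrative stand-in for a real file.

```r
# Inline stand-in for the tab-delimited file (illustrative data only).
txt <- "chr1\t234\t3.24\nchr2\t123\t8.99\nchrX\t879\t0.24"

# No column classes are specified; read.table detects them itself.
d1 <- read.table(text = txt, sep = "\t", stringsAsFactors = FALSE)
d2 <- read.table(text = txt, sep = "\t", as.is = TRUE)

# Both calls give the same result: strings stay character and
# numeric columns are still detected automatically.
identical(d1, d2)
sapply(d1, class)
```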

written 23 months ago by ddiez