Question

Why am I unable to load my data from a tab separated file into R?

1

9.2 years ago

Mo ▴ 920

Hello,

I don't know why I am getting errors during my analysis.

I uploaded an example of my data in

I use the following command in R to load my data

data <- read.delim("path to your file /example.txt", header=FALSE)

However, when I look at the data with summary, head, or other commands, it seems all right, but I cannot analyze it because I get an error mentioning "all numeric variables". For example, if you try to get the range of the example data, you will get this error.

How do you normally import microarray data in txt format (each row represents a probe and each column a sample)?

Thanks

microarray programming R • 12k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

1

Please provide the file via an external link; it takes quite long to load and to scroll all the way down your table.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by dago ★ 2.8k

0

After reading several comments from you below, it seems that reading the help for range() might be useful to you. In particular, it tells you that range works on any numeric or character objects. Data frames are not numeric though they can contain numbers. Also, range(), like almost all aggregation functions in R, will return NA when there is any NA in the data unless told to do otherwise.
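
To illustrate both points, here is a minimal sketch on a made-up data frame. The names `toy`, `id`, `M1` and `M2` are hypothetical and not taken from the posted data.

```r
# Hypothetical toy data: one text column and two numeric columns, one with an NA
toy <- data.frame(id = c("a_at", "b_at", "c_at"),
                  M1 = c(0.5, -1.2, 3.0),
                  M2 = c(NA, 2.5, -0.7),
                  stringsAsFactors = FALSE)

# range() on the whole data frame fails because of the text column
res <- try(range(toy), silent = TRUE)
failed <- inherits(res, "try-error")

# Drop the text column; an NA still propagates unless na.rm = TRUE
num <- toy[, -1]
r_na <- range(num)                # NA NA
r_ok <- range(num, na.rm = TRUE)  # -1.2  3.0
```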

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Sean Davis 26k

2

9.2 years ago

dariober 14k

I think you want:

data <- read.delim("path to your file /example.txt", header=TRUE)

Note header=TRUE. With header=FALSE, the header line is read as a data row, so every column contains text and is read as a factor by default (character in R 4.0 and later). You probably want all but the first column to be numeric.

(By the way, in the future, reporting the exact error message and the command that generated it would help.)
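
A small sketch of the difference, using two probe IDs from the example data and an inline string instead of the real file. The variable names `txt`, `no_hdr` and `with_hdr` are only for illustration.

```r
# Tiny tab-separated table with a header line (inline stand-in for the file)
txt <- "ID\tM1\tM2\n200645_at\t0.0446\t0.0744\n200690_at\t-0.0165\t0.1121\n"

# header = FALSE: the header line becomes the first data row,
# so every column contains text and none is numeric
no_hdr <- read.delim(text = txt, header = FALSE)
sapply(no_hdr, is.numeric)

# header = TRUE: the first line supplies the column names,
# and the sample columns are parsed as numbers
with_hdr <- read.delim(text = txt, header = TRUE)
sapply(with_hdr, is.numeric)
```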

Dario

ADD COMMENT • link 9.2 years ago by dariober 14k

0

Hi Dario,

Thanks for your comment, but I still get the following error when I set header to TRUE:

> range.raw <- range(example)
Error in FUN(X[[1L]], ...) : 
  only defined on a data frame with all numeric variables

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

1

9.2 years ago

Jonathan Dursi ▴ 270

read.delim is just a wrapper for read.table(); it's often easier to just use read.table() and let it infer separations, column types, etc. In particular,

df <- read.table('/path/to/example.txt.txt',header=TRUE)

works fine for me on your data.

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Jonathan Dursi ▴ 270

0

Same here, I still get the error:

range.raw <- range(example)
Error in FUN(X[[1L]], ...) : 
  only defined on a data frame with all numeric variables

What I did was to keep the probe IDs in a separate variable as follows:

rprobes <- example[,1]

then tried to get the data matrix by using the following function.

data <- data.matrix(example[,2:ncol(example)])

It seems that data.matrix changes the values of my data.
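
One likely explanation, assuming the columns of `example` were factors (which is what older R versions produce when a header line is read as data): data.matrix() replaces factors by their internal integer codes rather than by the numbers they display. A minimal sketch with a hypothetical column:

```r
# Hypothetical column as it might look when the header "M1" was read as data
f <- factor(c("M1", "0.0446", "-0.0165"))
bad <- data.frame(V2 = f)

# data.matrix() returns the integer level codes, not the original numbers
codes <- data.matrix(bad)

# Going through as.character() recovers the numbers; the header text becomes NA
vals <- suppressWarnings(as.numeric(as.character(f)))
```

If the file is read with header=TRUE so that the sample columns are already numeric, data.matrix() leaves the values unchanged.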

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

2

Well, the first column certainly doesn't hold 'all numeric variables'. Shouldn't you define row.names=1 or something like that for read.table? Those error messages more often than not tell you what the problem is.

ADD REPLY • link 9.2 years ago by 5heikki 11k

1

But your data file doesn't have all numeric variables, so of course your data frame doesn't have all numeric variables. What are you planning to do with the probe IDs?

If it's alright to just ignore them, just strip them off

df2 <- df[,-c(1)]

and proceed using df2; or pass off the data to (say) range with something like

range(df[,-c(1)])
[1] -20.091  25.652

Otherwise, if you want to (say) strip off the _at, _x_at, and _s_at suffixes and treat the rest as a number (I have no idea if that's ok; will the remaining IDs be unique? Should they be?), you can do that easily enough too:

df[,1] <- as.numeric(gsub("_.*","",as.character(df[,1])))
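
Whether the stripped IDs stay unique can be checked before overwriting the column. A rough sketch, using four probe IDs that appear in the example data plus a made-up pair (`123_at`, `123_s_at`) that is purely hypothetical:

```r
# Probe IDs as they appear in the example data
ids <- c("200645_at", "200690_at", "200691_s_at", "200692_s_at")

# Strip everything from the first underscore onwards and convert to numbers
stripped <- as.numeric(gsub("_.*", "", ids))

# anyDuplicated() returns 0 when every stripped ID is unique
n_dup <- anyDuplicated(stripped)

# Hypothetical pair that collides after stripping: returns the index of the first duplicate
clash <- anyDuplicated(gsub("_.*", "", c("123_at", "123_s_at")))
```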

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Jonathan Dursi ▴ 270

0

9.2 years ago

Ram 43k

Looks like your data might not be delimited properly (space-delimited with random spaces between columns). You might either wanna check that out or explore how to treat consecutive delimiters as one.
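
If the file really were space-delimited with runs of spaces, read.table() with its default sep="" already treats any run of whitespace as a single delimiter. A small sketch on made-up lines (the content of `txt` is hypothetical):

```r
# Hypothetical lines with several spaces between columns
txt <- "ID   M1    M2\np1   0.5   -1.2\np2   3.0    2.5\n"

# sep = "" (the read.table default): one or more whitespace characters separate fields
ws <- read.table(text = txt, header = TRUE, sep = "")
dim(ws)

# sep = "\t" on the same text finds no tabs, so each line ends up in a single column
tab <- read.table(text = txt, header = TRUE, sep = "\t")
ncol(tab)
```

Note that read.delim() sets sep="\t", so it only splits correctly when the file is genuinely tab-separated.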

ADD COMMENT • link 2.0 years ago by Ram 43k

0

If I run cat -A <file>, it looks well formatted.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by dago ★ 2.8k

0

So, all tabs?

ADD REPLY • link 2.0 years ago by Ram 43k

0

It looks like that

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by dago ★ 2.8k

0

Hi Ram, that might be the problem. I am working on it, but so far I could not find where the problem is.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

1

You said this gist is just the sample, right? If you could give us the actual file, we can figure out what the problem is.

ADD REPLY • link 2.0 years ago by Ram 43k

0

9.2 years ago

dago ★ 2.8k

You might want to try to set

read.delim(..., stringsAsFactors = FALSE)
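
For reference, read.delim() forwards extra arguments such as stringsAsFactors to read.table(), so the option is accepted. On its own, though, it only turns factors into character strings; it does not make text columns numeric. A sketch on an inline stand-in for the file (`txt` and `d` are illustrative names):

```r
# Inline stand-in for the file, with a header line
txt <- "ID\tM1\n200645_at\t0.0446\n200690_at\t-0.0165\n"

# The argument is passed through to read.table()
d <- read.delim(text = txt, header = FALSE, stringsAsFactors = FALSE)

# With header = FALSE both columns are character: still not numeric
sapply(d, class)
```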

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by dago ★ 2.8k

0

I think stringsAsFactors = FALSE is for data.table and not read.delim

However, this also does not help

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0

9.2 years ago

Manvendra Singh ★ 2.2k

read.table("file", header=TRUE, stringsAsFactors=FALSE, sep="\t", dec=".")

should work

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Manvendra Singh ★ 2.2k

0

Thanks for your comment, but unfortunately it does not work.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0

Dude, then try importing your file into an Excel sheet. If it imports, copy and paste from Excel into Notepad and then read it into R.

ADD REPLY • link 9.2 years ago by Manvendra Singh ★ 2.2k

0

I would not do that. Imagine you have over 40000 probes and 1500 samples; would you personally copy and paste that into Notepad?

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0

Yes, at least the first 10 rows, to see what exactly the problem is.

ADD REPLY • link 9.2 years ago by Manvendra Singh ★ 2.2k

0

Excel is not what I'd recommend, but bulk copy-paste is super easy. Ctrl+Shift+direction will copy to the last row or column of the range in use in the direction you select.

ADD REPLY • link 9.2 years ago by Ram 43k

0

For sure it is easy, if and only if your data is small. Copying and pasting your files over and over is definitely not a good approach, because it invites systematic errors!

I personally avoid such things, but if you are working with roughly 20 samples and 100 variables, it would be convenient to do that.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0

I'd probably use it to ensure data type consistency. I avoid Excel as much as possible. UNIX is much much better with large files.

ADD REPLY • link 2.0 years ago by Ram 43k

0

9.2 years ago

TriS ★ 4.7k

This worked for me

df <- read.table("test.txt", header=T, row.names = 1)
df <- apply(df, 2, function(x) sapply(x, as.numeric))
range(df)

> range(df)
-20.091  25.652

The key is to convert every value to numeric with as.numeric (here via apply and sapply).
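
A possibly simpler variant, sketched on a made-up table (`txt`, `p1`, `p2` are hypothetical): once the probe IDs are row names and all columns are numeric, as.matrix() is enough. One caveat that applies to any as.numeric conversion: on a factor it returns the level codes, so convert via as.character() first.

```r
# Hypothetical two-probe table, tab-separated
txt <- "ID\tM1\tM2\np1\t0.5\t-1.2\np2\t3.0\t2.5\n"
df <- read.delim(text = txt, header = TRUE, row.names = 1)

# With all-numeric columns, as.matrix() already yields a numeric matrix
m <- as.matrix(df)
range(m)

# Caveat: as.numeric() on a factor gives level codes; go through as.character()
f <- factor(c("10.5", "2.25"))
fixed <- as.numeric(as.character(f))
```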

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by TriS ★ 4.7k

1

Thanks for your comment, however, it does not work for me.

When I ran the following command on my data, I got an error, which suggests that some fields are blank:

df <- read.table("test.txt", header=T, row.names = 1)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 1085 elements

I solved the problem by adding fill =TRUE

However, I am sure I have 1500 columns, but the data came out with 1800 columns.

Then I thought the other command might solve the problem:

df <- apply(df, 2, function(x) sapply(x, as.numeric))

But it did not, and of course the range was NA NA. Any clue where the problem might be?
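
One way to track down this kind of problem is to count the fields on every line with count.fields() instead of hiding the mismatch with fill=TRUE. The sketch below writes a small hypothetical file to a temporary path; with real data, `path` would point at the actual file.

```r
# Hypothetical file in which the last row is one field short
path <- tempfile(fileext = ".txt")
writeLines(c("ID\tM1\tM2", "p1\t0.5\t-1.2", "p2\t3.0"), path)

# Number of tab-separated fields on each line
n <- count.fields(path, sep = "\t", quote = "", comment.char = "")
table(n)

# Line numbers whose field count differs from the most common one
expected <- as.integer(names(which.max(table(n))))
odd_lines <- which(n != expected)
```

If the counts are uneven only with the default whitespace separator but consistent with sep="\t", the file is tab-separated and contains empty cells or spaces inside fields.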

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0

Add sep="\t" to the read.table() call, or use read.delim(), but it looks like I am running a little late and you already got an answer :).

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by TriS ★ 4.7k

0

Hello TriS,

Thanks for your comment. Yes, I have got the answer. My mistake was setting header=FALSE, and I also did not use row.names=1 when importing my file with the read.delim function.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

Ram · Accepted Answer · 2015-02-08

3

9.2 years ago

Michael 54k

Works perfectly for me:

> dat <- read.delim(file="Downloads/gist2c69ab500bfa94d0268a-ac4cd3d5b0d0764c2faae0e3fb0db8a39d75bb22/example.txt", row.names=1)
# your mistake was to set header=FALSE, and to omit 
# row.names=1 

> head(dat)
                 M1      M2      M3      M4      M5      M6      M7      M8      M9     M10     M11     M12
200645_at    0.0446  0.0744 -0.0340  0.0173  0.2280  0.0070 -0.0250  0.0644 -0.0253 -0.1230 -0.6251  0.0210
200690_at   -0.0165  0.1121 -0.0959  0.0000 -0.4595 -0.0282 -0.1617 -0.0482 -0.2611  0.0223 -0.6129  0.1961
200691_s_at  0.0554 -0.0689 -0.0852  0.0702  0.0823  0.0361 -0.0306 -0.0076 -0.0340 -0.0198 -0.1823 -0.0681
200692_s_at  0.0000 -0.0505 -0.0508 -0.0159 -0.3041 -0.0684 -0.0644 -0.0175  0.0503  0.0546 -0.2141 -0.0216
200693_at    0.0608  0.0601  0.0115  0.0744 -0.0232 -0.1095 -0.0416 -0.0499 -0.0515  0.0303 -0.1153  0.0824
200694_s_at  0.0424  0.0957  0.0758 -0.0387 -0.0517 -0.0207  0.0328 -0.1392  0.0140 -0.1476  0.1382  0.0113
                M13     M14     M15     M16     M17     M18
200645_at    0.1095  0.1527  0.0261 -0.2107 -0.0196 -0.2316
200690_at    0.2119  0.0122 -0.5495  0.1518 -0.2409  0.1610
200691_s_at  0.1219 -0.1615 -0.0729 -0.0696  0.0042  0.1239
200692_s_at  0.0440 -0.0811  0.0964  0.0211 -0.0325  0.1810
200693_at   -0.0036  0.0575  0.0427  0.1104 -0.0216  0.0278
200694_s_at  0.2247  0.1489  0.0196  0.0883 -0.1848  0.1989

> range(dat)
[1] -20.091  25.652

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Michael 54k

2

Thanks Michael for such a valuable comment, sharp and to the point!

That is certainly the answer I was looking for. I used something like the code below and it works just fine!

mydata <- read.delim(file="path to the data.txt", header=TRUE, row.names=1)

str(mydata) # to see the structure of my data
head(mydata, n=1) # to check the first line of my data
tail(mydata, n=1) # to check the last line of my data
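
A further check that could be added after the import, sketched here on an inline two-probe stand-in for the real file: stop early if any sample column was not parsed as numeric, then confirm the dimensions and the range.

```r
# Inline stand-in for the data file (two probes, two samples)
txt <- "ID\tM1\tM2\n200645_at\t0.0446\t0.0744\n200690_at\t-0.0165\t0.1121\n"
mydata <- read.delim(text = txt, header = TRUE, row.names = 1)

# Fail immediately if any column is not numeric
stopifnot(all(sapply(mydata, is.numeric)))

dim(mydata)    # probes x samples
range(mydata)  # works because every column is numeric
```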

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0

OP pasted >1500 lines and called it a sample, actually meant sample - the original is 40K lines. I think somewhere down the line the datatype gets messed up.

ADD REPLY • link 2.0 years ago by Ram 43k

0

Well, then I think he would be wasting our time with incomplete information....

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Michael 54k

1

I am not wasting anybody's time! A portion of the data, which is not publicly available, represents the entire data structure!

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0

Seemingly, if the code works, your data was representative, and I take it back :D In general it is tricky to debug given an incomplete reference dataset. I understand that some part of your data is private, however this can end up in 'works for me' -> 'doesn't work' ... cycles, where each person is talking about a different data-set. It is very important to help the people trying to help with as much information as possible, or people might get frustrated.

ADD REPLY • link 9.2 years ago by Michael 54k