how to read a specific CSV file in R
6.4 years ago
Mo ▴ 920

Hello,

Since I could not really find a solution for my previous questions related to protein and complexes. I am trying to work on i

Error in read.table(file = file, header = header, sep = sep, quote = quote, :
  more columns than column names


Does anyone know how many columns I should expect from this input, and how to fix this problem?

r proteomics • 14k views

From what I saw in the linked file, it is probably not a comma-separated file but a ";"-separated one. Try read.table with sep=";", quote="" and comment.char="". Even if you think it is a CSV file, still add quote="" and comment.char="", since quote and comment characters in the data are the main cause of "more columns than column names".
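A minimal sketch of that suggestion, using a small made-up file in place of the real download (the column names and rows are illustrative, not taken from allComplexes.csv):

```r
# Hypothetical sample standing in for the real file: semicolon-separated,
# with a comma, an apostrophe and a '#' inside one of the fields.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id;name;comment",
  "1;BCL6-HDAC4 complex;binds 5' region, see #3",
  "2;Another complex;plain text"
), tmp)

# quote = "" stops the apostrophe from opening a quoted string;
# comment.char = "" stops '#' from cutting the line short.
dat <- read.table(tmp, header = TRUE, sep = ";", quote = "",
                  comment.char = "", stringsAsFactors = FALSE)
str(dat)
```

With the default quote and comment settings, the apostrophe and the '#' in the second line would be interpreted rather than read as data, which is how field counts can drift away from the header.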

@Zhilong Jia your comments pushed me toward a good answer. When I remove the header from the file and read it with data.table, the loaded file seems to be all right; however, I don't know why it does not work when I read it with the header. I liked your answer anyway, thanks.

Check your .csv file in a text editor (gedit/notepad etc) to see if you have whitespace at the end of lines, or a missing header. That's often the problem in my experience.

@Daniel the header should be fine, because it comes from a database. The problem is with the rest: I don't know which section belongs to which header.

Your data uses a semicolon separator, and the actual data has lots of commas in it. Change your input line to this and it should work:

dat = read.table("path to the data/allComplexes.csv", sep=";", header = TRUE)

@Daniel thanks, a good answer was given below, which I accepted.

Your header is actually not comma separated. Run head -1 allComplexes.csv and check. If you set header=FALSE, it will load the data.
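The same header check can be done from within R. This is only a sketch on a made-up header line (the names and the trailing ';' are assumptions for illustration):

```r
# Hypothetical header line with a trailing ';' (names are illustrative)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Complex id;Complex name;organism;",
             "1;BCL6-HDAC4 complex;Human"), tmp)

header <- readLines(tmp, n = 1)   # same idea as `head -1` in the shell

# Count candidate separators in the header line.
n_semicolons <- lengths(regmatches(header, gregexpr(";", header, fixed = TRUE)))
n_commas     <- lengths(regmatches(header, gregexpr(",", header, fixed = TRUE)))

# A separator at the very end of the line implies one extra, empty column.
ends_with_sep <- endsWith(header, ";")

print(header)
print(c(semicolons = n_semicolons, commas = n_commas))
print(ends_with_sep)
```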

@venu If you turn off the header, it will load, but the result does not make any sense. Please look at it: it mixes the IDs with the names.

I've only checked the header; I didn't load the file into R. Edited. Open it in Google Sheets and you'll see how many fields are actually comma separated. Edit the whole file with some bash commands, then load it into R.

6.4 years ago
# Open the file in a text editor and remove the last ';' from the header line (it makes the header 13 columns while the data has 12, or else the information for that column is missing entirely).

4  11  12  13
4   1 444   1 ​


The above tells you that you have 2 uneven rows, one with 11 and the other with 13 columns.
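The command that produced the table above did not survive in this post. One way to get a per-line field count in base R is count.fields; whether the answer used it is an assumption, and the file below is a made-up stand-in:

```r
# Hypothetical sample with one short row and one long row
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;name;organism;go",
             "1;A;Human;x",
             "2;B;Mouse",
             "3;C;Human;y;extra",
             "4;D;Rat;z"), tmp)

# Number of ';'-separated fields on each line, header included
n_fields <- count.fields(tmp, sep = ";", quote = "", comment.char = "")

table(n_fields)        # how many lines have each field count
which(n_fields != 4)   # line numbers (header is line 1) that deviate
```

Tabulating the counts shows at a glance whether every line agrees with the header, and which() points to the lines to inspect in a text editor.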

# Then read the file in normally; this should load it and sort out all the problems. Read about the parameters I used here.
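The read command itself is missing from this post, so the following is only a guess at its shape. A later comment singles out fill=TRUE; the other arguments and the sample file are assumptions:

```r
# Hypothetical sample: header without the trailing ';', one short row
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;name;organism",
             "1;A;Human",
             "2;B",
             "3;C;Mouse"), tmp)

# fill = TRUE pads rows that have fewer fields than the widest line seen
# among the first five, instead of stopping with an error.
dat <- read.table(tmp, header = TRUE, sep = ";", quote = "",
                  comment.char = "", fill = TRUE, stringsAsFactors = FALSE)
print(dat)
```

Note that read.table decides the number of columns from the first five lines when fill = TRUE, so a longer row further down a real file would not widen the table.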


Cheers

@Sukhdeep Singh nice, fill=TRUE is the key!

@Sukhdeep Singh

Would you mind letting me know what is different between your output and my solution? Yours is obviously more elegant, but being new to R I'd really like to know if my solution does not work, and why that would be.

They look similar in RStudio. Yours has some "s in empty fields, but I'm having trouble seeing anything else.

Thank you!

Hi Carlos, your solution is not wrong, it's fine. Just don't use an external package if you don't need to. Secondly, the problem was an extra semicolon (which might or might not be part of the header); scrutinizing the file gave me the hint, so we don't have to do all the extra work of renaming columns etc.

Excellent, thank you very much.

6.4 years ago
cbio ▴ 450
install.packages('data.table')
library('data.table')


This will produce an error and remove the first line of the file (your headers), but the data should load fine.

It seems there are not enough headers, so you will have to add them manually, either in a text editor or in R with the names function once you've read in the file. It shouldn't be too bad since there aren't many columns (though I agree this is frustrating); otherwise it doesn't really matter.

I've looked through the data a bit, and I agree with @venu that the header is not comma separated. It seems to be separated by semi-colons. Some of the columns have commas in them which is throwing off the read.csv function.

Give the above a try and then do something like names(dat) <- c("names", "go", "here") using either the naming from the csv file you downloaded or your own. This should work, or at least it did for me.
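For readers without data.table, a base-R sketch of the same idea: skip the header line, read the rows as plain data, then assign the names by hand. This swaps in read.table for the loading step, and the file and column names are made up for illustration:

```r
# Hypothetical sample: the header carries a stray trailing ';'
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;name;organism;",
             "1;A;Human",
             "2;B;Mouse"), tmp)

# Skip the problematic header line and read the rows as plain data.
dat <- read.table(tmp, header = FALSE, skip = 1, sep = ";", quote = "",
                  comment.char = "", stringsAsFactors = FALSE)

# Put the column names back by hand.
names(dat) <- c("id", "name", "organism")
print(dat)
```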

@Carlos Guzman Thanks for your comment, but that does not give what it should. Loading the data is easy; making sense of it is not. The header is standard because this is a database, and I must not change it (that is, add or remove anything). The rest should fit the header, not be messed up. If I remove the header, everything is messed up (not from a computing point of view, but in terms of the biological information).

We are not removing anything, are we? We are adding in the same header names. When I write my file out, it looks identical to the original. Have you tried my solution? I am able to re-create the exact file by doing my steps above and then running

names(dat) <- c("Complex id",
"Complex name",
"Synonyms",
"organism",
"subunits (unitprot id)",
"subunits (entrez id)",
"protein complex purification method",
"pubmed id",
"funcat categories",
"functional comment",
"disease comment",
"subunit comment")


I must be misunderstanding what you mean by "everything is messed up". When I run head(dat), everything seems to be in the correct column when compared to the original file.

@Carlos Guzman I'll give you a small hint so that you get the point: look at the first cell after you load it into R, 1;BCL6-HDAC4 complex;;Human;P41182. This includes 3 cells + half of another cell. Simply:

Complex id is 1
Complex name is BCL6-HDAC4 complex
Synonyms is empty
organism is Human
subunits (unitprot id) is P41182
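The breakdown above can be checked directly by splitting the quoted string on ';' (a small illustration using only the text from this comment):

```r
first_cell <- "1;BCL6-HDAC4 complex;;Human;P41182"

# Split on literal ';'; the empty Synonyms field is kept as "".
fields <- strsplit(first_cell, ";", fixed = TRUE)[[1]]
names(fields) <- c("Complex id", "Complex name", "Synonyms",
                   "organism", "subunits (unitprot id)")
print(fields)
```

Two consecutive semicolons mean an empty field, not a missing column, so the values still line up with their headers once ';' is used as the separator.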

I'm not getting the hint. Are you saying there is something wrong with the data in these columns? Are you assuming that every column MUST have data?

@Carlos Guzman Please load the data with the accepted command; then you will see how the data should look. The problem was that the format of the data is not right (an extra character was added to the file, etc.). However, the accepted answer gave the result I was looking for. Just for your knowledge, each of these columns means something in biology; once you have them separated, you can look for the answer to a question you have in mind. If they are all together, it is more difficult to find. Please look at the accepted answer, and if you still do not get it, let me know.

I went ahead and looked at both; the outputs are identical. The only difference is that the accepted answer has some "s in some of the fields to denote empty fields. They both have the exact same dimensions (2867 obs. of 12 variables).

Either way, I'm glad you found a solution that worked!