Question: read.table does not read in all rows!
Parham (Sweden) wrote, 21 months ago:

Hi, I encountered a strange issue while reading a data table from a tab-delimited txt file. If I read it with read.table, the result does not include all rows, but if I convert the file to csv and read it with read.csv, it is perfect. Does someone know what the issue is, or is it my code?

This is the file.

> test <- read.table("./Annotations/all_genes_pombase.txt",
+ header=T,
+ sep="\t",
+ row.names=1,
+ stringsAsFactors = F)
> dim(test)
[1] 4533    7
> str(test)
'data.frame':   4533 obs. of  7 variables:
 $ name        : chr  "SPAC1002.01" "pom34" "gls2" "taf11" ...
 $ chromosome  : chr  "I" "I" "I" "I" ...
 $ description : chr  "conserved fungal protein " "nucleoporin Pom34 " "glucosidase II alpha subunit Gls2 " "transcription factor TFIID complex subunit Taf11 (predicted) " ...
 $ feature_type: chr  "protein_coding" "protein_coding" "protein_coding" "protein_coding" ...
 $ strand      : int  1 1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ start       : int  1798347 1799061 1799915 1803624 1804548 1807270 1807996 1809480 1811408 1813740 ...
 $ end         : int  1799015 1800053 1803141 1804491 1806797 1807781 1809433 1811361 1813805 1815796 ...
>

Try using read.delim instead, specifying the corresponding arguments for your text file.

Written 21 months ago by steve

Thanks steve and Devon! read.delim works fine!

Written 21 months ago by Parham

Also, it would be more useful to see the actual source txt file, using something like head in the terminal.

Written 21 months ago by steve

The file is linked to in the post.

Written 21 months ago by Devon Ryan

It works fine with read.delim, for whatever that's worth.

Written 21 months ago by Devon Ryan
Santosh Anand wrote, 21 months ago:

Use the argument quote = "" inside read.table.

read.table("your_file", quote="", other.arguments)

Explanation: Your data has a single quote on line 59 (pyridoxamine 5'-phosphate oxidase (predicted)). The next single quote, which closes the one on line 59, is on line 137 (5'-hydroxyl-kinase activity...). Everything between a pair of quotes is read as a single field, and a quoted field can contain newline characters. That's why you lose the lines in between. quote = "" disables quoting altogether.
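A minimal sketch of this effect, assuming a small made-up table written to a temporary file (the ids and descriptions below are invented for illustration and are not taken from the PomBase file). The apostrophes are placed well after the first few lines, as in the original file:

```r
# Made-up 10-row table (not the PomBase file); rows g6 and g8 each
# contain one apostrophe, all other rows contain none.
desc <- rep("conserved protein", 10)
desc[6] <- "pyridoxamine 5'-phosphate oxidase"
desc[8] <- "5'-hydroxyl-kinase"
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tdescription", paste0("g", 1:10, "\t", desc)), tmp)

# Default quote = "\"'": the apostrophe in g6 opens a quoted string that
# closes at the apostrophe in g8, so the lines in between are absorbed.
with_quotes <- read.table(tmp, header = TRUE, sep = "\t",
                          stringsAsFactors = FALSE)

# quote = "" turns quoting off, so each line is read as its own row.
no_quotes <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                        stringsAsFactors = FALSE)

nrow(with_quotes)  # fewer than 10
nrow(no_quotes)    # 10
```

Comparing the two row counts against the number of lines in the file (for example with length(readLines(tmp)) - 1) is a quick way to check whether rows have been swallowed.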

This 'quoting' happens in several other places in the file as well. One way to see how many fields read.table finds in each row is to use count.fields:

num.fields = count.fields("all_genes_pombase.txt", sep="\t")

Now look at the variable num.fields: it will contain a lot of NAs. These mark the lines that fall inside a quoted string and are therefore not read correctly by read.table.
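As a rough illustration of that check, here is a sketch on a made-up file (hypothetical data, with apostrophes on file lines 7 and 9) that uses count.fields to locate the affected lines and then repeats the count with quoting disabled:

```r
# Made-up 10-row table (hypothetical data); apostrophes on file lines 7 and 9.
desc <- rep("conserved protein", 10)
desc[6] <- "pyridoxamine 5'-phosphate oxidase"
desc[8] <- "5'-hydroxyl-kinase"
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tdescription", paste0("g", 1:10, "\t", desc)), tmp)

# count.fields() has the same default quote set as read.table ("\"'").
n_default <- count.fields(tmp, sep = "\t")
which(is.na(n_default))   # file lines caught inside a quoted string

# With quoting disabled, every line reports its own field count.
n_noquote <- count.fields(tmp, sep = "\t", quote = "")
table(n_noquote)
```

The line numbers returned by which() count the header as line 1, so they can be passed straight to readLines(tmp)[...] to inspect the offending text.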

The problem doesn't arise with read.csv because read.table and read.csv have different quoting defaults, for reasons unknown to me:

read.table: quote = "\"'"
read.csv: quote = "\""
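These defaults can be checked directly from an R session. The sketch below also looks at read.delim, which shares the read.csv default and is presumably why it worked in the comments above:

```r
# Inspect the default 'quote' argument of each reader in utils.
q_table <- formals(utils::read.table)$quote  # double quote and apostrophe
q_csv   <- formals(utils::read.csv)$quote    # double quote only
q_delim <- formals(utils::read.delim)$quote  # double quote only

# Only read.table treats the apostrophe as a quoting character by default.
grepl("'", c(read.table = q_table, read.csv = q_csv, read.delim = q_delim),
      fixed = TRUE)
```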

PS: The best way to avoid the file-reading nuisances of read.table is to use fread() from the data.table package. As a side benefit, it is blazing fast for large files and it guesses the field separator automatically. See my earlier post: A: How to import huge .csv files in R studio?
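A small sketch of the fread() route, assuming the data.table package is installed (the guard skips the call otherwise) and again using a made-up file rather than the PomBase one:

```r
# Made-up 10-row table (hypothetical data, not the PomBase file).
desc <- rep("conserved protein", 10)
desc[6] <- "pyridoxamine 5'-phosphate oxidase"
desc[8] <- "5'-hydroxyl-kinase"
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tdescription", paste0("g", 1:10, "\t", desc)), tmp)

if (requireNamespace("data.table", quietly = TRUE)) {
  # No sep is given: fread() works out the separator itself. Its default
  # quote character is the double quote only, so apostrophes are plain text.
  dt <- data.table::fread(tmp, header = TRUE)
  print(dim(dt))
}
```

Note that fread() returns a data.table rather than a plain data.frame; passing data.table = FALSE gives a data.frame if existing scripts expect one.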


Glad I refreshed before posting. I had just noticed the quoting issue, but your explanation is much more detailed than mine would have been :)

Written 21 months ago by Devon Ryan (modified 21 months ago)

Thanks for the appreciation :)

Written 21 months ago by Santosh Anand

Thanks a million for the thorough explanation and troubleshooting! Very handy tips =)

Written 21 months ago by Parham

Happy that it was helpful :). read.table is one of R's worst nightmares.

Written 21 months ago by Santosh Anand

Oh I need to edit a bunch of scripts to use fread() from now on. Lovely when R trips you up like this.

Written 21 months ago by WouterDeCoster

You'd better do it sooner rather than later ;-)

Written 21 months ago by Santosh Anand