Question

Creating List Of Vectors, From Txt File With Unequal Fields Per Line

13.8 years ago

boczniak767 ▴ 880

I need to transform a tab-delimited file like this:

IPR018351 GRMZM2G458776
IPR005731 GRMZM2G047513
IPR005732 GRMZM2G087165 GRMZM2G146818 GRMZM2G427404
IPR018355 GRMZM2G082642 GRMZM2G310283 GRMZM2G406977 GRMZM5G886785

into a list of vectors in R, or into an MgsaSets object from the mgsa R package.

Here's what I have tried.

Putative solution 1.

Read my file into R: x=read.table("../tymczasowe/x",sep="\t",row.names=1,fill=T)
Transform it into a list of vectors: x_list=split(x,row(x))

My longest line is 1616 fields long, so I moved it to the first line of my original file to make read.table read it correctly. The split command caused R to terminate. I tried this procedure on a much smaller example and it worked fine.
I also tried to transform my data.frame into an MgsaSets object: annoIP=new("MgsaSets",sets=as.data.frame(t(x)))
The command appeared to succeed, but it produced one more entry than I expected (one extra gene), and I don't know what this additional entry looks like (I'm not very experienced with S4 objects).
I then tried to run the analysis: xwyn=mgsa(xprb,annoIP) (xprb is just the list of genes to analyse) and I got this error:
Error in mgsa.trampoline(o, sets[!isempty], n, alpha = alpha, beta = beta, :
Set index to high (must not exceed 'n')
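
Moving the longest line to the top should not be necessary. read.table works out the number of columns from the first five lines of input unless a longer col.names is supplied, so the column count can be given up front. Below is a minimal sketch of that idea, assuming a small copy of the sample rows written to a temporary file; the path and the V1, V2, ... column names are illustrative only and not taken from the real data.

```r
# Write a small copy of the sample rows to a temporary file (illustrative data only).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "IPR018351\tGRMZM2G458776",
  "IPR005731\tGRMZM2G047513",
  "IPR005732\tGRMZM2G087165\tGRMZM2G146818\tGRMZM2G427404",
  "IPR018355\tGRMZM2G082642\tGRMZM2G310283\tGRMZM2G406977\tGRMZM5G886785"
), path)

# count.fields() reports the number of tab-separated fields on each line,
# so its maximum is the number of columns read.table has to allocate.
n_fields <- max(count.fields(path, sep = "\t"), na.rm = TRUE)

# Passing col.names of that length fixes the table width, wherever the
# longest line sits in the file. Short rows are padded by fill = TRUE.
x <- read.table(path, sep = "\t", fill = TRUE, row.names = 1,
                col.names = paste0("V", seq_len(n_fields)),
                colClasses = "character")

dim(x)  # 4 rows, 4 gene columns (V2..V5)
```

This only addresses the reading step; the result is still a padded rectangular table, not a list of vectors.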

Putative solution 2.
I tried to build an MgsaSets object for the mgsa package directly: I generated the appropriate code from my file and pasted it into the command line. The problem is that code like this works for small files:
x=new("MgsaSets",sets=list(IPR001844=c("AC215201.3_FG005","GRMZM2G009871"
,"GRMZM2G015989")
,IPR005732=c("GRMZM2G087165","GRMZM2G146818","GRMZM2G427404")
...
,IPR018816=c("GRMZM2G072156","GRMZM2G566688")))
but it doesn't work for my big file, which is probably too big/long. I get the error message
Error: unexpected ',' in ","
after every transition to the next line of my pasted code, e.g. ,IPR023193=c("GRMZM5G877500")
Now I really don't have any idea how I can create the desired object.

r list read • 6.4k views

updated 13.8 years ago by Sean Davis 27k • written 13.8 years ago by boczniak767 ▴ 880

score 4 · Answer 1 · 2012-01-23

Have a look at this code:

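readSets <- function(fname) {
  # read the file one line at a time
  tmp = readLines(fname)
  # split each line on tab; strsplit is vectorised
  # and always returns a list, one element per line
  tmp2 = strsplit(tmp, '\t', fixed = TRUE)
  # create a list of the protein IDs by
  # removing the first member of each line
  # (lapply, unlike sapply, never simplifies the result to a
  # vector or matrix when all lines have the same length)
  tmp3 = lapply(tmp2, '[', -1)
  # name the list with the interpro IDs
  names(tmp3) = sapply(tmp2, '[', 1)
  # remove any list items with length 0
  # These were blank lines in the original file
  # (or lines with an ID but no protein IDs)
  tmp4 = tmp3[sapply(tmp3, length) > 0]
  # return result.
  return(tmp4)
}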

If test.txt is the filename, you would use the above function like so:

dat = readSets('test.txt')

And "dat" will look like:

$IPR018351
[1] "GRMZM2G458776"

$IPR005731
[1] "GRMZM2G047513"

$IPR005732
[1] "GRMZM2G087165" "GRMZM2G146818" "GRMZM2G427404"

$IPR018355
[1] "GRMZM2G082642" "GRMZM2G310283" "GRMZM2G406977" "GRMZM5G886785"
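
To check output like this on a small case before running the full file, the parsing can be tried on the four sample rows from the question. This is a self-contained sketch: read_sets_sketch is a hypothetical compact helper written for this example (same idea as readSets, not a function from any package), and the temporary file stands in for the real data.

```r
# Hypothetical compact helper: one named character vector per line.
read_sets_sketch <- function(fname) {
  # split every line on tabs; the result is a list with one element per line
  fields <- strsplit(readLines(fname), "\t", fixed = TRUE)
  # everything after the first field is a protein ID
  sets <- lapply(fields, function(f) f[-1])
  # the first field is the InterPro ID (NA for blank lines)
  names(sets) <- vapply(fields, function(f) f[1], character(1))
  # drop blank lines and IDs without any proteins
  sets[lengths(sets) > 0]
}

# The sample rows from the question, plus a trailing blank line.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "IPR018351\tGRMZM2G458776",
  "IPR005731\tGRMZM2G047513",
  "IPR005732\tGRMZM2G087165\tGRMZM2G146818\tGRMZM2G427404",
  "IPR018355\tGRMZM2G082642\tGRMZM2G310283\tGRMZM2G406977\tGRMZM5G886785",
  ""
), path)

dat <- read_sets_sketch(path)

# Sanity checks: number of proteins per set, and number of distinct proteins.
set_sizes <- lengths(dat)
n_genes <- length(unique(unlist(dat)))
print(set_sizes)
print(n_genes)
```

Counting the distinct proteins this way is a simple check against the kind of unexpected extra entry described in the question. Assuming the mgsa package is installed, a named list like this can presumably be passed as the sets argument, as in the new("MgsaSets", sets=list(...)) call from the question.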