Creating List Of Vectors, From Txt File With Unequal Fields Per Line
12.3 years ago
boczniak767 ▴ 850

I need to transform a tab-delimited file like this:

IPR018351 GRMZM2G458776
IPR005731 GRMZM2G047513
IPR005732 GRMZM2G087165 GRMZM2G146818 GRMZM2G427404
IPR018355 GRMZM2G082642 GRMZM2G310283 GRMZM2G406977 GRMZM5G886785

into a list of vectors in R, or into an MgsaSets object from the mgsa R package.

Here's what I have tried.

putative solution 1.

  1. Read my file to R x=read.table("../tymczasowe/x",sep="\t",row.names=1,fill=T)
  2. Transform it to list of vectors x_list=split(x,row(x))

I should mention that my longest line is 1616 fields long, so I moved it to the first line of my original file to make read.table read it correctly. The split command caused R to terminate. I tried this procedure on a much smaller example and it worked fine.
I also tried to transform my data.frame into an MgsaSets object: annoIP=new("MgsaSets",sets=as.data.frame(t(x)))
The command appeared to succeed, but it produced one more entry than I expected (one extra gene), and I don't know what this additional entry looks like (I'm not very experienced with S4 objects).
I then tried to run the analysis: xwyn=mgsa(xprb,annoIP) (xprb is just the list of genes to analyse) and got this error:
Error in mgsa.trampoline(o, sets[!isempty], n, alpha = alpha, beta = beta, :
Set index to high (must not exceed 'n')
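
One possible explanation for the extra entry, offered as a guess rather than a diagnosis: with fill=T the short rows are padded with empty strings (or NA), and that padding may end up being counted as a gene. Below is a minimal sketch, assuming the file layout shown above, that sizes the table with count.fields() instead of moving the longest line to the top, and strips the padding while turning each row into a vector. The temporary file and the variable names are illustrative only.

```r
# Illustrative input: the four example lines, written to a temporary file
path <- tempfile(fileext = ".txt")
writeLines(c(
  "IPR018351\tGRMZM2G458776",
  "IPR005731\tGRMZM2G047513",
  "IPR005732\tGRMZM2G087165\tGRMZM2G146818\tGRMZM2G427404",
  "IPR018355\tGRMZM2G082642\tGRMZM2G310283\tGRMZM2G406977\tGRMZM5G886785"
), path)

# Number of fields on the widest line, so read.table does not have to guess
n_col <- max(count.fields(path, sep = "\t"))

x <- read.table(path, sep = "\t", fill = TRUE, row.names = 1,
                col.names = paste0("V", seq_len(n_col)),
                colClasses = "character")

# One character vector per row, with the padding ("" or NA) removed
m <- as.matrix(x)
x_list <- lapply(seq_len(nrow(m)), function(i) {
  v <- m[i, ]
  unname(v[!is.na(v) & v != ""])
})
names(x_list) <- rownames(m)
```

Using lapply() here rather than split() or apply() keeps the result a plain list even when every row happens to have the same number of genes.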

putative solution 2.
I tried to build the MgsaSets object for the mgsa package directly: I generated the appropriate R code from the file and pasted it into the command line. Code like this works for small files: x=new("MgsaSets",sets=list(IPR001844=c("AC215201.3_FG005","GRMZM2G009871"
,"GRMZM2G015989")
,IPR005732=c("GRMZM2G087165","GRMZM2G146818","GRMZM2G427404")
...
,IPR018816=c("GRMZM2G072156","GRMZM2G566688")))
but it doesn't work for my big file, which is probably too big/long. I get error messages like
Error: unexpected ',' in "," after every transition to the next line of my pasted code, e.g. ,IPR023193=c("GRMZM5G877500")
Now I really have no idea how to create the desired list or object.
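
A possible cause, stated as an assumption: the R documentation notes that command lines entered at the console are limited to roughly 4095 bytes, so a pasted line holding a very long gene set may be cut off, after which every following line that starts with a comma fails to parse. If generated code has to be used at all, a hedged workaround is to save it to a file and run it with source(), which parses the file as a whole. The sketch below uses a plain list and only the three sets quoted above; the file name is hypothetical.

```r
# Hypothetical generated code, saved to a file instead of pasted into the console
code_file <- tempfile(fileext = ".R")
writeLines(c(
  'sets <- list(IPR001844=c("AC215201.3_FG005","GRMZM2G009871"',
  ',"GRMZM2G015989")',
  ',IPR005732=c("GRMZM2G087165","GRMZM2G146818","GRMZM2G427404")',
  ',IPR018816=c("GRMZM2G072156","GRMZM2G566688"))'
), code_file)

# source() reads and parses the file itself, so the console input limit
# does not apply and line breaks inside the expression are harmless
source(code_file)
```

The resulting sets list could then be passed as the sets argument of new("MgsaSets", ...), as in the pasted code above.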

r list read • 5.8k views
12.3 years ago

Have a look at this code:

readSets <- function(fname) {
  # read one line at a time
  tmp = readLines(fname)
  # split each line on tab (strsplit is vectorised and returns a list)
  tmp2 = strsplit(tmp, '\t', fixed = TRUE)
  # create a list of the protein IDs by
  # removing the first member of each line
  # (lapply always returns a list; sapply could simplify
  # to a matrix when all sets have the same size)
  tmp3 = lapply(tmp2, '[', -1)
  # name the list with the interpro IDs
  names(tmp3) = sapply(tmp2, '[', 1)
  # remove any list items with length 0
  # These were blank lines in the original file
  tmp4 = tmp3[sapply(tmp3, length) > 0]
  # return result.
  return(tmp4)
}

If test.txt is the filename, you would use the above function like so:

dat = readSets('test.txt')

And "dat" will look like:

$IPR018351
[1] "GRMZM2G458776"

$IPR005731
[1] "GRMZM2G047513"

$IPR005732
[1] "GRMZM2G087165" "GRMZM2G146818" "GRMZM2G427404"

$IPR018355
[1] "GRMZM2G082642" "GRMZM2G310283" "GRMZM2G406977" "GRMZM5G886785"
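
As a follow-up sketch, and only under the assumption that the mgsa package is installed, a list of this shape can be checked and then wrapped in an MgsaSets object using the same new("MgsaSets", sets=...) call shown in the question. Here dat is rebuilt by hand so the snippet runs on its own; set_sizes and annoIP are illustrative names.

```r
# Same content as the "dat" printed above, rebuilt by hand for a standalone run
dat <- list(
  IPR018351 = "GRMZM2G458776",
  IPR005731 = "GRMZM2G047513",
  IPR005732 = c("GRMZM2G087165", "GRMZM2G146818", "GRMZM2G427404"),
  IPR018355 = c("GRMZM2G082642", "GRMZM2G310283", "GRMZM2G406977", "GRMZM5G886785")
)

# Quick sanity checks before handing the list to mgsa
set_sizes <- lengths(dat)                    # number of genes per set
has_duplicate_ids <- anyDuplicated(names(dat)) > 0
has_empty_gene <- any(unlist(dat) == "")

# Only attempted when mgsa is available (assumption: constructor call as in the question)
if (requireNamespace("mgsa", quietly = TRUE)) {
  library(mgsa)
  annoIP <- new("MgsaSets", sets = dat)
}
```

With the object in place, the analysis call from the question, mgsa(xprb, annoIP), would be the next thing to try.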

Thanks, the code did the work :)
