Question: Creating List Of Vectors, From Txt File With Unequal Fields Per Line
boczniak767 wrote, 9.1 years ago:

I need to transform a tab-delimited file like this:

IPR018351 GRMZM2G458776
IPR005731 GRMZM2G047513
IPR005732 GRMZM2G087165 GRMZM2G146818 GRMZM2G427404
IPR018355 GRMZM2G082642 GRMZM2G310283 GRMZM2G406977 GRMZM5G886785

into a list of vectors in R, or into an MgsaSets object from the mgsa R package.

Here's what I have tried.

Putative solution 1.

  1. Read my file into R: x=read.table("../tymczasowe/x",sep="\t",row.names=1,fill=T)
  2. Transform it into a list of vectors: x_list=split(x,row(x))

My longest line is 1616 fields long, so I moved it to the first line of my original file so that read.table would read it correctly. The split command caused R to terminate. I tried this procedure on a much smaller example and it worked fine.
I also tried to transform my data.frame into an MgsaSets object: annoIP=new("MgsaSets",
The command appeared to succeed, but it produced one more entry than I expected (one extra gene), and I don't know what this additional entry looks like (I'm not very experienced with S4 objects).
I then tried to run the analysis: xwyn=mgsa(xprb,annoIP) (xprb is just the list of genes to analyse) and got this error:
Error in mgsa.trampoline(o, sets[!isempty], n, alpha = alpha, beta = beta, :
Set index to high (must not exceed 'n')
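
A note on step 1: read.table decides the number of columns from the first five lines of input unless col.names is longer, which is why the longest line had to be moved to the top. The sketch below shows one way that might avoid reordering the file, by measuring the widest line with count.fields first. It is only an illustration: the temporary file, the V1..Vn column names and the variable names are assumptions, and the sample rows are the four lines from the question.

```r
# Sample data from the question, written to a temporary file for illustration.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "IPR018351\tGRMZM2G458776",
  "IPR005731\tGRMZM2G047513",
  "IPR005732\tGRMZM2G087165\tGRMZM2G146818\tGRMZM2G427404",
  "IPR018355\tGRMZM2G082642\tGRMZM2G310283\tGRMZM2G406977\tGRMZM5G886785"
), path)

# The widest line decides how many columns read.table must allocate.
n_col <- max(count.fields(path, sep = "\t"))

# Supplying col.names of that length makes the line order irrelevant.
x <- read.table(path, sep = "\t", fill = TRUE, row.names = 1,
                col.names = paste0("V", seq_len(n_col)),
                stringsAsFactors = FALSE, quote = "", comment.char = "")

# One character vector per row, dropping the padding that fill = TRUE adds
# (empty strings in character columns, NA in columns that are entirely blank).
m <- as.matrix(x)
x_list <- lapply(split(m, row(m)), function(v) v[!is.na(v) & v != ""])
names(x_list) <- rownames(x)
```

Dropping the padding matters here: if the empty strings stay in, they end up being treated as one more gene identifier.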

Putative solution 2.
I tried to build the MgsaSets object for the mgsa package directly: I generated the appropriate R code from my file and pasted it into the command line. The problem is that code like x=new("MgsaSets",sets=list(IPR001844=c("AC215201.3_FG005","GRMZM2G009871"
works for small files but not for my big file, probably because it is too big/long. I got the error message
Error: unexpected ',' in ","
after every transition to the next line of my pasted code, e.g. ,IPR023193=c("GRMZM5G877500")
Now I really don't have any idea how I can create the desired file.

Sean Davis (National Institutes of Health, Bethesda, MD) wrote, 9.1 years ago:

Have a look at this code:

readSets <- function(fname) {
  # read the file one line at a time
  lines <- readLines(fname)
  # split each line on tab; strsplit returns one character vector per line
  fields <- strsplit(lines, '\t', fixed = TRUE)
  # create a list of the protein IDs by
  # removing the first member of each line
  # (lapply always returns a list; sapply may simplify to a vector or matrix)
  sets <- lapply(fields, '[', -1)
  # name the list with the InterPro IDs
  names(sets) <- vapply(fields, '[', character(1), 1)
  # remove any list items with length 0
  # These were blank lines (or IDs without genes) in the original file
  sets <- sets[lengths(sets) > 0]
  # return result
  sets
}

If test.txt is the filename, you would use the above function like so:

dat = readSets('test.txt')

And "dat" will look like:

$IPR018351
[1] "GRMZM2G458776"

$IPR005731
[1] "GRMZM2G047513"

$IPR005732
[1] "GRMZM2G087165" "GRMZM2G146818" "GRMZM2G427404"

$IPR018355
[1] "GRMZM2G082642" "GRMZM2G310283" "GRMZM2G406977" "GRMZM5G886785"
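
A named list of this shape can then be passed to the MgsaSets constructor in the form used in the question, new("MgsaSets", sets = ...). The following is a minimal sketch rather than a tested recipe for the full data set: the two-entry list stands in for the real result, the sanity checks and their variable names are assumptions added for illustration, and the mgsa step only runs if the package is installed.

```r
# A small list of the shape described above (values taken from the question).
dat <- list(
  IPR018351 = c("GRMZM2G458776"),
  IPR005732 = c("GRMZM2G087165", "GRMZM2G146818", "GRMZM2G427404")
)

# Sanity checks before building the object: unique set names, and no empty
# strings that could be counted as an extra gene.
stopifnot(!anyDuplicated(names(dat)))
has_empty_id <- any(unlist(dat) == "")
n_genes <- length(unique(unlist(dat)))

# Build the MgsaSets object only if the mgsa package is available.
if (requireNamespace("mgsa", quietly = TRUE)) {
  library(mgsa)
  annoIP <- new("MgsaSets", sets = dat)
  print(annoIP)
}
```

Comparing n_genes with the number of genes you expect is a quick way to spot a stray entry before running the analysis.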

boczniak767 replied, 9.1 years ago:

Thanks, the code did the work :)