Question

Generate column names on the fly R

0

7.1 years ago

Jack ▴ 120

So I have a gzipped test file that looks like this:

##ColumnVariables[gene_id]
##ParemeterValue[genome_assembly]=hg19
##ColumnVariables[column_1]
##ColumnVariables[column_2]
AAA    10    48
BBB    3    99

My code for opening the file looks like this:

# Specify the filename
gzip.file <- "testfile.txt.gz"


# Read the data on a dataframe. 
# If we open the file this way, the first row is going to be the filename. 
# So delete it.
gzip.df <- read.table(gzfile(gzip.file), header=F, fill=T, comment.char = '!')
gzip.df <- gzip.df[-1,]

What I want to do is use a regex to extract all the column_* names, store them in a vector, delete those rows from the data frame, and assign the vector as colnames.

My problem is that I'm an R newbie, and at this point I'm stuck at reading the data frame line by line.

Here's my code:

col.names <- c()
for (i in 1:nrow(gzip.df)) {
    #paste (gzip.df[i,])
    if (regexpr('\\[(.*)\\]', gzip.df[i,1])) {
        paste("HI")
        col <- regmatches(test.str, regexpr('\\[(.*)\\]', gzip.df[i,1]))
        col.names <- c(col.names, gzip.df[i,1])
    }
}

So basically I want to read it line by line and, whenever my regex matches, treat the line as a column name and store it. But it never gets into the if block. Why is that happening?

FYI, this is my third R script; I decided to move from Python.

R regex • 2.9k views

ADD COMMENT • link updated 7.1 years ago by Charles Plessy ★ 2.9k • written 7.1 years ago by Jack ▴ 120

1

Use Unix command-line tools to parse the file into a headered text file, where the first row consists of column headers. Reading this modified file into R is then trivial with read.table() or fread(). Use R for its strengths, which do not include text parsing.

ADD REPLY • link 7.1 years ago by Alex Reynolds 35k

1

I agree, mainly because for loops are not handled well by R (they are very slow). Preparing the files on the command line before R is a much simpler solution.

Also, are the column names identical for all files? If so, just store the header as a string in R and add the string as the first line of every file you read in.
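If the names really are identical for every file, one way to sketch this suggestion is to pass the stored names to read.table() through its col.names argument rather than editing each file. The names below are copied from the example in the question, and the helper name read_with_fixed_names is hypothetical.

```r
# Column names assumed identical across files (taken from the question's example).
fixed_names <- c("gene_id", "column_1", "column_2")

# Hypothetical helper: read one file, skip the '#' metadata lines,
# and label the columns with the stored names.
read_with_fixed_names <- function(path, col_names = fixed_names) {
    read.table(gzfile(path), header = FALSE, comment.char = "#",
               col.names = col_names, stringsAsFactors = FALSE)
}
```

A call such as read_with_fixed_names("testfile.txt.gz") would then return the data frame with the names already set.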

ADD REPLY • link 7.1 years ago by BioinfGuru ★ 1.7k

1

Yeah, not to be all chatty and stuff, but using R like this is like when my dad can't find a crescent wrench and so duct-tapes two flathead screwdrivers together. I guess it works.

ADD REPLY • link 7.1 years ago by Alex Reynolds 35k

score 0 · Answer 1 · 2017-04-21

0

7.1 years ago

Chris Miller 22k

You'll probably want to look into the scan() function, and use the nlines parameter. https://stat.ethz.ch/R-manual/R-devel/library/base/html/scan.html

Off the top of my head, I might do something like:

1) read in the first 100 lines of the file with scan, keep only those that match my comment character

2) massage them as necessary to extract the column names

3) re-read the whole file, using read.table with the comment.char param to skip the header

4) assign the column names to the df
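The four steps above could be sketched roughly as follows. This is only an illustration under some assumptions: it uses readLines() with its n argument in place of scan(), it assumes every metadata line starts with '#' and that the ##ColumnVariables[...] lines appear in the same order as the data columns, and the function name read_with_meta_names is hypothetical.

```r
# Hypothetical helper: name the columns of a '#'-commented table
# from its ##ColumnVariables[...] metadata lines.
read_with_meta_names <- function(path, n_header = 100) {
    # Step 1: read the first n_header lines and keep only the comment lines.
    con <- gzfile(path, "rt")
    head_lines <- readLines(con, n = n_header)
    close(con)
    meta <- head_lines[grepl("^#", head_lines)]

    # Step 2: keep the ColumnVariables lines and extract the bracketed text.
    col_lines <- grep("^##ColumnVariables\\[", meta, value = TRUE)
    col_names <- sub("^##ColumnVariables\\[(.*)\\].*$", "\\1", col_lines)

    # Step 3: re-read the whole file; comment.char skips the metadata lines.
    df <- read.table(gzfile(path), header = FALSE, comment.char = "#",
                     stringsAsFactors = FALSE)

    # Step 4: assign the names, but only if the counts agree.
    if (length(col_names) != ncol(df)) {
        stop("found ", length(col_names), " column names for ",
             ncol(df), " columns")
    }
    colnames(df) <- col_names
    df
}
```

Because the names are collected with a vectorised grep() and sub(), the same call works however many ColumnVariables lines a file has, as long as they fit within the first n_header lines.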

ADD COMMENT • link 7.1 years ago by Chris Miller 22k

0

The problem with your solution is when there are more columns. This has to be automated.

ADD REPLY • link 7.1 years ago by Jack ▴ 120

score 0 · Answer 2 · 2017-04-21

The format of your file looks like the Order Switchable Column Table (OSCT) format, but this format requires that "the first line after the comments/meta-data (see below) is a header line, which indicate column names of the table". If the OSCT format was intended by the people who provided you the data, perhaps you can ask them to correct the file. Then, having column names in R will be trivial.

Parsing the metadata to obtain column names could be done, but are you sure that the order of the ColumnVariables metadata will always be the same as the order of the columns? If not, I recommend manually checking the tables and inserting a header line.

If you still want to parse the metadata, maybe you can start with something like the following:

read.oscheader <- function (file) {
    n <- 1
    oscheader <- '#'
    while (substr(oscheader[n] ,1 ,1) == '#') {
        n <- n + 1
        oscheader <- readLines(file, n=n)
    }
    return(oscheader)
}

Once you have a nice parser I welcome a pull request on GitHub at charles-plessy/oscR, a draft package from which I pasted the function above :)
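As a possible starting point for such a parser, the sketch below pulls the bracketed names out of a character vector of header lines using regexec() and regmatches(). The helper name osc_column_names is hypothetical and not part of oscR, the sample lines are copied from the question, and it assumes the ColumnVariables lines are listed in column order.

```r
# Hypothetical helper: extract the names from ##ColumnVariables[...] lines.
osc_column_names <- function(header_lines) {
    pattern <- "^##ColumnVariables\\[(.*)\\]"
    # regexec() records the capture group; regmatches() returns, per line,
    # character(0) for no match or c(full_match, group_1) for a match.
    hits <- regmatches(header_lines, regexec(pattern, header_lines))
    matched <- hits[lengths(hits) == 2]
    vapply(matched, function(m) m[2], character(1))
}

# Header lines as they appear in the question's example file.
header_lines <- c("##ColumnVariables[gene_id]",
                  "##ParemeterValue[genome_assembly]=hg19",
                  "##ColumnVariables[column_1]",
                  "##ColumnVariables[column_2]",
                  "AAA    10    48")
column_names <- osc_column_names(header_lines)
```

Lines that do not match the pattern, such as the genome_assembly line and the first data row, are simply dropped, so the result can be assigned with colnames() once its length has been checked against the number of columns.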