Question: Generate column names on the fly in R
Jack (120) wrote 3.1 years ago:

So I have a gzipped test file that looks like this:

AAA    10    48
BBB    3    99

My code for opening the file looks like this:

# Specify the filename
gzip.file <- "testfile.txt.gz"

# Read the data on a dataframe. 
# If we open the file this way, the first row is going to be the filename. 
# So delete it.
gzip.df <- read.table(gzfile(gzip.file), header=F, fill=T, comment.char = '!')
gzip.df <- gzip.df[-1,]

What I want to do is use a regex to extract all the column_* entries, store them in a vector, delete them from the data frame, and assign them as colnames.

My problem is that I'm an R newbie, and at this point I'm stuck at reading the data frame line by line.

Here's my code:

col.names <- c()
for (i in 1:nrow(gzip.df)) {
    #paste (gzip.df[i,])
    if (regexpr('\\[(.*)\\]', gzip.df[i,1])) {
        col <- regmatches(test.str, regexpr('\\[(.*)\\]', gzip.df[i,1]))
        col.names <- c(col.names, gzip.df[i,1])
    }
}

So basically I want to read it line by line, and whenever my regex matches, treat that line as a column name and store it. But it never gets into the if block. Why is that happening?

FYI, this is my 3rd R-script, I decided to move from Python
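
One possible reason the test misbehaves: regexpr() returns the position of the first match, or -1 when there is none, and R treats any non-zero number as TRUE inside if, so the result is not a yes/no answer. In addition, test.str is never defined in the snippet. The following is a minimal sketch of a vectorised alternative that avoids the loop. The sample lines and their #ColumnVariables[...] layout are assumptions, since the real header lines are not shown in the post.

```r
# Hypothetical sample lines; the real file layout is not shown in the post.
lines <- c("#ColumnVariables[column_1]",
           "#ColumnVariables[column_2]",
           "AAA\t10\t48")

# regexpr() gives a match position, or -1 when nothing matches.
pos <- regexpr("\\[(.*)\\]", lines)

# grepl() gives a logical vector, which is what `if` or subsetting needs.
is.header <- grepl("\\[(.*)\\]", lines)

# Keep only the text between the brackets for the matching lines.
col.names <- sub(".*\\[(.*)\\].*", "\\1", lines[is.header])
```

Here col.names holds the bracketed names and lines[!is.header] holds the data rows, so no explicit for loop is needed.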

regex R • 1.7k views
modified 3.1 years ago by Charles Plessy (2.7k) • written 3.1 years ago by Jack (120)

Use Unix command-line tools to parse the file into a headered text file, where the first row consists of column headers. Then reading in this modified file into R is trivial with read.table() or fread(). Use R for its strengths, which do not include text parsing.

written 3.1 years ago by Alex Reynolds (30k)

I agree, mainly because for loops are not handled well by R (they are very slow). Preparing the files on the command line before loading them into R is a much simpler solution.

Also, are the column names identical for all files? If so, just store the header as a string in R and add it as the first line of every file you read in.

modified 3.1 years ago • written 3.1 years ago by YaGalbi (1.5k)
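
If the files really do share one layout, a variation on the stored-header idea is to keep the header in a string and pass the split names to read.table() through its col.names argument, rather than prepending text to each file. This is only a sketch: the header string, the column names, and the temporary file are made up for illustration.

```r
# Hypothetical header string and file contents, for illustration only.
header <- "name\tstart\tend"
tmp <- tempfile(fileext = ".txt")
writeLines(c("AAA\t10\t48", "BBB\t3\t99"), tmp)

# Split the stored header once and reuse it for every file with this layout.
shared.names <- strsplit(header, "\t", fixed = TRUE)[[1]]
df <- read.table(tmp, header = FALSE, sep = "\t",
                 col.names = shared.names, stringsAsFactors = FALSE)
```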

Yeah, not to be all chatty and stuff, but using R like this is like when my dad can't find a crescent wrench and so duct-tapes two flathead screwdrivers together. I guess it works.

written 3.1 years ago by Alex Reynolds (30k)
Chris Miller (21k), Washington University in St. Louis, MO, wrote 3.1 years ago:

You'll probably want to look into the scan() function, and use the nlines parameter.

Off the top of my head, I might do something like:

1) read in the first 100 lines of the file with scan, keep only those that match my comment character

2) massage them as necessary to extract the column names

3) re-read the whole file, using read.table with the comment.char param to skip the header

4) assign the column names to the df

written 3.1 years ago by Chris Miller (21k)
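
The four steps in this answer could be sketched roughly as follows. The example builds its own small gzipped file so it runs on its own. The '#' comment character, the #ColumnVariables[...] header layout, and the tab separator are assumptions, not details confirmed in the thread.

```r
# Build a small gzipped example; the '#' header layout is an assumption.
gz.path <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz.path, "w")
writeLines(c("#ColumnVariables[column_1]",
             "#ColumnVariables[column_2]",
             "#ColumnVariables[column_3]",
             "AAA\t10\t48",
             "BBB\t3\t99"), con)
close(con)

# 1) Read the first 100 lines with scan() and keep only the comment lines.
top <- scan(gz.path, what = character(), sep = "\n", nlines = 100, quiet = TRUE)
meta <- top[startsWith(top, "#")]

# 2) Extract the text between square brackets as the column names.
col.names <- sub(".*\\[(.*)\\].*", "\\1", meta[grepl("\\[.*\\]", meta)])

# 3) Re-read the whole file; comment.char = "#" skips the metadata lines.
df <- read.table(gz.path, header = FALSE, sep = "\t", comment.char = "#",
                 stringsAsFactors = FALSE)

# 4) Assign the names, after checking that the counts agree.
stopifnot(length(col.names) == ncol(df))
colnames(df) <- col.names
```

Because the names are extracted by pattern rather than typed in, the same code should work unchanged when a file has more columns, as long as the header lines follow the assumed layout.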

The problem with your solution is when there are more columns. This has to be automated.

written 3.1 years ago by Jack (120)
Charles Plessy (2.7k) wrote 3.1 years ago:

The format of your file looks like the Order Switchable Column Table (OSCT) format, but this format requires that "the first line after the comments/meta-data (see below) is a header line, which indicate column names of the table". If the OSCT format was intended by the people who provided you the data, perhaps you can ask them to correct the file. Then, having column names in R will be trivial.

Parsing the metadata to obtain column names could be done, but are you sure that the order of the ColumnVariables metadata will always be the same as the order of the columns? If not, I recommend manually checking the tables and inserting a header line.

If you still want to parse the metadata, maybe you can start with something like the following:

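read.oscheader <- function (file) {
    n <- 1
    oscheader <- '#'
    # Re-read one more line each pass until line n no longer starts with '#'
    while (substr(oscheader[n], 1, 1) == '#') {
        n <- n + 1
        oscheader <- readLines(file, n=n)
    }
    # Assumed completion of the truncated snippet: drop the first non-'#' line
    oscheader[-n]
}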

Once you have a nice parser I welcome a pull request on GitHub at charles-plessy/oscR, a draft package from which I pasted the function above :)

modified 3.1 years ago • written 3.1 years ago by Charles Plessy (2.7k)

I'm sure that the file will always be correct. Thank you for your code!

written 3.1 years ago by Jack (120)