Question: Generate column names on the fly R
3 months ago by
Jack60
Jack60 wrote:

So I have a gzipped test file that looks like this:

##ColumnVariables[gene_id]
##ParemeterValue[genome_assembly]=hg19
##ColumnVariables[column_1]
##ColumnVariables[column_2]
AAA    10    48
BBB    3    99

My code for opening the file looks like this:

# Specify the filename
gzip.file <- "testfile.txt.gz"


# Read the data on a dataframe. 
# If we open the file this way, the first row is going to be the filename. 
# So delete it.
gzip.df <- read.table(gzfile(gzip.file), header=F, fill=T, comment.char = '!')
gzip.df <- gzip.df[-1,]

What I want to do is use a regex to extract all the column_* names, store them in a vector, delete them from the dataframe, and assign them as colnames.

My problem is that I'm an R newbie, and at this point I'm stuck at reading the dataframe line by line.

Here's my code:

col.names <- c()
for (i in 1:nrow(gzip.df)) {
    #paste (gzip.df[i,])
    if (regexpr('\\[(.*)\\]', gzip.df[i,1])) {
        paste("HI")
        col <- regmatches(test.str, regexpr('\\[(.*)\\]', gzip.df[i,1]))
        col.names <- c(col.names, gzip.df[i,1])
    }
}

So basically I want to read it line by line, and whenever my regex matches, the line is a column name and I store it. But it never gets into the if-scope. Why is that happening?

FYI, this is my 3rd R-script, I decided to move from Python

regex R • 250 views
modified 3 months ago by Charles Plessy2.1k • written 3 months ago by Jack60

Use Unix command-line tools to parse the file into a headered text file, where the first row consists of column headers. Then reading this modified file into R is trivial with read.table() or fread(). Use R for its strengths, which do not include text parsing.

written 3 months ago by Alex Reynolds20k

I agree, mainly because for loops are not handled well by R (they are very slow). Preparing the files on the command line before R is a much simpler solution.

Also, are the column names identical for all files? If so, just store the header as a string in R and add it as the first line of every file you read in.
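
A minimal sketch of that idea, assuming the column names really are the same for every file. The three names and the sample lines are taken from the question; `header` and `sample.lines` are illustrative names only, and the `col.names` argument of read.table() is used instead of physically prepending a line to each file.

```r
# Column names stored once, assumed identical for every file
# (these three are taken from the sample in the question).
header <- c("gene_id", "column_1", "column_2")

# Stand-in for the contents of one input file.
sample.lines <- c("##ColumnVariables[gene_id]",
                  "##ColumnVariables[column_1]",
                  "##ColumnVariables[column_2]",
                  "AAA\t10\t48",
                  "BBB\t3\t99")

# comment.char = "#" skips the metadata lines; col.names supplies the header.
df <- read.table(text = sample.lines, header = FALSE, comment.char = "#",
                 col.names = header, stringsAsFactors = FALSE)
print(df)
```

For a real file, `text = sample.lines` would be swapped for something like `gzfile("testfile.txt.gz")`.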

modified 3 months ago • written 3 months ago by kennethcondon2007810

Yeah, not to be all chatty and stuff, but using R like this is like when my dad can't find a crescent wrench and so duct-tapes two flathead screwdrivers together. I guess it works.

written 3 months ago by Alex Reynolds20k
3 months ago by
Chris Miller18k
Washington University in St. Louis, MO
Chris Miller18k wrote:

You'll probably want to look into the scan() function, and use the nlines parameter. https://stat.ethz.ch/R-manual/R-devel/library/base/html/scan.html

Off the top of my head, I might do something like:

1) read in the first 100 lines of the file with scan, keep only those that match my comment character

2) massage them as necessary to extract the column names

3) re-read the whole file, using read.table with the comment.char param to skip the header

4) assign the column names to the df
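
The four steps above could be sketched roughly as follows. This is only a sketch under assumptions: the helper name `read.with.header.names` is made up for illustration, it uses readLines() rather than scan() for the header pass, it assumes every `##ColumnVariables[...]` line names one column and that those lines appear in column order, and it writes a small sample file so that it runs on its own.

```r
# Build a small gzipped sample file so the sketch is self-contained.
sample.file <- tempfile(fileext = ".txt.gz")
con <- gzfile(sample.file, "w")
writeLines(c("##ColumnVariables[gene_id]",
             "##ParemeterValue[genome_assembly]=hg19",
             "##ColumnVariables[column_1]",
             "##ColumnVariables[column_2]",
             "AAA\t10\t48",
             "BBB\t3\t99"), con)
close(con)

# Hypothetical helper, not part of any package.
read.with.header.names <- function(path, n.header = 100) {
    # 1) Read the top of the file and keep only the comment lines.
    top <- readLines(gzfile(path), n = n.header)
    meta <- top[grepl("^#", top)]
    # 2) Keep the ColumnVariables lines and extract the text between [ and ].
    col.lines <- grep("^##ColumnVariables\\[", meta, value = TRUE)
    col.names <- sub("^##ColumnVariables\\[(.*)\\].*$", "\\1", col.lines)
    # 3) Re-read the whole file; comment.char = "#" skips the metadata lines.
    df <- read.table(gzfile(path), header = FALSE, comment.char = "#",
                     stringsAsFactors = FALSE)
    # 4) Assign the extracted names to the data frame.
    colnames(df) <- col.names
    df
}

gzip.df <- read.with.header.names(sample.file)
print(gzip.df)
```

grepl() is used for the test because it returns TRUE or FALSE for each line. regexpr() returns -1 when there is no match, and if() treats any nonzero number as true, so it is not a safe condition on its own. Note also that a bare paste("HI") inside a for loop prints nothing; print() or message() is needed to see output there.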

written 3 months ago by Chris Miller18k

The problem with your solution arises when there are more columns. This has to be automated.

written 3 months ago by Jack60
3 months ago by
Charles Plessy2.1k
Japan
Charles Plessy2.1k wrote:

The format of your file looks like the Order Switchable Column Table (OSCT) format, but this format requires that "the first line after the comments/meta-data (see below) is a header line, which indicate column names of the table". If the OSCT format was intended by the people who provided you the data, perhaps you can ask them to correct the file. Then, having column names in R will be trivial.

Parsing the metadata to obtain column names could be done, but are you sure that the order of the ColumnVariables metadata will always be the same as the order of the columns? If not, I recommend manually checking the tables and inserting a header line.

If you still want to parse the metadata, maybe you can start with something like the following:

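read.oscheader <- function (file) {
    # Re-read the first n lines, growing n until line n is not a comment.
    n <- 1
    oscheader <- '#'
    # isTRUE() ends the loop cleanly if the file runs out of lines first.
    while (isTRUE(substr(oscheader[n], 1, 1) == '#')) {
        n <- n + 1
        oscheader <- readLines(file, n=n)
    }
    # The result holds the comment lines plus the first non-comment line.
    return(oscheader)
}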

Once you have a nice parser I welcome a pull request on GitHub at charles-plessy/oscR, a draft package from which I pasted the function above :)
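
As one possible variation on the function above, the header could be collected in a single pass over an open connection instead of re-reading the file with a growing `n`. This is a hedged sketch, not code from oscR: the name `read.header.once` is hypothetical, it returns only the comment lines (the function above also returns the first non-comment line), and the sample file exists only to make the example runnable.

```r
# Build a small gzipped sample file so the sketch is self-contained.
sample.file <- tempfile(fileext = ".txt.gz")
con <- gzfile(sample.file, "w")
writeLines(c("##ColumnVariables[gene_id]",
             "##ParemeterValue[genome_assembly]=hg19",
             "##ColumnVariables[column_1]",
             "##ColumnVariables[column_2]",
             "AAA\t10\t48",
             "BBB\t3\t99"), con)
close(con)

# Hypothetical helper: read comment lines one at a time from an open connection.
read.header.once <- function(path) {
    con <- gzfile(path, "r")
    on.exit(close(con))
    header <- character(0)
    repeat {
        line <- readLines(con, n = 1)
        # Stop at end of file or at the first line that is not a comment.
        if (length(line) == 0 || !startsWith(line, "#")) break
        header <- c(header, line)
    }
    header
}

osc.header <- read.header.once(sample.file)
print(osc.header)
```

Because the connection stays open between calls, each readLines() continues where the previous one stopped, so every line is read only once.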

modified 3 months ago • written 3 months ago by Charles Plessy2.1k

I'm sure that the file will always be correct. Thank you for your code!

written 3 months ago by Jack60