Reading multiple CSV files and performing logistic regression on each file separately

I am trying to perform a logistic regression analysis on multiple CSV files. All the files have the same column names (DateTime, HvacMode, Event, Schedule, ...). How can I load all the CSV files in a folder and run the logistic regression on each of them with a single R script, so that I can refer back to each analysis's results later? My goal is to find a pattern across the different CSV samples.

For a single file I am using:

glm(Event ~ T_ctrl + T_out + RH_out + T_stp_cool + T_stp_heat + Humi, data=logistic, family=binomial(link="logit"))

and I need to do this for all the CSV files in my working directory.

Thank you


Base R answer:

# Replace '...' with the path to the directory with your files.
files <- list.files(..., pattern="\\.csv$", full.names=TRUE)
files <- setNames(files, basename(files))

results <- lapply(files, function(x) {
  df <- read.csv(x, stringsAsFactors=FALSE)
  df$fan <- as.numeric(df$fan > 0)
  result <- glm(
    Event ~ T_ctrl + T_out + RH_out + T_stp_cool + T_stp_heat + Humi,
    data=df, family=binomial(link="logit")
  )
  result <- summary(result)$coefficients
  return(result)
})

data.table will speed things up a bit if the files are big and/or you have a lot of files.

library("data.table")

# Replace '...' with the path to the directory with your files.
files <- list.files(..., pattern="\\.csv$", full.names=TRUE)
files <- setNames(files, basename(files))

results <- lapply(files, function(x) {
  DT <- fread(x, sep=",")
  set(DT, j="fan", value=as.numeric(DT[, fan] > 0))
  result <- glm(
    Event ~ T_ctrl + T_out + RH_out + T_stp_cool + T_stp_heat + Humi,
    data=DT, family=binomial(link="logit")
  )
  result <- summary(result)$coefficients
  return(result)
})

Worked perfectly!

I have another question. One of my data columns is called "fan" and it contains values for the fan speed. I want to turn this column into a binary column (whether the fan is on or off). How can I change the values greater than 0 to 1 and leave the zeros unchanged in all of my CSV files? Thank you


I've updated the answer to add this step. In R, TRUE == 1 and FALSE == 0, so you can use as.numeric(fan > 0) to turn any value above 0 into 1.
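
For example, a minimal sketch with a made-up fan vector:

fan <- c(0, 0, 3, 7, 0)   # hypothetical fan speeds
as.numeric(fan > 0)       # returns 0 0 1 1 0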


Thank you!

In general, if I want to replace a specific value in a column across all of my CSV files, what R code should I use? For example, one of my columns is Event, and it contains character values like HOLD, smart away, etc., as well as some empty values. How can I replace the HOLD values with 1 and all the other values with 0, and delete the empty rows, in all of the CSV files in the folder?


For your 'Event' example, you can more or less do the same thing as we did for fan: as.numeric(Event == "HOLD") will convert "HOLD" to 1 and everything else to 0. For more complicated replacements there are convenient functions from the tidyverse such as case_when and fct_collapse, so it will depend on exactly what you want to do.
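
A rough sketch of both approaches (the Event values are just placeholders based on your description, and in practice this would go inside the lapply() call from the answer):

library(dplyr)

df <- data.frame(Event = c("HOLD", "smart away", "", "HOLD"), stringsAsFactors=FALSE)

# Base R: 1 if HOLD, 0 otherwise
df$Event_bin <- as.numeric(df$Event == "HOLD")

# tidyverse: drop empty Event rows, then recode with case_when()
df <- df %>%
  filter(Event != "") %>%
  mutate(Event_bin = case_when(
    Event == "HOLD" ~ 1,
    TRUE ~ 0
  ))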

As for your question about empty rows, do you mean rows that are all just NA values?


When I ran this code on two CSV files it worked, but when I try to apply it to all of my CSV files, after some time the analysis stops with an error: Error: cannot allocate vector of size 788 Kb. How can I fix it?


That error message is telling you that R ran out of memory. How many files are you processing, and how big are they? It's possible that you can't fit the model objects for every file in memory at once. There are ways to work around that if that's the case.


I am working with 13748 CSV files and the total size is 70 GB. My laptop has 8 GB of RAM, and the folder I am working in has 112 GB of free space.


So you definitely can't fit all the models in memory at once. Do you need anything besides the p-values from the models?


P-values are the most important values I am looking for, but I also need the regression coefficients. Is there any way to extract both the coefficients and the p-values?


The code is updated so it returns only a minimal summary table that has coefficients, p-values, and a few other things.
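
For example, assuming the lapply() version above, each element of results is the coefficient matrix returned by summary(), keyed by file name (the file name here is made up):

coefs <- results[["sample1.csv"]]   # hypothetical file name
coefs[, "Estimate"]                 # regression coefficients
coefs[, "Pr(>|z|)"]                 # p-values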


In this question, I tried to investigate the impact of different parameters on the fan using logistic regression for each file separately. How can I first bind all 13748 CSV files into one dataset in R and perform a logistic regression on the combined data, given that my laptop has 8 GB of memory? (I need the R-squared value, p-values, and coefficients.)


This is a more difficult question to answer. You'll likely need to use a relational database like SQL, and/or a method such as batch gradient descent. I would suggest making this a new question to get a more detailed response.
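
As one possible direction, here is a rough, untested sketch using the biglm package's bigglm(), which fits a GLM from data supplied in chunks so the full 70 GB never has to sit in memory at once. The path, the fan recoding, and the reader function are assumptions based on this thread:

library(biglm)

# Replace the path with the directory containing your files.
files <- list.files("path/to/csvs", pattern="\\.csv$", full.names=TRUE)

# bigglm() accepts a data function: called with reset=TRUE it rewinds,
# otherwise it returns the next chunk of rows, or NULL when done.
make_reader <- function(paths) {
  i <- 0
  function(reset = FALSE) {
    if (reset) { i <<- 0; return(NULL) }
    i <<- i + 1
    if (i > length(paths)) return(NULL)
    df <- read.csv(paths[i], stringsAsFactors=FALSE)
    df$fan <- as.numeric(df$fan > 0)
    df
  }
}

fit <- bigglm(
  Event ~ T_ctrl + T_out + RH_out + T_stp_cool + T_stp_heat + Humi,
  data=make_reader(files), family=binomial(link="logit")
)
summary(fit)

Also note that a conventional R-squared is not defined for logistic regression; pseudo-R-squared measures are usually reported instead.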
