Question: Reading multiple CSV files and performing logistic regression on each file separately
paramount.amin wrote (7 weeks ago):

I am trying to perform a logistic regression analysis on multiple CSV files. All the files have the same column names (DateTime, HvacMode, Event, Schedule, ...). How can I load all the CSV files in a folder and run the logistic regression on each of them from a single R script, in a way that lets me refer back to each file's results later? My goal is to find patterns across the different CSV samples.

For a single file I am using:

glm(Event ~ T_ctrl + T_out + RH_out + T_stp_cool + T_stp_heat + Humi, data=logistic, family=binomial(link="logit"))

and I need to do this for all of the CSV files in my working directory.

Thank you

rpolicastro wrote (7 weeks ago):

Base R answer:

# Replace '...' with the path to the directory with your files.
files <- list.files(..., pattern="\\.csv$", full.names=TRUE)
files <- setNames(files, basename(files))

results <- lapply(files, function(x) {
  df <- read.csv(x, stringsAsFactors=FALSE)
  # Recode fan speed as binary: 1 if the fan is on (speed > 0), 0 if off.
  df$fan <- as.numeric(df$fan > 0)
  result <- glm(
    Event ~ T_ctrl + T_out + RH_out + T_stp_cool + T_stp_heat + Humi,
    data=df, family=binomial(link="logit")
  )
  # Keep only the coefficient summary table (estimates, standard errors,
  # z values, and p-values) instead of the full model object.
  result <- summary(result)$coefficients
  return(result)
})

data.table will speed things up a bit if the files are big and/or you have a lot of files.

library("data.table")

# Replace '...' with the path to the directory with your files.
files <- list.files(..., pattern="\\.csv$", full.names=TRUE)
files <- setNames(files, basename(files))

results <- lapply(files, function(x) {
  DT <- fread(x, sep=",")
  # Recode fan speed as binary: 1 if the fan is on (speed > 0), 0 if off.
  set(DT, j="fan", value=as.numeric(DT[, fan] > 0))
  result <- glm(
    Event ~ T_ctrl + T_out + RH_out + T_stp_cool + T_stp_heat + Humi,
    data=DT, family=binomial(link="logit")
  )
  # As above, keep only the coefficient summary table.
  result <- summary(result)$coefficients
  return(result)
})

Worked perfectly!

I have another question. One of my data columns is called "fan" and it contains values for the fan speed. I want to convert this column into a binary column (whether the fan is on or off). How can I change the values greater than 0 to 1 and leave the zeros unchanged, in all of my CSV files? Thank you

written 7 weeks ago by paramount.amin

I've updated the answer to add this step. In R, TRUE coerces to 1 and FALSE to 0, so as.numeric(fan > 0) turns any value above 0 into 1 and leaves 0 unchanged.
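For example, on a toy fan-speed vector:

fan <- c(0, 2, 0, 5)
as.numeric(fan > 0)
#> [1] 0 1 0 1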

written 7 weeks ago by rpolicastro

Thank you!

In general, if I want to replace a specific value in a column for all of my CSV files, what R code should I use? For example, one of my columns is Event, and it contains character values like HOLD, smart away, etc., as well as some empty values. How can I replace the HOLD values with 1, set all the other values to 0, and delete the empty rows, for all the CSV files in the folder?

written 6 weeks ago by paramount.amin

For your 'Event' example, you can more or less do the same thing as we did for fan: as.numeric(Event == "HOLD") will convert "HOLD" to 1 and everything else to 0. For more complicated replacements there are some convenient functions from the tidyverse, such as case_when and fct_collapse, so it will depend on exactly what you want to do. A sketch of both is below.
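As a minimal sketch (the file name is hypothetical):

library("dplyr")  # for case_when

df <- read.csv("some_file.csv", stringsAsFactors=FALSE)

# Simple case: HOLD becomes 1, every other value (including "") becomes 0.
df$Event <- as.numeric(df$Event == "HOLD")

# Equivalent with case_when, which scales to more complicated recodings:
# df$Event <- case_when(df$Event == "HOLD" ~ 1, TRUE ~ 0)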

As for your question about empty rows, do you mean rows that are all just NA values?

written 6 weeks ago by rpolicastro

When I ran this code on two CSV files it worked, but when I try to apply it to all of my CSV files, after some time the analysis stops with an error: "Error: cannot allocate vector of size 788 Kb". How can I fix it?

written 6 weeks ago by paramount.amin

That error message is telling you that R ran out of memory. How many files are you processing, and how big are the files themselves? It's possible that you can't fit a model object for every file in memory at once. There are ways to work around that if that's the case; one option is sketched below.
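For instance (a sketch, assuming it is acceptable to write intermediate results to disk): process the files in a plain loop and save each coefficient table with saveRDS() instead of accumulating everything in one list.

files <- list.files("path/to/csvs", pattern="\\.csv$", full.names=TRUE)  # hypothetical path

for (f in files) {
  df <- read.csv(f, stringsAsFactors=FALSE)
  df$fan <- as.numeric(df$fan > 0)
  fit <- glm(
    Event ~ T_ctrl + T_out + RH_out + T_stp_cool + T_stp_heat + Humi,
    data=df, family=binomial(link="logit")
  )
  # Save only the small coefficient table, one .rds file per input CSV.
  saveRDS(summary(fit)$coefficients, file=paste0(f, ".glm.rds"))
  rm(df, fit)
  gc()  # free memory before the next file
}

Each saved table can later be reloaded with readRDS().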

written 6 weeks ago by rpolicastro

I am working with 13,748 CSV files and the total size is 70 GB. My laptop has 8 GB of RAM, and the drive I am working on has 112 GB of free space.

written 6 weeks ago by paramount.amin

So you definitely can't fit all the models in memory at once. Do you need anything besides the p-values from the models?

written 6 weeks ago by rpolicastro

The p-values are the most important values I am looking for, but I also need the regression coefficients. Is there a way to extract both the coefficients and the p-values?

written 6 weeks ago by paramount.amin

The code is updated so that it returns only a minimal summary table containing the coefficients, p-values, and a few other things. An example of reading values back out of it is below.
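To illustrate: for a binomial glm, summary(fit)$coefficients is a matrix with columns Estimate, Std. Error, z value, and Pr(>|z|), so each file's values can be pulled out by column (the file name here is hypothetical):

res <- results[["some_file.csv"]]  # one entry per input file, keyed by file name
res[, "Estimate"]                  # regression coefficients
res[, "Pr(>|z|)"]                  # p-values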

written 6 weeks ago by rpolicastro

In this question I tried to investigate the impact of different parameters on the fan using a separate logistic regression for each file. How can I instead bind all 13,748 CSV files into one dataset in R and perform a single logistic regression on it, given that my laptop has 8 GB of RAM? (I need the R-squared value, p-values, and coefficients.)

written 6 weeks ago by paramount.amin

This is a more difficult question to answer. You'll likely need to use a relational database such as SQL, and/or a method such as batch gradient descent, so the model can be fit without ever holding all of the data in memory at once. I would suggest posting this as a new question to get a more detailed response. One possible direction is sketched below.
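As a sketch of that idea (not from the original answer; it assumes the biglm package, whose bigglm() accepts a function that supplies the data in chunks, so only one file needs to be in memory at a time):

library("biglm")

files <- list.files("path/to/csvs", pattern="\\.csv$", full.names=TRUE)  # hypothetical path
i <- 0

# bigglm() calls this repeatedly: reset=TRUE restarts from the first file,
# and returning NULL signals that all chunks have been supplied.
next_chunk <- function(reset=FALSE) {
  if (reset) { i <<- 0; return(NULL) }
  i <<- i + 1
  if (i > length(files)) return(NULL)
  df <- read.csv(files[i], stringsAsFactors=FALSE)
  df$fan <- as.numeric(df$fan > 0)
  df
}

fit <- bigglm(
  Event ~ T_ctrl + T_out + RH_out + T_stp_cool + T_stp_heat + Humi,
  data=next_chunk, family=binomial(link="logit")
)
summary(fit)  # pooled coefficients and p-values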

written 6 weeks ago by rpolicastro