Question

How to apply lapply function and loop function at the same time

0

Entering edit mode

2.3 years ago

Phylicia ▴ 10

Hello！I have a problem in which I want to combine lappy function（corresponding more than 10 phenotypes）with loop function（I have 7 files and I will select which is the best-fit corresponding one of the previous phenotypes).

So my strategy is to use lappy combined with loop function. But I do not know how. Hope you could help me! Thank you.

I try to use two loop functions but if fails. Because in loop, for（n in 100), the position of 100 must be a number and it could not be 100 values（characters--corresponding to 100 phenotypes）.

The loop function (the second loop) for best-fit was referred to the PRS tutorial (https://choishingwan.github.io/PRS-Tutorial/plink/). The 100 phenotypes are corresponding to the dependent variable in the best-fit choosing script. It means I need to do 100 times regression to choose one of seven to be the best-fit for each phenotype.

enter image description here

Here is my code.

#read the 100 phenotypes's names
pheno_name <- read.table("XXX.csv", header=F, sep =",")
pheno_name <- pheno_name[c(1),c(4:103)] ###M: Overwrites pheno_name without using the previous value; what is nmr_name?
name_value <- as.character(pheno_name[1,])


for(phenotype_n in name_value){
p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)  ###M: This doesn't change in the loop, can be defined before, once

# Read in the phenotype file 
phenotype <- cf[,c("Batch","phenotype_n","NewID")] ###M: Same here; what is cf?

# Read in the PCs
pcs <- read.table("EUR.eigenvec", header=F) ###M: Just restoring the original value from the file, could it be read only once, then make a copy?

# The default output from plink does not include a header
# To make things simple, we will add the appropriate headers
# (1:6 because there are 6 PCs)
colnames(pcs) <- c("FID", "IID", paste0("PC",1:6)) 
# To adjust with our dataset, it needs to change the format of IID
pcs <- pcs %>% mutate(NewID = gsub('.*-([0-9]+).*','\\1',prs$IID))


# Read in the covariates (here, it is sex)
covariate <- cov_new3[,c(5,18,22,23)] ###M: what is cov_new3? Is that constant?

# Now merge the files
pheno <- merge(merge(phenotype, covariate, by=c("NewID", "NewID")), pcs, by=c("NewID","NewID"))

pheno <- pheno[,c(1,2,3,4,6,7,9,10,11,12,13,14)]

# We can then calculate the null model (model with PRS) using a linear regression 
# (as Total-C is quantitative)
#The adjustment of tutorial only adjust PCs and sex
#Here we additionally adjust Batch and age
null.model <- lm(phenotype_n~., data=pheno[,!colnames(pheno)%in%c("FID","NewID")])

# And the R2 of the null model is 
null.r2 <- summary(null.model)$r.squared

prs.result <- NULL
###M: can we assume what follows works correctly because it is from the tutorial?
for(i in p.threshold){
  # Go through each p-value threshold
  prs <- read.table(paste0("EAS.",i,".profile"), header=T)
  #Mutate 6 digit New ID
  prs <- prs %>% mutate(NewID = gsub('.*-([0-9]+).*','\\1',prs$IID))
  # Merge the prs with the phenotype matrix
  # We only want the FID, IID and PRS from the PRS file, therefore we only select the 
  # relevant columns
  pheno.prs <- merge(pheno, prs[,c("FID","NewID", "SCORE")], by=c("FID", "NewID"))
  # Now perform a linear regression on Height with PRS and the covariates
  # ignoring the FID and IID from our model
  model <- lm(phenotype_n~., data=pheno.prs[,!colnames(pheno.prs)%in%c("FID","NewID")])
  # model R2 is obtained as 
  model.r2 <- summary(model)$r.squared
  # R2 of PRS is simply calculated as the model R2 minus the null R2
  prs.r2 <- model.r2-null.r2
  # We can also obtain the coeffcient and p-value of association of PRS as follow
  prs.coef <- summary(model)$coeff["SCORE",]
  prs.beta <- as.numeric(prs.coef[1])
  prs.se <- as.numeric(prs.coef[2])
  prs.p <- as.numeric(prs.coef[4])
  # We can then store the results
  prs.result <- rbind(prs.result, data.frame(Threshold=i, R2=prs.r2, P=prs.p, BETA=prs.beta,SE=prs.se))
}
# Best result is:
  write.path <- paste("C:\\best\\","phenotype_n",sep="")
  dir.create(paste("C:\\best\\","phenotype_n",sep=""))
  write.table(prs.result[which.max(prs.result$R2),],write.path)
}

loop lappy Plink • 2.0k views

ADD COMMENT • link 2.2 years ago by Phylicia ▴ 10

2

Entering edit mode

you can write a function for either one and call it.

> I try to use two loop functions but if fails.

Please post error.

It would be helpful to post some example data and also a simplified version of your code. Long codes are difficult to troubleshoot and needs extra time. In addition, please do not post images of data and/or code. It is not helpful. In addition, posted code image of low resolution.

ADD REPLY • link 2.3 years ago by cpad0112 21k

0

Entering edit mode

I will try to rephrase my question.

The screenshot I paste is the right code and it is a loop function in which it selects the best-fit file from 7 files based on the regression results. One of the phenotypes will be the dependent variable of the regression.

The loop function has already contained the information of "selecting one from 7 when going through all of the 7 files". Could I also add a lappy function, ask to go through 100 phenotypes, and for each phenotype I could choose the best-fit one from 7?

My original code is too long, so I just paste the right code from the tutorial to show the right loop function.

Thanks a lot！

ADD REPLY • link 2.3 years ago by Phylicia ▴ 10

1

Entering edit mode

You might have to explain what you're trying to do more simply. You can certainly use a loop within lapply(), and use lapply() within a loop. e.g.:

# for a bunch of phenotypes in a list, and a vector of file names
lapply(phenotypes, function(p){
      for(i in seq_along(files)){
         # do something for phenotype: p and file: files[i]
      }
})

When you say:

in loop, for(n in 100), the position of 100 must be a number

an easy way to ensure you have numbers in your loop is to use the seq_along() function to iterate over a vector of positions.

ADD REPLY • link 2.3 years ago by seidel 11k

0

Entering edit mode

Thanks a lot for your suggestion. Actually, I should do the same thing for more than 100 phenotypes. I do not know how to represent the 100 phenotypes in each loop function respectively. The phenotype will be a dependent variable for regression in the loop function. Could you give me more suggestions?

ADD REPLY • link 2.3 years ago by Phylicia ▴ 10

1

Entering edit mode

Please show us your code.

In case you really wrote for（n in 100) this will give you only one iteration with n=100. I guess you meant for (n in 1:100) which gives you 100 iterations on the sequence 1..100. Hope this helps, if not show your code as text.

ADD REPLY • link 2.3 years ago by Michael 54k

0

Entering edit mode

I have added my code which is a little bit long.

In the loop it will select the best-fit file from 7 files, but the selection process depends on a phenotype value. I have 100 phenotypes. So I want to try 100 phenotypes by lapply function (maybe?).

Thanks a lot and look forward to your kind help！

ADD REPLY • link 2.3 years ago by Phylicia ▴ 10

0

Entering edit mode

It's a little difficult to debug this without the input data. Also, your code uses some variables from the environment. Could you make a self-contained piece of code that could run without external files in an R --vanilla environment? Also, what is the exact error message including traceback()?

I will annotate your code with some comments where it is not clear to me what's happening. Look for comments like this ###M

ADD REPLY • link 2.3 years ago by Michael 54k

0

Entering edit mode

Thanks, I will try to prepare the data input and invite you to help a few days later.

ADD REPLY • link 2.3 years ago by Phylicia ▴ 10

0

Entering edit mode

Fine, it would definitely help if we could simply run the code to try to reproduce any error.

ADD REPLY • link 2.3 years ago by Michael 54k

1

Entering edit mode

Just two observations:

(1) name_value is essentially a row from XXX.csv (specifically, the first row including values from columns 4:103). This is your vector of phenotypes which you are trying to iterate over. You iterate over this in the first loop, getting the values one by one in the variable: phenotype_n. Correct?

(2) You only use phenotype_n two times. Each time in a call to lm(). The other times you use this name it is surrounded by quotes ("phenotype_n") which means it isn't being used as a variable. This is especially apparent in the last section when you try to save the Best Result. Your code for write.path() doesn't make sense as written because you're simply concatenating two strings ("C:\...\","phenotype_n"). I think you really mean to concatenate the phenotype name into the file path: paste("C:\\best\\", phenotype_n, sep=""). (no quotes around phenotype_n) Otherwise, you're simply overwriting the same file with each pass through the loop.

ADD REPLY • link updated 2.2 years ago by Michael 54k • written 2.3 years ago by seidel 11k

0

Entering edit mode

Thanks a lot！

ADD REPLY • link 2.2 years ago by Phylicia ▴ 10