In R: Print column names if value is greater than a value in a different column.
0
0
Entering edit mode
4.0 years ago
rjobmc • 0

In R, I am working with a huge file of DNA methylation data (beta values). I need to identify diseased individuals that have >0.33 methylation value against the average of our controls, per gene. The file (df.csv below) has individual methylation values per gene (A, B, C in example below) and the average of the controls (AveControl.value in the example below) is in the column at the end of the file. For ease, I would like to print the individual(s) (column name(s)) that have >0.33 vs controls in a new column at the end of the file. If there is more than one individual to print, the individual name should be separated by a comma. If no individual shows >0.33, then it should remain blank. Below I have given an example of my data and what I need as an output:

Example of my file: (I have added a space between commas so it's easier to see the data values) In bold I have highlighted the values for the individuals I am looking to extract

df.csv

Gene, Indiv1.value, Indiv2.value, Indiv3.value, AveControl.value

A, 0.1, 0.2, 0.5, 0.1

B, 0.1, 0.2, 0.2, 0.2

C, 0.1, 0.9, 0.8, 0.4

*Example of the output I require - with new column RESULT containing the column name of the individual with >0.33 methylation vs controls. If no individuals meet the requirement, the entry should be empty.

Gene, Indiv1.value, Indiv2.value, Indiv3.value, AveControl.value, RESULT

A, 0.1, 0.2, 0.5, 0.1 Indiv3.value

B, 0.1, 0.2, 0.2, 0.2

C, 0.1, 0.9, 0.8, 0.4 Indiv2.value, Indiv3.value

I have been trying to find a way to do this but I have already lost hours. Any help will be greatly appreciated. Btw, In reality, I have 100's of diseased individuals not just the 3 I am showing here. Many thanks in advance.

R data extraction • 4.3k views
ADD COMMENT
0
Entering edit mode

There is a code option that is recommended to highlight code. You can edit your post with the edit button.

https://i.ibb.co/yFWj21m/Screen-Shot-2019-10-01-at-10-49-40.png

ADD REPLY
0
Entering edit mode

you can use following solution if you are looking for column values more than 0.33 including average control:

> df=read.csv("test.txt", header = T, stringsAsFactors = F, strip.white = T)
> df
  Gene Indiv1.value Indiv2.value Indiv3.value AveControl.value
1    A          0.1          0.2          0.5              0.1
2    B          0.1          0.2          0.2              0.2
3    C          0.1          0.9          0.8              0.4
> df$new=apply(df[,-1],1,function(x) names(df[,-1])[which(x>0.33)])
> df
  Gene Indiv1.value Indiv2.value Indiv3.value AveControl.value                                          new
1    A          0.1          0.2          0.5              0.1                                 Indiv3.value
2    B          0.1          0.2          0.2              0.2                                             
3    C          0.1          0.9          0.8              0.4 Indiv2.value, Indiv3.value, AveControl.value

if you are looking for column names that compare all columns against last column (Avecontrol):

> df$new=apply(df[, 2:4] > df[, 5], 1, function(x) ifelse(any(x), paste(colnames(df)[2:4][x], collapse=', '), 'None'))
> df
  Gene Indiv1.value Indiv2.value Indiv3.value AveControl.value                        new
1    A          0.1          0.2          0.5              0.1 Indiv2.value, Indiv3.value
2    B          0.1          0.2          0.2              0.2                       None
3    C          0.1          0.9          0.8              0.4 Indiv2.value, Indiv3.value
ADD REPLY
0
Entering edit mode

Dear cpad0112, Many many thanks for your reply, it's a huge help.

I require something similar to your second command line (comparing all columns against the last column (Avecontrol)). I would like to compare all columns against the last column but only print those that are >0.33 than the last column.

ADD REPLY
0
Entering edit mode

The first solution does that. It prints all the individual names whose name column values are more than 0.33, in the last column

ADD REPLY
0
Entering edit mode

Sorry, I didn't explain properly. I only need those columns where the value is 0.33 above the AveControl.value column.

If AveControl.value is 0.5 and another column is 0.7, I do not need the AveControl.value (as the difference is less than 0.33). There must be a difference equal to or more than 0.33 as compared to the AveControl.value column value. I hope I have explained myself better!

Thanks so much.

ADD REPLY
0
Entering edit mode

got it...will update the solution soon...@ rjobmc

ADD REPLY
0
Entering edit mode

Thank you very much @ cpad0112.

ADD REPLY
1
Entering edit mode

My understanding is that only those column names with rows above 0.33 oon Average control are reported in a new column

$ df=read.csv("test.txt", header = T, stringsAsFactors = F, strip.white = T)
$ df
      Gene Indiv1.value Indiv2.value Indiv3.value AveControl.value
    1    A          0.1          0.2          0.5              0.1
    2    B          0.1          0.2          0.2              0.2
    3    C          0.1          0.9          0.8              0.4
$ df$new=apply(df[, 2:4] > df[, 5]+0.33, 1, function(x) ifelse(any(x), paste(colnames(df)[2:4][x], collapse=', '), ""))
$ df
      Gene Indiv1.value Indiv2.value Indiv3.value AveControl.value                        new
    1    A          0.1          0.2          0.5              0.1               Indiv3.value
    2    B          0.1          0.2          0.2              0.2                           
    3    C          0.1          0.9          0.8              0.4 Indiv2.value, Indiv3.value
ADD REPLY
0
Entering edit mode

Exactly.. thank you so much.

ADD REPLY

Login before adding your answer.

Traffic: 1817 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6