I was wondering if I can have your help and support for my following question: I have a raw excel file including 417 columns and 8900 rows. The first five columns contain different protein IDs. Column number 6 to 417 (end) contain different experiments. Also, 8900 rows contain proteins (each row is for one individual protein). Column number 6 to 417 is like a matrix which each cell contains a number ranging from 0 to 2000. A snapshot of my dataset is attached at the end of my post.
Now, for each individual raw (protein), I need to calculate as following:
1) First, for columns number 6 to 417, I need to find those columns (experiment) which all their rows are zero.
2) I need to remove such columns with all rows in zero.
3) Then, for each individual raw (protein) except the row header, I need to calculate “number of columns (experiments; from column 6 to 417) with non zero count, and name this column as “Found_experiments” in its header. For example, if a raw/protein has only three columns/experiments with counts of 20, 850, and 300, it should return 3 (not sum of 20, 850, and 300) as only three columns has numbers for that protein.
4) Then, for each row (except header row), I need to calculate “number of columns found with non zero count” from column 6 to 417 (as found in previous step) divided to total number of columns from 6 to 417, and name this column as “Ratio” in its header.
5) Finally, it returns a separate excel file including the same first to five columns (protein IDs) as the original dataset, then with only two extra columns containing the calculations from the steps 3 (name as “Found_experiments”) and 4 (name as “Ratio”).
I need to write the codes for all the above steps in R. As I have basic knowledge in R, I would highly appreciate if you could help me to sort it out. Many thanks for your great help, in advance.
Best regards, Farah