How to write it in R language?
1
0
Entering edit mode
4 months ago
Farah ▴ 70

Hello,

I was wondering if I can have your help and support for my following question: I have a raw excel file including 417 columns and 8900 rows. The first five columns contain different protein IDs. Column number 6 to 417 (end) contain different experiments. Also, 8900 rows contain proteins (each row is for one individual protein). Column number 6 to 417 is like a matrix which each cell contains a number ranging from 0 to 2000. A snapshot of my dataset is attached at the end of my post.

Now, for each individual raw (protein), I need to calculate as following:

1) First, for columns number 6 to 417, I need to find those columns (experiment) which all their rows are zero.

2) I need to remove such columns with all rows in zero.

3) Then, for each individual raw (protein) except the row header, I need to calculate “number of columns (experiments; from column 6 to 417) with non zero count, and name this column as “Found_experiments” in its header. For example, if a raw/protein has only three columns/experiments with counts of 20, 850, and 300, it should return 3 (not sum of 20, 850, and 300) as only three columns has numbers for that protein.

4) Then, for each row (except header row), I need to calculate “number of columns found with non zero count” from column 6 to 417 (as found in previous step) divided to total number of columns from 6 to 417, and name this column as “Ratio” in its header.

5) Finally, it returns a separate excel file including the same first to five columns (protein IDs) as the original dataset, then with only two extra columns containing the calculations from the steps 3 (name as “Found_experiments”) and 4 (name as “Ratio”).

I need to write the codes for all the above steps in R. As I have basic knowledge in R, I would highly appreciate if you could help me to sort it out. Many thanks for your great help, in advance.

Best regards, Farah

R function programming coding • 285 views
1
Entering edit mode

it's hard to understand without seeing a snapshot of your data.

0
Entering edit mode

Thanks for your comment. I have attached a snapshot of my data at the end of my post. However, it contains row number 8900 and cc columns up to cc412.

0
Entering edit mode

What have you tried?

0
Entering edit mode

if you are learning R, consider doing some tutorials first

if you need this for your work/task/thesis, why not using Excel directly?

0
Entering edit mode

Thanks for your suggestion. I was wondering if I can have something as below using Excel directly.

           cc1 cc2  cc3 cc4     Found_experiments           Ratio
gene1   a   3   0   6   0             2                     2 ÷ 4
gene2   b   0   0   0   23            1                     1 ÷ 4
gene3   c   32  8   15  0             3                     3 ÷ 4
gene4   d   7   12  32  54            4                     4 ÷ 4
gene5   e   9   0   0   0             1                     1 ÷ 4


Thank you.

1
Entering edit mode

check the COUNTIF function

0
Entering edit mode

Great. Thanks a lot.

1
Entering edit mode
4 months ago
evolozzy ▴ 80
1. Export your data as CSV.
2. Import CSV to R with df <- read.csv("filename")
3. Make a list containing the names of first 5 columns ListOfTheNamesOfTheFirstFiveColumns <- c("geneSymbol",......)
4. Melt your data frame melted<-melt(df, id=ListOfTheNamesOfTheFirstFiveColumns,variable.name="Variable",value.name="Count")
5. Subset melted data frame subsetted<-subset(melted, Count!=0)
6. Calculate the number you ask in 3 output.df<-ddply(subsetted,ListOfTheNamesOfTheFirstFiveColumns,summarise, Found_experiments=length(Count))
7. Calculate the number you ask in 4 Ratio=output.df$Found_experiments/sum(output.df$Found_experiments)
8. Combine them output.df<-cbind(output.df,Ratio)
9. Write it to a file write.csv(output.df,"newfilename")
0
Entering edit mode

Dear evolozzy, Thank you so much for your great guide. I am running your codes, but got Error when running:

newColumn<-as.data.frame(Ratio=output.df$Found_experiments/sum(output.df$Found_experiments))
Error in as.data.frame(Ratio = output.df$Found_experiments/sum(output.df$Found_experiments)) :
argument "x" is missing, with no default


Also, the numbers for "Found_experiments" column looks strange (all of them are in the below range):

> output.df$Found_experiments  [1] 330 337 335 342 342 340 341 315 331 342 338 342 331 342 341 342 342 340 341 341 340 342 342 340 339 341 339 339 342 340 341 341 342 337 342 325 315 300 338 340 [41] 340 337 342 342 342 328 335 323 342 342 338 339 329 298 320 342 342 340 334 338 325 316 308 332 330 340 342 342 342 304 340 331 340 335 340 327 341 341 340 342 [81] 330 342 342 341 340 342 342 341 330 333 342 342 314 335 290 301 340 328 315 318 325 331 342 342 342 342 342 342 342 342 342 341 342 342 340 340 342 341 342 342 While many of them should be 2, 5, 10, .... Moreover, running >sum(output.df$Found_experiments) gave me [1] 2896905, while it should be 412 (if there was not any columns with all zero cells). As in section 4, I asked for “number of columns found with non zero count” from column 6 to 417 divided to number of columns from 6 to 417. By this, I mean, for each row, summing the number of cells which are non zero, not their contents. I need something like:

           cc1 cc2  cc3 cc4     Found_experiments           Ratio
gene1   a   3   0   6   0             2                     2 ÷ 4
gene2   b   0   0   0   23            1                     1 ÷ 4
gene3   c   32  8   15  0             3                     3 ÷ 4
gene4   d   7   12  32  54            4                     4 ÷ 4
gene5   e   9   0   0   0             1                     1 ÷ 4


1
Entering edit mode

I've found some errors and edited the answer.

I still don't understand what the Ratio is. Is it the number of non-zero cells divided by the total possible cells?

0
Entering edit mode

Many thanks for your great help. Yes, the ratio is the number of non-zero cells divided by the total number of cells per row. Thanks a lot.