sum of values in a column
1
0
Entering edit mode
3.2 years ago
pt.taklifi ▴ 60

I have a list named BRCA.b.1

seqnames start   end width strand   score.x   name score.y annotation percentGC percentAT
1     chr1     1  9999  9999      *  0.000000 BRCA_1 1.59268     Distal 0.3852295 0.3592814
2     chr1 10000 10099   100      * 17.716522 BRCA_1 1.59268     Distal 0.3852295 0.3592814
3     chr1 10100 10199   100      * 30.601267 BRCA_1 1.59268     Distal 0.3852295 0.3592814
4     chr1 10200 10299   100      *  9.663558 BRCA_1 1.59268     Distal 0.3852295 0.3592814
5     chr1 10300 10399   100      *  4.831779 BRCA_1 1.59268     Distal 0.3852295 0.3592814
6     chr1 10400 10499   100      *  8.052965   <NA>      NA       <NA>        NA        NA

and I have another list with distinct values of column name called BRCA for each specific name in the column name I want to calculate the sum of scores for corresponding rows and store them in BRCA$coverage. so for the first value in column name that is BRCA_1 I would like to get 62.81313

so far I wrote this code and it works right

for ( i in 1: nrow(BRCA))
{
BRCA$coverage[i]<-  sum(BRCA.b.1[which(BRCA.b.1$name== BRCA$name[i]), 6])
}

but because BRCA.b.1 is a big list it takes a long time to run

can you suggest a more efficient way to do this ?

r • 688 views
ADD COMMENT
0
Entering edit mode

with tsv-utils:

$ cat list1.txt 

seqnames    start   end width   strand  score.x name    score.y annotation  percentGC   percentAT
chr1    1   9999    9999    *   0.000000    BRCA_1  1.59268 Distal  0.3852295   0.3592814
chr1    10000   10099   100 *   17.716522   BRCA_1  1.59268 Distal  0.3852295   0.3592814
chr1    10100   10199   100 *   30.601267   BRCA_1  1.59268 Distal  0.3852295   0.3592814
chr1    10200   10299   100 *   9.663558    BRCA_1  1.59268 Distal  0.3852295   0.3592814
chr1    10300   10399   100 *   4.831779    BRCA_1  1.59268 Distal  0.3852295   0.3592814
chr1    10400   10499   100 *   8.052965    <NA>    NA  <NA>    NA  NA

$ cat list2.txt 

old_name    new_name
BRCA_1  BRCA

output:

$ tsv-join -H -f list2.txt -k 1 -d 7  list1.txt -a 2 | tsv-summarize -H --sum score.x:BRCA_average

BRCA_average
62.813126
ADD REPLY
2
Entering edit mode
3.2 years ago
Ram 43k

This would benefit from dplyr, although there's definitely a base R solution.

library(dplyr)
BRCA.b.1 %>% filter(name %in% BRCA$name) %>% group_by(name) %>% summarise(total_score = sum(score), .groups = 'keep')
ADD COMMENT
0
Entering edit mode

thank you @_r_am for your response I tried the command you mentioned but I get the following error

Error: Problem with `summarise()` input `total_score`.
x invalid 'type' (closure) of argument
ℹ Input `total_score` is `sum(score)`.
ℹ The error occurred in group 1: name = "BRCA_1".
Run `rlang::last_error()` to see where the error occurred.
ADD REPLY
2
Entering edit mode

Usually it occurs when the dataframe passed to summarise() has no column named 'score' but there is a 'score' function available in your environment. In this case code does not throws an error like "that there is no such column" and dies trying to compute sum of a function 'score' instead. I conclude from your example above that it should really be sum(score.x).

ADD REPLY
0
Entering edit mode

thank you this worked very well and fast!

ADD REPLY

Login before adding your answer.

Traffic: 2749 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6