Question

How to take sum of cells in a dataframe based on another column

0

Entering edit mode

6.5 years ago

lauren.fehrman • 0

I would like to determine the length of all mRNAs. I have downloaded a bed file of exons and split the column in R so that all exons belonging to the same transcript will have the same character in V4.

I have also calculated the length of each exon in V7

exons[c(1:30),] V1 V2 V3 V4 V5 V6 V7 V8 1 chr1 67092175 67093604 NR_075077.1 0 - 1429 _0_0_chr1_67092176_r 2 chr1 67096251 67096321 NR_075077.1 0 - 70 _1_0_chr1_67096252_r 3 chr1 67103237 67103382 NR_075077.1 0 - 145 _2_0_chr1_67103238_r 4 chr1 67111576 67111644 NR_075077.1 0 - 68 _3_0_chr1_67111577_r 5 chr1 67113613 67113756 NR_075077.1 0 - 143 _4_0_chr1_67113614_r 6 chr1 67115351 67115464 NR_075077.1 0 - 113 _5_0_chr1_67115352_r 7 chr1 67125751 67125909 NR_075077.1 0 - 158 _6_0_chr1_67125752_r 8 chr1 67127165 67127257 NR_075077.1 0 - 92 _7_0_chr1_67127166_r 9 chr1 67131141 67131227 NR_075077.1 0 - 86 _8_0_chr1_67131142_r 10 chr1 67134929 67134971 NR_075077.1 0 - 42 _9_0_chr1_67134930_r 11 chr1 67092175 67093604 NM_001276352.1 0 - 1429 _0_0_chr1_67092176_r 12 chr1 67096251 67096321 NM_001276352.1 0 - 70 _1_0_chr1_67096252_r 13 chr1 67103237 67103382 NM_001276352.1 0 - 145 _2_0_chr1_67103238_r 14 chr1 67111576 67111644 NM_001276352.1 0 - 68 _3_0_chr1_67111577_r 15 chr1 67115351 67115464 NM_001276352.1 0 - 113 _4_0_chr1_67115352_r 16 chr1 67125751 67125909 NM_001276352.1 0 - 158 _5_0_chr1_67125752_r 17 chr1 67127165 67127257 NM_001276352.1 0 - 92 _6_0_chr1_67127166_r 18 chr1 67131141 67131227 NM_001276352.1 0 - 86 _7_0_chr1_67131142_r 19 chr1 67134929 67134971 NM_001276352.1 0 - 42 _8_0_chr1_67134930_r 20 chr1 67092175 67093604 NM_001276351.1 0 - 1429 _0_0_chr1_67092176_r 21 chr1 67095234 67095421 NM_001276351.1 0 - 187 _1_0_chr1_67095235_r 22 chr1 67096251 67096321 NM_001276351.1 0 - 70 _2_0_chr1_67096252_r 23 chr1 67115351 67115464 NM_001276351.1 0 - 113 _3_0_chr1_67115352_r 24 chr1 67125751 67125909 NM_001276351.1 0 - 158 _4_0_chr1_67125752_r 25 chr1 67127165 67127257 NM_001276351.1 0 - 92 _5_0_chr1_67127166_r 26 chr1 67131141 67131227 NM_001276351.1 0 - 86 _6_0_chr1_67131142_r 27 chr1 67134929 67134971 NM_001276351.1 0 - 42 _7_0_chr1_67134930_r 28 chr1 67092175 67093604 XM_011541469.1 0 - 1429 _0_0_chr1_67092176_r 29 chr1 67095234 67095421 XM_011541469.1 0 - 187 _1_0_chr1_67095235_r 30 chr1 67096251 67096321 XM_011541469.1 0 - 70 _2_0_chr1_67096252_r

What I want to do is sum all of the values of V7 where V4 values are equal

So where V4 is equal, like it is below, add all the values in V7 to get 2,346

V4 V5 V6 V7 V8
NR_075077.1 0 - 1429 _0_0_chr1_67092176_r NR_075077.1 0 - 70 _1_0_chr1_67096252_r NR_075077.1 0 - 145 _2_0_chr1_67103238_r NR_075077.1 0 - 68 _3_0_chr1_67111577_r NR_075077.1 0 - 143 _4_0_chr1_67113614_r NR_075077.1 0 - 113 _5_0_chr1_67115352_r NR_075077.1 0 - 158 _6_0_chr1_67125752_r NR_075077.1 0 - 92 _7_0_chr1_67127166_r NR_075077.1 0 - 86 _8_0_chr1_67131142_r NR_075077.1 0 - 42 _9_0_chr1_67134930_r

R • 1.3k views

ADD COMMENT • link updated 6.5 years ago by Kristoffer Vitting-Seerup ★ 4.0k • written 6.5 years ago by lauren.fehrman • 0

0

Entering edit mode

This is completely unreadable - could you either make a (much) smaller example or put the data in a google sheet?

ADD REPLY • link 6.5 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Entering edit mode

I have put a hypothetical example below

ADD REPLY • link 6.5 years ago by lauren.fehrman • 0

score 0 · Answer 1 · 2017-10-19

My apologies. I didn't realize that the formatting would get so messed up. Essentially, what I a trying to is get the length of all transcripts, given the lengths of their exons.

Input:

transcript ID length

a...................... 2

a...................... 3

a...................... 4

a ......................5

a ......................6

b ......................1

b ......................4

b...................... 7

c ......................2

c ......................4

c ......................7

Desired Output

a .................... 20

b......................12

c......................13

score 0 · Answer 2 · 2017-10-19

0

Entering edit mode

6.5 years ago

Kristoffer Vitting-Seerup ★ 4.0k

The easiest is to split the data.frame into a list and use an apply family function to sum.

Here is a code snippit: txLen <- sapply( split(df$length, df$transcript_id), sum )

ADD COMMENT • link 6.5 years ago by Kristoffer Vitting-Seerup ★ 4.0k