Question

Collapse rows of dataframe by AGI transcript numbers to AGI gene number

1

Entering edit mode

5.7 years ago

mccormack ▴ 90

I have a data frame in which the first column contains unique identifiers in the form of Arabidopsis AGI transcript numbers.

    target_id   est_counts  tpm
    AT1G01010.1 236     26.3 
    AT1G01020.2 55.0    12.2  
    AT1G01020.6 0         0 
    AT1G01020.1 25.2    4.8
    AT1G01020.4 5.8     0.2 
    AT1G01020.5 45.5    8.8 
    AT1G01020.3 3.5     0.5
    AT1G01030.2 13.25   1.3 
    AT1G01030.1 17.75   1.7

This table has data for three genes; AT1G01010, AT1G01020 and AT1G01030. However, two genes, AT1G01020 and AT1G01030 have multiple transcripts as indicated by the number to the right of the decimal. I would like to collapse the above table into three entries, one entry for each gene in which each column will contain the total for all transcripts of that gene. So, the resulting data frame will look like this:

    target_id   est_counts  tpm
    AT1G01010   236         26.3 
    AT1G01020   135.0       26.5
    AT1G01030   31           3.0

Here, the values for est_counts and tpm for all transcripts of a single gene have been summed. I have seen things like lapply(.sd), but I don't think that will work because the AGI numbers are not exactly equal and my real table has thousands of different AGI numbers.

R • 1.4k views

ADD COMMENT • link updated 5.7 years ago by zx8754 12k • written 5.7 years ago by mccormack ▴ 90

1

Entering edit mode

5.7 years ago

ATpoint 88k

Assume your data us called data1, use aggregate():

## remove everything after the "."
data1$target_id <- sapply(strsplit(data1$target_id, split="\\."), function(x)x[1])

## sum using the aggregate() function, which reads as:
## sum up est_counts (or tpm) as a function of duplicated target_id entries:
results <- data.frame(aggregate(est_counts ~ target_id, data1, sum),
                      aggregate(tpm ~ target_id, data1, sum)[,2])

ADD COMMENT • link 5.7 years ago by ATpoint 88k

0

Entering edit mode

See answer of zx8754 apparently aggregate can operate on the entire data frame in one go, no need for two separate commands.

ADD REPLY • link 5.7 years ago by ATpoint 88k

score 3 · Accepted Answer · 2019-11-05

Clean up target_id column, then aggregate other multiple columns at once, see:

# reproducible data
df1 <- read.table(text = "target_id   est_counts  tpm
AT1G01010.1 236     26.3 
AT1G01020.2 55.0    12.2  
AT1G01020.6 0         0 
AT1G01020.1 25.2    4.8
AT1G01020.4 5.8     0.2 
AT1G01020.5 45.5    8.8 
AT1G01020.3 3.5     0.5
AT1G01030.2 13.25   1.3 
AT1G01030.1 17.75   1.7", header = TRUE, stringsAsFactors = FALSE)

# we can imagine they are filenames and drop "extensions":
df1$target_id_clean <- tools::file_path_sans_ext(df1$target_id)

# then aggregate and sum on multiple columns
aggregate(.~target_id_clean, df1[-1], sum)
#   target_id_clean est_counts  tpm
# 1       AT1G01010        236 26.3
# 2       AT1G01020        135 26.5
# 3       AT1G01030         31  3.0