Question: Collapse rows of dataframe by AGI transcript numbers to AGI gene number
1
gravatar for mccormack
9 months ago by
mccormack40
United States
mccormack40 wrote:

I have a data frame in which the first column contains unique identifiers in the form of Arabidopsis AGI transcript numbers.

    target_id   est_counts  tpm
    AT1G01010.1 236     26.3 
    AT1G01020.2 55.0    12.2  
    AT1G01020.6 0         0 
    AT1G01020.1 25.2    4.8
    AT1G01020.4 5.8     0.2 
    AT1G01020.5 45.5    8.8 
    AT1G01020.3 3.5     0.5
    AT1G01030.2 13.25   1.3 
    AT1G01030.1 17.75   1.7

This table has data for three genes; AT1G01010, AT1G01020 and AT1G01030. However, two genes, AT1G01020 and AT1G01030 have multiple transcripts as indicated by the number to the right of the decimal. I would like to collapse the above table into three entries, one entry for each gene in which each column will contain the total for all transcripts of that gene. So, the resulting data frame will look like this:

    target_id   est_counts  tpm
    AT1G01010   236         26.3 
    AT1G01020   135.0       26.5
    AT1G01030   31           3.0

Here, the values for est_counts and tpm for all transcripts of a single gene have been summed. I have seen things like lapply(.sd), but I don't think that will work because the AGI numbers are not exactly equal and my real table has thousands of different AGI numbers.

R • 201 views
ADD COMMENTlink modified 9 months ago by zx87549.4k • written 9 months ago by mccormack40
3
gravatar for zx8754
9 months ago by
zx87549.4k
London
zx87549.4k wrote:

Clean up target_id column, then aggregate other multiple columns at once, see:

# reproducible data
df1 <- read.table(text = "target_id   est_counts  tpm
AT1G01010.1 236     26.3 
AT1G01020.2 55.0    12.2  
AT1G01020.6 0         0 
AT1G01020.1 25.2    4.8
AT1G01020.4 5.8     0.2 
AT1G01020.5 45.5    8.8 
AT1G01020.3 3.5     0.5
AT1G01030.2 13.25   1.3 
AT1G01030.1 17.75   1.7", header = TRUE, stringsAsFactors = FALSE)

# we can imagine they are filenames and drop "extensions":
df1$target_id_clean <- tools::file_path_sans_ext(df1$target_id)

# then aggregate and sum on multiple columns
aggregate(.~target_id_clean, df1[-1], sum)
#   target_id_clean est_counts  tpm
# 1       AT1G01010        236 26.3
# 2       AT1G01020        135 26.5
# 3       AT1G01030         31  3.0
ADD COMMENTlink written 9 months ago by zx87549.4k

Great ! Does the job very well. Thank you.

ADD REPLYlink written 9 months ago by mccormack40
1
gravatar for ATpoint
9 months ago by
ATpoint36k
Germany
ATpoint36k wrote:

Assume your data us called data1, use aggregate():

## remove everything after the "."
data1$target_id <- sapply(strsplit(data1$target_id, split="\\."), function(x)x[1])

## sum using the aggregate() function, which reads as:
## sum up est_counts (or tpm) as a function of duplicated target_id entries:
results <- data.frame(aggregate(est_counts ~ target_id, data1, sum),
                      aggregate(tpm ~ target_id, data1, sum)[,2])
ADD COMMENTlink modified 9 months ago • written 9 months ago by ATpoint36k

See answer of zx8754 apparently aggregate can operate on the entire data frame in one go, no need for two separate commands.

ADD REPLYlink written 9 months ago by ATpoint36k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1469 users visited in the last hour