Question: How to create a summary statistics data table for omics data?
Wuschel (HUJI) wrote, 4 weeks ago:

Hi, I have a big data frame of omics data. Samples are named Genotype_Time_Replicate (e.g. AOX_1h_4). Each genotype has 4 replicates at each time point.

E.g. data set

df <- structure(list(AGI = c("ATCG01240", "ATCG01310", "ATMG00070"), aox2_0h__1 = c(15.79105291, 14.82652303, 14.70630068), aox2_0h__2 = c(16.06494674, 14.50610036, 14.52189807), aox2_0h__3 = c(14.64596287, 14.73266459, 13.07143141), aox2_0h__4 = c(15.71713641, 15.15430026, 16.32190068 ), aox2_12h__1 = c(14.99030606, 15.08046949, 15.8317372), aox2_12h__2 = c(15.15569857, 14.98996474, 14.64862254), aox2_12h__3 = c(15.12144791, 14.90111092, 14.59618842), aox2_12h__4 = c(14.25648197, 15.09832061, 14.64442686), aox2_24h__1 = c(15.23997241, 14.80968391, 14.22573239 ), aox2_24h__2 = c(15.57551513, 14.94861669, 15.18808897), aox2_24h__3 = c(15.04928714, 14.83758685, 13.06948037), aox2_24h__4 = c(14.79035385, 14.93873234, 14.70402827), aox5_0h__1 = c(15.8245918, 14.9351844, 14.67678306), aox5_0h__2 = c(15.75108628, 14.85867002, 14.45704948 ), aox5_0h__3 = c(14.36545859, 14.79296855, 14.82177912), aox5_0h__4 = c(14.80626019, 13.43330964, 16.33482718), aox5_12h__1 = c(14.66327372, 15.22571466, 16.17761867), aox5_12h__2 = c(14.58089039, 14.98545497, 14.4331578), aox5_12h__3 = c(14.58091828, 14.86139511, 15.83898617 ), aox5_12h__4 = c(14.48097297, 15.1420725, 13.39369381), aox5_24h__1 = c(15.41855602, 14.9890092, 13.92629626), aox5_24h__2 = c(15.78386057, 15.19372889, 14.63254456), aox5_24h__3 = c(15.55321382, 14.82013321, 15.74324956), aox5_24h__4 = c(14.53085803, 15.12196994, 14.81028556 ), WT_0h__1 = c(14.0535031, 12.45484834, 14.89102226), WT_0h__2 = c(13.64720361, 15.07144643, 14.99836235), WT_0h__3 = c(14.28295759, 13.75283646, 14.98220861), WT_0h__4 = c(14.79637443, 15.1108037, 15.21711524 ), WT_12h__1 = c(15.05711898, 13.33689777, 14.81064042), WT_12h__2 = c(14.83846779, 13.62497318, 14.76356308), WT_12h__3 = c(14.77215863, 14.72814995, 13.0835214), WT_12h__4 = c(14.70685445, 14.98527337, 16.12727292), WT_24h__1 = c(15.43813077, 14.56918572, 14.92146565 ), WT_24h__2 = c(16.05986898, 14.70583866, 15.64566505), WT_24h__3 = c(14.87721853, 13.22461859, 16.34119942), 
WT_24h__4 = c(14.92822133, 14.74382383, 12.79146694)), class = "data.frame", row.names = c(NA, -3L))

Please bear with me. I have to summarize the data for each time point (mean and SE) and run multiple comparisons (t-tests, i.e. WT vs aox2, WT vs aox5, aox2 vs aox5), then create a table like the one in the figure below.

[Image: Picture1 (example of the desired summary table)]

My real df has more genotypes and time points, so it is difficult to handle in Excel.

How can I do this in R? Could anyone help me with this?

Tags: statistics, rna-seq, R, proteomics

What have you tried? Have you done any basic R tutorials? No offense, but a quick Google search turns up a lot of R programming tutorials, including ones covering basic functions such as calculating a mean.

written 4 weeks ago by b.nota

Also please elaborate on what you actually want to do (why do you want to calculate a t-test?) as we might know better ways of doing it :-)

written 4 weeks ago by kristoffer.vittingseerup

Hi Kristoffer, it doesn't have to be a t-test. A post-hoc multiple comparison would also be fine. This is the table format my supervisor prefers. It would be great if you could help me with this.

written 4 weeks ago by Wuschel

A good starting point would be to reshape your data from wide-to-long.

written 4 weeks ago by zx8754

Hello BIOAWY!

It appears that your post has been cross-posted to another site: https://stackoverflow.com/questions/54764591/

This is typically not recommended as it runs the risk of annoying people in both communities.

written 4 weeks ago by Pierre Lindenbaum
Chirag Parsania (University of Macau) wrote, 4 weeks ago:

Here are a few tactics to simplify and visualise the data in R using the tidyverse. You can explore further by taking this as a starting point.

library(tidyverse)
ss <- df %>%
  as_tibble() %>%
  gather(key = "cond", value = "value", -AGI) %>%  ## wide to long format
  separate(cond, into = c("Genotype", "Time", "Replicate"), sep = "_+")  ## split the sample name into its attributes
ss


# A tibble: 108 x 5
   AGI       Genotype Time  Replicate value
   <chr>     <chr>    <chr> <chr>     <dbl>
 1 ATCG01240 aox2     0h    1          15.8
 2 ATCG01310 aox2     0h    1          14.8
 3 ATMG00070 aox2     0h    1          14.7
 4 ATCG01240 aox2     0h    2          16.1
 5 ATCG01310 aox2     0h    2          14.5
 6 ATMG00070 aox2     0h    2          14.5
 7 ATCG01240 aox2     0h    3          14.6
 8 ATCG01310 aox2     0h    3          14.7
 9 ATMG00070 aox2     0h    3          13.1
10 ATCG01240 aox2     0h    4          15.7
# … with 98 more rows
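
In newer tidyr releases (1.0.0 and later), gather() is superseded by pivot_longer(), which can reshape and split the sample names in one step. The following is a minimal sketch under that assumption, not the code used in this answer. `df_small` and `ss_small` are illustrative names, and the values are a small subset copied from the question's data so the block runs on its own.

```r
library(tidyr)

# Small subset of the question's data (two genes, three samples)
df_small <- data.frame(
  AGI        = c("ATCG01240", "ATCG01310"),
  aox2_0h__1 = c(15.79105291, 14.82652303),
  aox2_0h__2 = c(16.06494674, 14.50610036),
  WT_0h__1   = c(14.0535031, 12.45484834),
  stringsAsFactors = FALSE
)

# Wide to long, splitting each column name into three attributes.
# names_sep is treated as a regular expression, so "_+" also
# handles the double underscore before the replicate number.
ss_small <- pivot_longer(
  df_small,
  cols      = -AGI,
  names_to  = c("Genotype", "Time", "Replicate"),
  names_sep = "_+",
  values_to = "value"
)

ss_small
```

Note that pivot_longer() orders the rows gene by gene, whereas gather() orders them column by column, so the row order differs from the output above even though the content is the same.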


## average out replicates 
ss_m <- ss %>% group_by(AGI, Genotype , Time) %>% summarise(replicates_mean = mean(value))

ss_m
# A tibble: 27 x 4
# Groups:   AGI, Genotype [?]
   AGI       Genotype Time  replicates_mean
   <chr>     <chr>    <chr>           <dbl>
 1 ATCG01240 aox2     0h               15.6
 2 ATCG01240 aox2     12h              14.9
 3 ATCG01240 aox2     24h              15.2
 4 ATCG01240 aox5     0h               15.2
 5 ATCG01240 aox5     12h              14.6
 6 ATCG01240 aox5     24h              15.3
 7 ATCG01240 WT       0h               14.2
 8 ATCG01240 WT       12h              14.8
 9 ATCG01240 WT       24h              15.3
10 ATCG01310 aox2     0h               14.8
# … with 17 more rows
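
The question also asks for the standard error. A minimal sketch of how the same grouping could return it, assuming SE means the standard error of the mean (sd divided by the square root of the number of replicates). `ss_small` and `ss_se` are illustrative names, the values are copied from the question's data for one gene at 0h, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# ATCG01240 at 0h for two genotypes, values copied from the question
ss_small <- data.frame(
  AGI      = "ATCG01240",
  Genotype = rep(c("aox2", "WT"), each = 4),
  Time     = "0h",
  value    = c(15.79105291, 16.06494674, 14.64596287, 15.71713641,
               14.0535031, 13.64720361, 14.28295759, 14.79637443),
  stringsAsFactors = FALSE
)

ss_se <- ss_small %>%
  group_by(AGI, Genotype, Time) %>%
  summarise(
    replicates_mean = mean(value),
    n_rep           = n(),                    # replicates in this group
    se              = sd(value) / sqrt(n()),  # standard error of the mean
    .groups         = "drop"
  )

ss_se
```

Using n() rather than a hard-coded 4 keeps the calculation correct if some groups have missing replicates.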

Comparing time points:

bplot <- ss_m %>% ggplot() + geom_boxplot(aes(x = Time, y = replicates_mean , fill = Genotype)) +  theme_bw() + theme(text = element_text(size = 20))
ggsave(filename = "boxplot.png" ,plot = bplot)

[Image: boxplot.png (replicate means by time point, filled by genotype)]

Comparing genotypes:

bplot2 <- ss_m %>% ggplot() + geom_boxplot(aes(x = Genotype, y = replicates_mean , fill = Time)) +  theme_bw() + theme(text = element_text(size = 20))
ggsave(filename = "boxplot2.png" ,plot = bplot2)

[Image: boxplot2.png (replicate means by genotype, filled by time point)]

======================Update ===============================

Convert the data into the format you asked for (standard deviation and mean only).

ss_mm <- ss %>% group_by(AGI, Genotype , Time) %>% 
        summarise(replicates_mean = mean(value) , stddev = sd(value)) %>% ## add stddev and mean 
        unite(Genotype, Time , col = "Genotype_Time" , sep = "_") %>% ## unite genotype and time in a single column
        gather(key = summary_type , value = value , replicates_mean , stddev) %>% ## create summary_type variable 
        unite(Genotype_Time, summary_type , col = "Genotype_Time_summary_type",sep = "_") %>% ##create Genotype_Time_summary_type variable
        spread(Genotype_Time_summary_type , value) ## wide format 

## summary of final table. 
glimpse(ss_mm)

Observations: 3
Variables: 19
Groups: AGI [3]
$ AGI                      <chr> "ATCG01240", "ATCG01310", "ATMG00070"
$ aox2_0h_replicates_mean  <dbl> 15.55477, 14.80490, 14.65538
$ aox2_0h_stddev           <dbl> 0.6240735, 0.2689779, 1.3299868
$ aox2_12h_replicates_mean <dbl> 14.88098, 15.01747, 14.93024
$ aox2_12h_stddev          <dbl> 0.42239203, 0.09092439, 0.60146632
$ aox2_24h_replicates_mean <dbl> 15.16378, 14.88365, 14.29683
$ aox2_24h_stddev          <dbl> 0.33059885, 0.07035039, 0.90767009
$ aox5_0h_replicates_mean  <dbl> 15.18685, 14.50503, 15.07261
$ aox5_0h_stddev           <dbl> 0.7175443, 0.7168420, 0.8547323
$ aox5_12h_replicates_mean <dbl> 14.57651, 15.05366, 14.96086
$ aox5_12h_stddev          <dbl> 0.07459644, 0.16231378, 1.28919700
$ aox5_24h_replicates_mean <dbl> 15.32162, 15.03121, 14.77809
$ aox5_24h_stddev          <dbl> 0.5483318, 0.1643006, 0.7481768
$ WT_0h_replicates_mean    <dbl> 14.19501, 14.09748, 15.02218
$ WT_0h_stddev             <dbl> 0.4794059, 1.2639163, 0.1382836
$ WT_12h_replicates_mean   <dbl> 14.84365, 14.16882, 14.69625
$ WT_12h_stddev            <dbl> 0.1521183, 0.8097963, 1.2471750
$ WT_24h_replicates_mean   <dbl> 15.32586, 14.31087, 14.92495
$ WT_24h_stddev            <dbl> 0.5509899, 0.7280381, 1.5358987
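
The table above does not include the pairwise p-values requested in the question. One possible way to get them is sketched below, assuming Welch t-tests between every pair of genotypes within each gene and time point, followed by a Benjamini-Hochberg adjustment. `pairwise_p` is a hypothetical helper, not part of this answer or of any package, and `ss_small` holds values copied from the question's data for one gene at 0h so the block runs on its own.

```r
library(dplyr)

# ATCG01240 at 0h, values copied from the question's data
ss_small <- data.frame(
  AGI      = "ATCG01240",
  Time     = "0h",
  Genotype = rep(c("aox2", "aox5", "WT"), each = 4),
  value    = c(15.79105291, 16.06494674, 14.64596287, 15.71713641,
               15.8245918, 15.75108628, 14.36545859, 14.80626019,
               14.0535031, 13.64720361, 14.28295759, 14.79637443),
  stringsAsFactors = FALSE
)

# Hypothetical helper: Welch t-test for every pair of genotypes
# found in one gene/time-point subset
pairwise_p <- function(d) {
  pairs <- combn(unique(d$Genotype), 2, simplify = FALSE)
  data.frame(
    comparison = vapply(pairs, paste, character(1), collapse = "-"),
    p_value    = vapply(pairs, function(p) {
      t.test(d$value[d$Genotype == p[1]],
             d$value[d$Genotype == p[2]])$p.value
    }, numeric(1)),
    stringsAsFactors = FALSE
  )
}

pvals <- ss_small %>%
  group_by(AGI, Time) %>%
  group_modify(~ pairwise_p(.x)) %>%   # one set of comparisons per group
  ungroup() %>%
  mutate(p_adj = p.adjust(p_value, method = "BH"))  # adjust across all tests

pvals
```

On the full long table `ss`, the same pipeline would return one row per gene, time point and genotype pair, which could then be spread to wide format and joined to the mean and sd columns by AGI. With only 4 replicates per group, plain t-tests have limited power, so the post-hoc alternatives mentioned in the comments may be worth considering.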
modified 4 weeks ago • written 4 weeks ago by Chirag Parsania

Thank you Chirag. I greatly appreciate your help. This will be really helpful for me.

However, my supervisor needs a table like the one I illustrated, with multiple-comparison p-values. Could you guide me with this? Thanks again.

written 4 weeks ago by Wuschel

I'm sure you will be able to explore further from the code I posted. For more on p-values, error bars and other statistics, refer to this.

modified 4 weeks ago by genomax • written 4 weeks ago by Chirag Parsania

Check my updates in the answer.

written 4 weeks ago by Chirag Parsania

Thank you Chirag :)

written 4 weeks ago by Wuschel

Hello Wuschel,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

written 4 weeks ago by bioExplorer