Gene manipulation in R
21 months ago
aj123 ▴ 120

Hi,

I have a table like the one below:

patient    geneid    base   count
"ptp_1"    "BRCA1"   "C"    123
"ptp_1"    "BRCA1"   "G"    2
"ptp_1"    "BRCA1"   "T"    55
"ptp_2"    "BRCA2"   "A"    303
"ptp_2"    "BRCA2"   "C"    11
"ptp_2"    "BRCA2"   "G"    1

How can I generate a wide data.frame that has one row per {patient x gene} and one column for each base's count?

For example:

 participant   gene       A_count    C_count     G_count     T_count
 "ptp_1"       "BRCA1"     <values>
 "ptp_1"       "BRCA2"
 "ptp_2"       "BRCA1"
 "ptp_2"       "BRCA2"

I tried the following in dplyr but am not getting the exact result:

clean_df_mut_counts_wide <- clean_df_mut_counts %>% filter(base == "A") %>% group_by(participant) %>% group_by(gene) %>% summarise(A_count = sum(as.factor(base == "A")))
21 months ago
Basti ★ 2.0k

Using tidyr:

library(dplyr)  # provides %>%
library(tidyr)  # provides pivot_wider()

clean_df_mut_counts %>% pivot_wider(
  names_from = base,
  values_from = count
)
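
If the column names from the desired output (A_count, C_count, ...) are wanted directly, pivot_wider() can build them. This is a minimal sketch, assuming the input looks like the table in the question; the data frame is retyped from that example, and filling unobserved bases with 0 is an assumption, not something stated in the thread.

```r
library(dplyr)
library(tidyr)

# Example data retyped from the question
clean_df_mut_counts <- data.frame(
  patient = c("ptp_1", "ptp_1", "ptp_1", "ptp_2", "ptp_2", "ptp_2"),
  geneid  = c("BRCA1", "BRCA1", "BRCA1", "BRCA2", "BRCA2", "BRCA2"),
  base    = c("C", "G", "T", "A", "C", "G"),
  count   = c(123, 2, 55, 303, 11, 1),
  stringsAsFactors = FALSE
)

wide_counts <- clean_df_mut_counts %>%
  pivot_wider(
    names_from  = base,
    values_from = count,
    names_glue  = "{base}_count",  # A -> A_count, C -> C_count, ...
    values_fill = 0,               # assumption: an unobserved base means a count of 0
    names_sort  = TRUE             # order the new columns alphabetically
  )

print(wide_counts)
```

Only patient/gene combinations present in the input get a row here. If rows for unobserved combinations are needed as well, tidyr::complete() on the long table before pivoting is one option.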

Thanks! I'm trying to calculate the base frequencies and the second most frequent base like this, but it gives me a table without the patient and gene columns:

clean_df_mut_counts_wide %>% 
    group_by(A, T, C, G) %>% 
      summarise(n = n()) %>% 
        mutate(freq= n/sum(n)) %>%
            top_n(n=2)

I don't see which frequency you would like in the output. Could you give an example?

patient    gene      A_count   C_count   G_count   T_count   A_freq   C_freq   G_freq   T_freq
"ptp_1"    "BRCA1"   20        345       777       123
"ptp_1"    "BRCA2"   30        33        320       43
"ptp_2"    "BRCA1"   400       203       76        56
"ptp_2"    "BRCA2"   82        100       0         102

I would like the frequencies of the bases as in the table above, and also to find the second most frequently occurring base for each patient. Hope this clarifies. Thank you.


I was able to achieve the above with the following, after pivoting:

with_freq <- wide %>% 
  group_by(A, T, C, G) %>% 
  #summarise(n = n()) %>% 
  mutate(A_freq= A/sum(A+C+G+T), C_freq= C/sum(A+C+G+T), G_freq=G/sum(A+C+G+T), T_freq= T/sum(A+C+G+T))

I'm still not sure how to find the 2nd most frequently occurring base per patient. I tried the following, but it does not work:

with_freq_2nd_highest <- with_freq %>% slice(2)
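
A note on the frequency step above: grouping by the count columns only behaves as intended while every combination of A, T, C, G is unique. If two rows share the same counts, sum() adds them together within that group. A row-wise total avoids the grouping entirely. This is a minimal sketch, assuming `wide` has the columns A, C, G, T with no missing values; the values are retyped from the example table earlier in the thread.

```r
library(dplyr)

# Example data retyped from the table posted in the thread
wide <- data.frame(
  patient = c("ptp_1", "ptp_1", "ptp_2", "ptp_2"),
  gene    = c("BRCA1", "BRCA2", "BRCA1", "BRCA2"),
  A       = c(20, 30, 400, 82),
  C       = c(345, 33, 203, 100),
  G       = c(777, 320, 76, 0),
  T       = c(123, 43, 56, 102),
  stringsAsFactors = FALSE
)

with_freq <- wide %>%
  mutate(
    total  = A + C + G + T,  # vectorised addition, so one total per row
    A_freq = A / total,
    C_freq = C / total,
    G_freq = G / total,
    T_freq = T / total
  )

print(with_freq)
```

Because nothing is grouped or summarised, the patient and gene columns are carried through unchanged.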
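# maxn(n) returns a function giving the position of the n-th largest value of x
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
# Columns 3:6 hold the base counts; take the name of the 2nd largest in each row
max2 <- colnames(wide[, 3:6])[apply(wide[, 3:6], 1, maxn(2))]
with_freq$max2 <- max2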
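
An alternative that stays in dplyr is to rank the bases in the original long table before pivoting. slice(2) picks the second row of each group, so it only helps once the rows within a group are ordered by count. This is a minimal sketch, assuming the long table from the question (retyped below) and that ties may be broken arbitrarily; the output column names max2 and max2_count are illustrative choices.

```r
library(dplyr)

# Example data retyped from the question
clean_df_mut_counts <- data.frame(
  patient = c("ptp_1", "ptp_1", "ptp_1", "ptp_2", "ptp_2", "ptp_2"),
  geneid  = c("BRCA1", "BRCA1", "BRCA1", "BRCA2", "BRCA2", "BRCA2"),
  base    = c("C", "G", "T", "A", "C", "G"),
  count   = c(123, 2, 55, 303, 11, 1),
  stringsAsFactors = FALSE
)

second_base <- clean_df_mut_counts %>%
  group_by(patient, geneid) %>%
  arrange(desc(count), .by_group = TRUE) %>%  # highest count first within each group
  slice(2) %>%                                # second row = second most frequent base
  ungroup() %>%
  select(patient, geneid, max2 = base, max2_count = count)

print(second_base)
```

This groups by patient and gene. For one result per patient across genes, the counts would first need to be summed per patient and base, which the thread does not specify.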