Question

How to extract a subset of genes that correspond to the top 5% most expressing genes in R?

0

Entering edit mode

4.9 years ago

chk • 0

I have a dataset containing data of normalized gene expression. I have in total 18703 genes in a wt strain against 5 different conditions (WT, A, B, C ,D), and each experiment has 10 biological replicates (WT1-10, A1-10, B1-10, C1-10, D1-10) My dataframe is 18704x61, and for the first two genes it looks like this (only for the wt strain and the first replicate of the A condition). (I used --- here just to get each condition over the corresponding value, ofc there aren't in my dataframe)

Gene WT1 ---- WT2---- WT3----- WT4 ---- WT5 ----- WT6 ---- WT7 -----WT8 --- WT9 ----- WT10 --- A1 ---....etc 1 5.727474 6.028452 6.051629 7.797896 7.716835 7.741452 7.908992 8.078486 8.291032 8.513065 5.548691 2 5.681870 6.808582 5.613130 6.022544 5.462886 6.606438 6.029086 6.982468 6.542026 6.455662 4.940125

I want to extract a subset of these genes that correspond to the top 5% most expressing. Any ideas?

R gene • 1.4k views

ADD COMMENT • link updated 4.9 years ago by rpolicastro 13k • written 4.9 years ago by chk • 0

0

Entering edit mode

criteria for most expressing?

ADD REPLY • link 4.9 years ago by cpad0112 21k

score 3 · Answer 1 · 2020-12-13

There's a few ways to do this. I'll use the mean expression of the gene in all samples as the filtering criteria.

make some example data

counts <- replicate(6, rnorm(100, 10, 2))
counts <- cbind(seq_len(nrow(counts)), counts)
colnames(counts) <- c("Gene", sprintf("WT%s", 1:3), sprintf("A%s", 1:3))

> head(counts)
     Gene       WT1       WT2      WT3        A1        A2        A3
[1,]    1  9.778688 11.711185 11.67328 12.439523  7.627186  9.312048
[2,]    2 10.434498  9.267333 12.06317  9.199345 14.194996 10.491086
[3,]    3 12.381237 10.719864 10.68293 11.257819 11.740985  7.267453
[4,]    4 12.025958  9.169535 12.61042  8.477079  9.312525  7.915862
[5,]    5  8.553734 10.920719  8.91527  9.773861 10.481839 10.785579
[6,]    6 14.216497 11.359651 11.74663  8.490843  7.012496 10.871232

base R answer

head(counts[sort(-rowMeans(counts[,2:ncol(counts)]), index.return=TRUE)$ix, ], round(nrow(counts)*0.05))

     Gene       WT1      WT2      WT3        A1        A2        A3
[1,]   74  9.330106 12.58548 13.47696 10.268277 14.068676 12.175721
[2,]   71 13.755768 10.16943 10.60215 11.787978  9.027301 14.051500
[3,]   85 12.206216 13.83394 12.00315 11.316372 13.329750  6.243179
[4,]   12 11.890248 11.19514 11.38160 13.358240 10.482456 10.474823
[5,]   35 12.066719 13.89174 10.72214  7.978785 11.687865 11.765423

A tidyverse answer

library("tidyverse")

counts %>%
  as_tibble %>%
  mutate(rowmean=rowMeans(.[, 2:ncol(.)])) %>%
  slice_max(rowmean, prop=0.05)

# A tibble: 5 x 8
   Gene   WT1   WT2   WT3    A1    A2    A3 rowmean
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1    74  9.33  12.6  13.5 10.3  14.1  12.2     12.0
2    71 13.8   10.2  10.6 11.8   9.03 14.1     11.6
3    85 12.2   13.8  12.0 11.3  13.3   6.24    11.5
4    12 11.9   11.2  11.4 13.4  10.5  10.5     11.5
5    35 12.1   13.9  10.7  7.98 11.7  11.8     11.4