How to extract a subset of genes that correspond to the top 5% most expressing genes in R?
1
0
Entering edit mode
4.8 years ago
chk • 0

I have a dataset containing data of normalized gene expression. I have in total 18703 genes in a wt strain against 5 different conditions (WT, A, B, C ,D), and each experiment has 10 biological replicates (WT1-10, A1-10, B1-10, C1-10, D1-10) My dataframe is 18704x61, and for the first two genes it looks like this (only for the wt strain and the first replicate of the A condition). (I used --- here just to get each condition over the corresponding value, ofc there aren't in my dataframe)

Gene WT1 ---- WT2---- WT3----- WT4 ---- WT5 ----- WT6 ---- WT7 -----WT8 --- WT9 ----- WT10 --- A1 ---....etc 1 5.727474 6.028452 6.051629 7.797896 7.716835 7.741452 7.908992 8.078486 8.291032 8.513065 5.548691 2 5.681870 6.808582 5.613130 6.022544 5.462886 6.606438 6.029086 6.982468 6.542026 6.455662 4.940125

I want to extract a subset of these genes that correspond to the top 5% most expressing. Any ideas?

R gene • 1.3k views
ADD COMMENT
0
Entering edit mode

criteria for most expressing?

ADD REPLY
3
Entering edit mode
4.8 years ago

There's a few ways to do this. I'll use the mean expression of the gene in all samples as the filtering criteria.

make some example data

counts <- replicate(6, rnorm(100, 10, 2))
counts <- cbind(seq_len(nrow(counts)), counts)
colnames(counts) <- c("Gene", sprintf("WT%s", 1:3), sprintf("A%s", 1:3))

> head(counts)
     Gene       WT1       WT2      WT3        A1        A2        A3
[1,]    1  9.778688 11.711185 11.67328 12.439523  7.627186  9.312048
[2,]    2 10.434498  9.267333 12.06317  9.199345 14.194996 10.491086
[3,]    3 12.381237 10.719864 10.68293 11.257819 11.740985  7.267453
[4,]    4 12.025958  9.169535 12.61042  8.477079  9.312525  7.915862
[5,]    5  8.553734 10.920719  8.91527  9.773861 10.481839 10.785579
[6,]    6 14.216497 11.359651 11.74663  8.490843  7.012496 10.871232

base R answer

head(counts[sort(-rowMeans(counts[,2:ncol(counts)]), index.return=TRUE)$ix, ], round(nrow(counts)*0.05))

     Gene       WT1      WT2      WT3        A1        A2        A3
[1,]   74  9.330106 12.58548 13.47696 10.268277 14.068676 12.175721
[2,]   71 13.755768 10.16943 10.60215 11.787978  9.027301 14.051500
[3,]   85 12.206216 13.83394 12.00315 11.316372 13.329750  6.243179
[4,]   12 11.890248 11.19514 11.38160 13.358240 10.482456 10.474823
[5,]   35 12.066719 13.89174 10.72214  7.978785 11.687865 11.765423

A tidyverse answer

library("tidyverse")

counts %>%
  as_tibble %>%
  mutate(rowmean=rowMeans(.[, 2:ncol(.)])) %>%
  slice_max(rowmean, prop=0.05)

# A tibble: 5 x 8
   Gene   WT1   WT2   WT3    A1    A2    A3 rowmean
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1    74  9.33  12.6  13.5 10.3  14.1  12.2     12.0
2    71 13.8   10.2  10.6 11.8   9.03 14.1     11.6
3    85 12.2   13.8  12.0 11.3  13.3   6.24    11.5
4    12 11.9   11.2  11.4 13.4  10.5  10.5     11.5
5    35 12.1   13.9  10.7  7.98 11.7  11.8     11.4
ADD COMMENT
0
Entering edit mode

That actually worked, using the head(). Thank you very much for your time. I was actually trying to make it work with something like

n<- 5

counts [counts[,2:61] > quantile(counts[,2:61],prob=1-n/100),]

but it was giving me this error

Error in [.data.frame(x, order(x, na.last = na.last, decreasing = decreasing)) : undefined columns selected

if u have any extra idea on that, I would appreciate it, too :)

ADD REPLY
0
Entering edit mode

Or with something like that

counts[which(rowSums(counts[,2:61])> ??? ),]

But I would stuck there, to the "???" because I didn't know how to enter this "top 5% most expressing genes" in a command with ">"

ADD REPLY

Login before adding your answer.

Traffic: 4160 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6