Question

R function for subsetting rows from one list based on certain conditions met from that list and another

0

2.0 years ago

jamzaleg84 ▴ 60

Hello everyone,

I'm not sure if I'm wording this question right, but I'll provide a detailed example below. Basically, I have a few RNA-seq comparison text files that I made comparing different conditions, and now I want to filter out some of the differentially expressed genes based on another comparison.

Let's say I have two files that are structured the same way: one column has the gene name (called gene_names), another has the log2 fold change values (called log2foldchange), and the rest have the q-value stats (I've already filtered all the files so that they have an FDR of less than 0.05 and a log2 fold change greater than 1). File one compares sample A vs B. File two compares C vs B. Initially I used the %in% operator like this:

AvsB_filterout <- AvsB_comparison[ ! AvsB_comparison$gene_names %in% CvsB_comparison$gene_names, ]

But now what I want to do is filter out only the genes which have a higher log2 fold change in the C vs B condition than in A vs B, because I found that a lot of the genes that were filtered out were much more highly expressed in A vs B than in C vs B.

Would anyone know how to rewrite my code so that I'm only filtering out genes which have a higher log2 fold change in C vs B than A vs B? I hope this makes sense.

Thank you so much,

Yonatan

subsetting r • 1.2k views

ADD COMMENT • link updated 2.0 years ago by Trivas ★ 1.7k • written 2.0 years ago by jamzaleg84 ▴ 60

score 1 · Answer 1 · 2022-04-20

1

2.0 years ago

Michael 54k

You can add additional conditions by using the element-wise logical operators & and |, for example (I have shortened your variable names):

AB.filtered <- AB[!(AB$gene_names %in% CB$gene_names) & (AB$log2fc > CB$log2fc), ]

You have to check two things before doing this:

1. Both data frames contain the exact same genes in the exact same order (all(AB$gene_names == CB$gene_names) must give TRUE without any warning). However, requiring this would break your set-based filtering of gene names.
2. The direction of the log fold change in the comparison is the way it should be, or possibly you would rather compare absolute fold changes.
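The first check can be sketched as follows. This is a minimal example on toy data, assuming unfiltered tables named AB and CB that each have a gene_names column; the gene names and values are made up for illustration. If both tables cover the same genes but in a different order, match() can reorder one to follow the other before any row-wise comparison.

```r
# Toy unfiltered tables (hypothetical values); same genes, different row order.
AB <- data.frame(gene_names = c("g1", "g2", "g3"),
                 log2fc     = c(2.5, 1.2, -0.3))
CB <- data.frame(gene_names = c("g3", "g1", "g2"),
                 log2fc     = c(0.1, 3.0, 0.4))

# Both tables must cover the same gene set before they can be aligned.
stopifnot(setequal(AB$gene_names, CB$gene_names))

# Reorder CB so that row i of CB describes the same gene as row i of AB.
CB <- CB[match(AB$gene_names, CB$gene_names), ]
rownames(CB) <- NULL

# Now the row-wise check from the answer holds.
aligned <- all(AB$gene_names == CB$gene_names)
```

Only once aligned is TRUE does an expression such as AB$log2fc > CB$log2fc compare each gene with itself.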

Edit: You can do it, but all the filtering needs to be done in a single step. Now, AB and CB must be the original results before any filtering.

AB.filtered <- AB[abs(AB$log2fc) >= 1 & AB$p.adj < 0.05 &
                  !(abs(CB$log2fc) >= 1 & CB$p.adj < 0.05) &
                  (AB$log2fc > CB$log2fc), ]

This is what you described: the gene is significant in AB AND NOT in CB AND AB$log2fc is > CB$log2fc
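To see what the single-step filter keeps, here is a small sketch on toy data. The tables, values, and the column names log2fc and p.adj are assumptions for illustration, and the rows are assumed to be in the same gene order already. The second filter, loose, is not from the answer: it is one possible reading of the original question, in which a gene is dropped only when it is significant in C vs B and has the higher fold change there.

```r
# Toy unfiltered results (hypothetical values), already in the same gene order.
AB <- data.frame(gene_names = c("g1", "g2", "g3", "g4"),
                 log2fc     = c(3.0, 2.0, 1.5, 0.2),
                 p.adj      = c(0.01, 0.01, 0.01, 0.01))
CB <- data.frame(gene_names = c("g1", "g2", "g3", "g4"),
                 log2fc     = c(1.5, 4.0, 0.3, 0.1),
                 p.adj      = c(0.01, 0.01, 0.20, 0.50))
stopifnot(all(AB$gene_names == CB$gene_names))

# Significance masks, one logical value per gene.
sig_AB <- abs(AB$log2fc) >= 1 & AB$p.adj < 0.05
sig_CB <- abs(CB$log2fc) >= 1 & CB$p.adj < 0.05

# The filter from the answer: significant in AB, not in CB, and higher in AB.
strict <- AB[sig_AB & !sig_CB & (AB$log2fc > CB$log2fc), ]

# A looser reading of the question (an assumption): drop a gene only when it
# is significant in CB AND its log2 fold change is higher there than in AB.
loose <- AB[sig_AB & !(sig_CB & CB$log2fc > AB$log2fc), ]
```

With these toy values, strict keeps only g3, while loose also keeps g1, which is significant in both comparisons but higher in A vs B. If the adjusted p-value column contains NA, the logical index will contain NA as well, so such rows may need to be handled first.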

ADD COMMENT • link 2.0 years ago by Michael 54k

1

Just now I noticed that condition 1 would not hold in your case, because you did the filtering before that step. If you want to use the boolean operators, you have to do all the filtering in a single go.

ADD REPLY • link 2.0 years ago by Michael 54k

0

Thank you so much for your prompt reply. The problem is that, because these are differentially expressed genes that have already been subsetted based on certain thresholds, the lists of genes are not going to be exactly the same (the naming convention is the same; it's just that there may be differentially expressed genes in one list that aren't in the other).

Is there any way to bypass this issue?

ADD REPLY • link 2.0 years ago by jamzaleg84 ▴ 60

1

You need to get the original unfiltered files. If you think about it a bit more, you will see that otherwise your request doesn't make sense at all. Hint: you want to compare stuff that is in A but not in B ... :)

ADD REPLY • link 2.0 years ago by Michael 54k

1

Look into the semi_join and left_join functions in dplyr (part of the tidyverse). semi_join will create a df with the rows whose gene names are shared between the two files, then you can left_join to add the additional l2fc and pvals from the second file. Then you would apply your filter requiring l2fcA > l2fcB.
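A rough sketch of that join-based approach, assuming dplyr is installed and using two small, already-filtered toy tables. The names AB_sig and CB_sig and their values are made up; the column names follow the question. Keeping the genes that appear only in A vs B alongside the shared genes that pass the comparison is an assumption about what the asker wants.

```r
library(dplyr)

# Toy already-filtered tables (hypothetical values); gene sets only overlap partly.
AB_sig <- data.frame(gene_names     = c("g1", "g2", "g3"),
                     log2foldchange = c(3.0, 2.0, 1.5))
CB_sig <- data.frame(gene_names     = c("g1", "g2", "g5"),
                     log2foldchange = c(1.5, 4.0, 2.2))

# Genes in AB_sig that also appear in CB_sig (only AB_sig's columns are kept).
shared <- semi_join(AB_sig, CB_sig, by = "gene_names")

# Attach the C vs B fold change; the suffixes separate the two same-named columns.
shared <- left_join(shared, CB_sig, by = "gene_names", suffix = c("_AB", "_CB"))

# Shared genes with a higher fold change in C vs B are the ones to remove.
drop_genes <- shared$gene_names[shared$log2foldchange_CB > shared$log2foldchange_AB]

# Everything else in the A vs B list stays.
AvsB_filterout <- AB_sig[!AB_sig$gene_names %in% drop_genes, ]
```

Because the comparison happens on the joined table, the two lists do not need the same genes or the same row order. In this toy example g2 is removed, g1 is kept because it is higher in A vs B, and g3 is kept because it never appears in the C vs B list.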