Question

Create a custom filter for pachterlab Sleuth package for RNAseq data

0

Entering edit mode

3.9 years ago

jonathanU ▴ 40

My question has two parts. If I should have broken my question up into two posts, please let me know and I will do so.

TL;DR I want to completely understand a specific example of a custom filter function, so that I am able to apply a modified version (or create my own) to my own RNA-seq dataset using the Sleuth package in R.

Part 1 Background:

When preparing a Sleuth object with the function 'sleuth_prep', the default transcript filter function applied is known as 'basic_filter'. The code for basic_filter is:

basic_filter <- function (row, min_reads = 5, min_prop = 0.47) 
{
    mean(row >= min_reads) >= min_prop
}

This filters out all transcripts that do not have at least 5 estimated counts in at least 47% of the samples. From my understanding, basic_filter is suitable for single factor, two level contrast experiments. However, applying the basic_filter to datasets with multiple factors and levels might result in lost information (i.e. potentially interesting transcripts being filtered out). I would like to reference a specific example of someone creating a custom filter function for a two-factor RNA-Seq experiment: the aim was to minimize the loss of interesting transcripts expressed in the interaction of those two factors, while also trying to reduce the inclusion of false positives.

http://achri.blogspot.com/2018/02/are-you-losing-important-genes-in-your.html

> meta

        sample                 path condition cell_line
1    NSC_Ctl_1   NSC_Ctl_1.kallisto       NSC       Ctl
2    NSC_Ctl_2   NSC_Ctl_2.kallisto       NSC       Ctl
3    NSC_Ctl_3   NSC_Ctl_3.kallisto       NSC       Ctl
4     NSC_KO_1    NSC_KO_1.kallisto       NSC        KO
5     NSC_KO_2    NSC_KO_2.kallisto       NSC        KO
6     NSC_KO_3    NSC_KO_3.kallisto       NSC        KO
7  Odiff_Ctl_1 Odiff_Ctl_1.kallisto        OD       Ctl
8  Odiff_Ctl_2 Odiff_Ctl_2.kallisto        OD       Ctl
9  Odiff_Ctl_3 Odiff_Ctl_3.kallisto        OD       Ctl
10  Odiff_KO_1  Odiff_KO_1.kallisto        OD        KO
11  Odiff_KO_2  Odiff_KO_2.kallisto        OD        KO
12  Odiff_KO_3  Odiff_KO_3.kallisto        OD        KO

With the basic_filter applied:

> so <- sleuth_prep(meta, ~ condition+cell_line+condition:cell_line)
#reading in kallisto results
#............
#normalizing est_counts
#26036 targets passed the filter
#normalizing tpm
#merging in metadata
#normalizing bootstrap samples
#summarizing bootstraps

The author states that transcripts known to be differentially expressed were missing after applying basic_filter. The custom function the author creates:

> design_filter <- function(design, row, min_reads=5, min_prop = 0.47){
    sum(apply(design, 2, function(x){
        y <- as.factor(x);
        return(max(tapply(row, y, function(f){sum(f >= min_reads)})/
                   tapply(row, y, length)) == 1 
                   || basic_filter(row, min_reads, min_prop)
              )
    })) > 0}

The result of applying the custom filter:

> so_redux <- sleuth_prep(meta, ~cell_line*condition, 
                          filter_fun=function(x){design_filter(so$design,x)})
#reading in kallisto results
#............
#normalizing est_counts
#26370 targets passed the filter
#normalizing tpm
#merging in metadata
#normalizing bootstrap samples
#summarizing bootstraps

The author states this custom filter requires ~25% of samples to have "moderate expression" of the transcript, while adding the constraint that those 25% cover all replicates of some factor level. The custom filter passed 334 more transcripts than the basic_filter, which also included the author's known "true positive" transcript to appear in the results.

Part 1 Question:

Could someone, with the patience of a public school teacher, please walk me through understanding the author's custom function? I have a basic understanding of R commands, but I'm in the learning stages of being able to combine them into more complex custom functions. Learning how to follow this specific function will improve my overall understanding of R, and encourage me to create my own custom functions in the future.

Part 2 Background

I am processing RNA-seq data for my lab using Kallisto and Sleuth. The experiment includes 4 genotypes and 2 conditions (control vs treatment). In total there are 8 different combinations, and for each combination there are 3 biological replicates. The overall number of samples for this experiment is 8 x 3 = 24. The metadata (path information hidden):

> meta
     sample genotype condition path
1   ad_neg2       ad   control    x
2   ad_neg3       ad   control    x
3   ad_neg5       ad   control    x
4  ad_plus1       ad treatment    x
5  ad_plus2       ad treatment    x
6  ad_plus3       ad treatment    x
7   my_neg1       my   control    x
8   my_neg2       my   control    x
9   my_neg4       my   control    x
10 my_plus2       my treatment    x
11 my_plus3       my treatment    x
12 my_plus4       my treatment    x
13  pc_neg1       pc   control    x
14  pc_neg2       pc   control    x
15  pc_neg3       pc   control    x
16 pc_plus1       pc treatment    x
17 pc_plus3       pc treatment    x
18 pc_plus4       pc treatment    x
19  wt_neg2       wt   control    x
20  wt_neg3       wt   control    x
21  wt_neg5       wt   control    x
22 wt_plus1       wt treatment    x
23 wt_plus2       wt treatment    x
24 wt_plus3       wt treatment    x

Part 2 Question:

I intend to create a Sleuth object with a model similar to the referenced author in Part 1 ~ genotype+condition+genotype:condition and apply a similar custom filter function (i.e. require 12.5% of samples to have 5 mapped reads to a transcript, adding the constraint those 12.5% cover all replicates of a factor level). Does this filtering step seem reasonable? If so, how might I create a custom function for this?

Thank you so much for your time.

RNA-Seq sleuth rna-seq kallisto R • 1.6k views

ADD COMMENT • link 3.9 years ago by jonathanU ▴ 40