I'm trying to do a single-sample pathway enrichment analysis with Kallisto/Sleuth. I have 3 control samples and 3 mutated samples. I have good reasons to believe that each mutated sample, taken individually, has a large number of differentially expressed genes/pathways, and that this masks a core set of genes or pathways that are differentially regulated in all 3. I'm interested in both the common set of pathways and the sample-specific ones, so simply comparing 3 controls vs 3 mutated samples won't do.
I was thinking about comparing the 3 control samples to the mutated samples one by one, to define the differentially expressed genes specific to each mutated sample. I estimated transcript-level expression with Kallisto, and used Sleuth to aggregate the data at the gene level and run the usual differential expression analysis, each time with 3 controls vs 1 mutated sample. This gave me 3 lists of differentially expressed genes. So far so good (even though the results might not be super reliable).
However, I would really like to do a pathway-level analysis with Sleuth instead of the gene-level analysis. Since Sleuth works with transcript-level data, I had to supply a transcript -> gene table so that it could aggregate transcript-level data into gene-level data. I can generate a transcript -> pathway table in the same way, for example with MSigDB/Reactome sets. However, many genes are part of several pathways, and Sleuth fails at the aggregation step.
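To make the problem concrete, here is a minimal sketch in plain Python (not Sleuth code) of how a transcript -> pathway table ends up with repeated transcript IDs once a gene belongs to more than one pathway. All transcript, gene and pathway names are made up for illustration.

```python
from collections import Counter

# Hypothetical gene -> pathway annotation (made-up names, in the style of MSigDB/Reactome sets).
gene_to_pathways = {
    "GENE_A": ["PATHWAY_1", "PATHWAY_2"],
    "GENE_B": ["PATHWAY_2"],
}

# Hypothetical transcript -> gene table, like the one supplied for gene-level aggregation.
transcript_to_gene = {
    "TX_A1": "GENE_A",
    "TX_A2": "GENE_A",
    "TX_B1": "GENE_B",
}


def build_transcript_pathway_table(tx2gene, gene2paths):
    """Return one (transcript, pathway) row per pathway membership."""
    rows = []
    for tx, gene in tx2gene.items():
        for pathway in gene2paths.get(gene, []):
            rows.append((tx, pathway))
    return rows


table = build_transcript_pathway_table(transcript_to_gene, gene_to_pathways)
for row in table:
    print(row)

# Transcripts that appear in more than one row, i.e. duplicated keys in the mapping.
tx_counts = Counter(tx for tx, _ in table)
duplicated = sorted(tx for tx, n in tx_counts.items() if n > 1)
print("duplicated transcript IDs:", duplicated)
```

In this toy table the two transcripts of GENE_A each occur twice, so the transcript column is no longer a unique key. With a real mapping of this kind, the Sleuth run produces the following output.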
    reading in kallisto results
    dropping unused factor levels
    ....
    normalizing est_counts
    88212 targets passed the filter
    normalizing tpm
    merging in metadata
    aggregating by column: pathway
    15688 genes passed the filter
    Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
      Join results in 15599004 rows; more than 4701355 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
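The error message itself says that the join would produce more rows (15599004) than nrow(x)+nrow(i) (4701355) because of duplicated key values. The sketch below, again in plain Python with made-up data, shows how the size of such a join grows when the same key is repeated on both sides. Which table plays the role of x and which of i inside Sleuth is an assumption on my part; the code only illustrates the arithmetic.

```python
from collections import Counter


def join_row_count(x_keys, i_keys):
    """Number of rows an inner join on the key column would produce."""
    x_counts = Counter(x_keys)
    i_counts = Counter(i_keys)
    # Each key contributes (occurrences in x) * (occurrences in i) rows.
    return sum(n * i_counts[key] for key, n in x_counts.items())


# x: key column of a hypothetical long-format table, 2 transcripts in 3 samples.
x_keys = ["TX_A1", "TX_A2"] * 3
# i: key column of a hypothetical transcript -> pathway table, 4 pathways per transcript.
i_keys = ["TX_A1"] * 4 + ["TX_A2"] * 4

rows = join_row_count(x_keys, i_keys)
limit = len(x_keys) + len(i_keys)  # the nrow(x)+nrow(i) threshold quoted in the error

print("join rows:", rows)
print("nrow(x)+nrow(i):", limit)
print("exceeds threshold:", rows > limit)
```

Every transcript that belongs to several pathways multiplies its rows in the joined table, which is consistent with the "duplicate key values" hint in the message.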
I'm trying to figure out what to do with this, and I would appreciate any feedback or comments.
- Is it a reasonable approach at all to compare the 3 control replicates to single mutated samples?
- How would you do the aggregation where genes/transcripts belong to multiple pathways?
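To clarify what the second question is about, here is a minimal sketch of one naive way to aggregate outside Sleuth: sum the estimated counts of all transcripts in a pathway, letting a transcript contribute to every pathway it belongs to. This is plain Python with made-up names and numbers, it is not what Sleuth does internally, and whether such overlapping sums are statistically sensible is part of the question.

```python
from collections import defaultdict

# Made-up estimated counts per transcript for two hypothetical samples.
est_counts = {
    "TX_A1": {"control_1": 10.0, "mutant_1": 30.0},
    "TX_A2": {"control_1": 5.0, "mutant_1": 5.0},
    "TX_B1": {"control_1": 20.0, "mutant_1": 2.0},
}

# Hypothetical transcript -> pathway rows; TX_A1 and TX_A2 belong to two pathways.
transcript_pathway_rows = [
    ("TX_A1", "PATHWAY_1"), ("TX_A1", "PATHWAY_2"),
    ("TX_A2", "PATHWAY_1"), ("TX_A2", "PATHWAY_2"),
    ("TX_B1", "PATHWAY_2"),
]


def aggregate_by_pathway(counts, rows):
    """Sum transcript counts per pathway and sample, once per membership."""
    totals = defaultdict(lambda: defaultdict(float))
    for tx, pathway in rows:
        for sample, value in counts[tx].items():
            totals[pathway][sample] += value
    return {pathway: dict(samples) for pathway, samples in totals.items()}


pathway_counts = aggregate_by_pathway(est_counts, transcript_pathway_rows)
for pathway, samples in sorted(pathway_counts.items()):
    print(pathway, samples)
```

Note that with this scheme the pathway totals overlap: the counts of TX_A1 and TX_A2 appear in both pathways, so the pathway-level values are not independent of each other.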