Question

Correlation between two different datasets: RNAseq results and absence/presence of Type 3 Secretion System effectors

20 months ago

keshav.prasad.gubbi • 0

Dear All,

I have a "how would you solve this" kind of question. I have two tables: (1) a Log2FoldChange table and (2) an effectors table.

First, the Log2FoldChange table was obtained by running a DESeq analysis on 14 different infected samples, each compared to the control, taking the log2FoldChange values that DESeq reports for each sample, and merging the 14 log2FoldChange columns into a single table keyed on gene (each row is a unique gene). The table is roughly 22,000 x 14: there are 22,414 unique genes for 14 different strains.

Second, there is a presence/absence effector table for all 14 strains, which tells us which effectors are present in each strain (each strain has a different set of effectors). This is a 50 x 14 table for the same 14 strains: each row is a unique effector, and each entry is 0 or 1 for absence or presence of that effector in the corresponding strain.

What we want to investigate is: is there a correlation between the presence/absence of effectors and gene expression in the host? Essentially, we would like to quantify the correlation between these two separate datasets.
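One simple baseline for the question above, before anything multivariate, is a per-pair correlation: for each gene and each effector, correlate the gene's 14 log2 fold changes with the effector's 0/1 vector (a Pearson correlation against a binary variable is the point-biserial correlation). The sketch below uses only the Python standard library. All gene names, effector names and values are hypothetical toy data with 6 strains instead of 14.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    When y is a 0/1 vector this is the point-biserial correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant vector has no defined correlation
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: one log2FC value per strain for each gene.
log2fc = {
    "geneA": [2.1, 1.8, 0.1, -0.2, 2.4, 0.0],
    "geneB": [0.3, -0.1, 0.2, 0.1, 0.0, -0.3],
}
# Hypothetical presence (1) / absence (0) calls, same strain order.
effectors = {
    "eff1": [1, 1, 0, 0, 1, 0],
    "eff2": [0, 1, 0, 1, 0, 1],
}

def correlate_all(log2fc, effectors):
    """Correlate every gene with every effector across strains."""
    out = {}
    for gene, values in log2fc.items():
        for eff, status in effectors.items():
            out[(gene, eff)] = pearson(values, status)
    return out

results = correlate_all(log2fc, effectors)
for (gene, eff), r in sorted(results.items()):
    print(f"{gene}\t{eff}\t{r:.3f}")
```

With 22,414 genes and 50 effectors this produces over a million coefficients from only 14 strains each, so any ranking would need multiple-testing correction and should be read as exploratory.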

Any ideas or suggestions on how to approach this problem would be very helpful. My initial idea is to carry out a Canonical Correlation Analysis (CCA), and I am still working on it, but I am open to more ideas and suggestions from the community.

Thanks in advance for your time and suggestions.

Correlation DESeq RNAseq • 813 views

updated 20 months ago by LChart 3.9k • written 20 months ago by keshav.prasad.gubbi • 0

score 1 · Answer 1 · 2022-08-12

CCA is an interesting idea, but one drawback is that the binary nature of the effector matrix typically does not work well with L2 objectives, and I worry that you will not get interpretable loadings on your effectors.

If you have sufficiently many replicates, you can do this directly in DESeq2 by including the effector as a variable in the design: ~ effector.A.status + strain + 0 (just cbind the effector matrix to your metadata). Because effector status doesn't vary within a strain, this will basically "pull out" an estimate of the group averages of the effector-positive and effector-negative strains, allowing for a direct comparison.
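To make the "cbind the effector matrix to your metadata" step and the resulting group comparison concrete, here is a minimal Python sketch. It is not DESeq2 and fits no model. It only shows joining per-strain effector calls onto a per-sample sheet, then comparing the mean expression of effector-positive and effector-negative samples for one gene. The sample sheet, strain names, the `effector_A` column and the expression values are all hypothetical.

```python
# Hypothetical sample sheet: two replicates for each of four strains.
samples = [
    {"sample": "s1", "strain": "strain1"},
    {"sample": "s2", "strain": "strain1"},
    {"sample": "s3", "strain": "strain2"},
    {"sample": "s4", "strain": "strain2"},
    {"sample": "s5", "strain": "strain3"},
    {"sample": "s6", "strain": "strain3"},
    {"sample": "s7", "strain": "strain4"},
    {"sample": "s8", "strain": "strain4"},
]

# Hypothetical effector calls per strain (1 = present, 0 = absent).
effector_status = {
    "strain1": {"effector_A": 1},
    "strain2": {"effector_A": 1},
    "strain3": {"effector_A": 0},
    "strain4": {"effector_A": 0},
}

def add_effector_columns(samples, effector_status):
    """Return new metadata rows with each strain's effector calls appended."""
    return [{**row, **effector_status[row["strain"]]} for row in samples]

def group_means(values, metadata, column):
    """Mean of values in effector-positive and effector-negative samples."""
    pos = [v for v, row in zip(values, metadata) if row[column] == 1]
    neg = [v for v, row in zip(values, metadata) if row[column] == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

metadata = add_effector_columns(samples, effector_status)

# Hypothetical normalized log-expression of one gene, one value per sample.
expr = [5.0, 5.2, 4.8, 5.0, 3.0, 3.2, 2.8, 3.0]
pos_mean, neg_mean = group_means(expr, metadata, "effector_A")
print(f"effector_A positive vs negative: {pos_mean - neg_mean:.2f}")
```

Since effector status is constant within a strain, the effector column is a sum of strain indicator columns, so it is worth checking that the design matrix is full rank before fitting a model that contains both terms.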

To look for more complicated patterns, you could select the differentially expressed genes across all strains, cluster the expression matrix on those genes, and overlay the effector status (I use a heatmap for such things).
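The select, cluster and overlay steps can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: gene selection uses a hypothetical |log2FC| threshold as a stand-in for a proper differential expression call, the clustering is a small hand-written average-linkage routine on Euclidean distance, and the "overlay" is just printing effector status next to each cluster rather than drawing a heatmap. All names and values are made up.

```python
import math

# Hypothetical toy log2FC matrix: one value per strain for each gene.
strains = ["s1", "s2", "s3", "s4"]
log2fc = {
    "geneA": [2.0, 2.2, 0.1, 0.0],
    "geneB": [-1.5, -1.7, 0.2, 0.1],
    "geneC": [0.1, 0.0, 0.2, -0.1],  # never reaches the threshold
}
# Hypothetical presence/absence calls for one effector.
effector_A = {"s1": 1, "s2": 1, "s3": 0, "s4": 0}

def select_genes(log2fc, threshold=1.0):
    """Keep genes whose |log2FC| reaches the threshold in at least one strain."""
    return {g: v for g, v in log2fc.items()
            if max(abs(x) for x in v) >= threshold}

def strain_profiles(selected, strains):
    """Transpose gene rows into one expression profile per strain."""
    return {s: [v[i] for v in selected.values()] for i, s in enumerate(strains)}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster(profiles, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[s] for s in profiles]

    def linkage(c1, c2):
        d = [euclidean(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(d) / len(d)

    while len(clusters) > k:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)  # closest pair of clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

selected = select_genes(log2fc)
clusters = cluster(strain_profiles(selected, strains), k=2)
for members in clusters:
    # Overlay: show effector status alongside cluster membership.
    print(members, [effector_A[s] for s in members])
```

If the clusters line up with the 0/1 pattern of an effector, as they do in this constructed example, that effector is a candidate driver of the shared expression pattern. With real data a plotting library's clustered heatmap with an annotation bar would do the same job visually.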