Hello everyone,
I think I might have a good one today.
I have RNA-seq data from 16 related plant-pathogenic bacteria, all of which have been transformed with the same transcription factor, which regulates a major virulence mechanism. We sequenced RNA from these mutants and also from non-transformed wild-type bacteria. I ran DE analyses on these samples, and the number of DEGs is highly variable between the species. In addition, the spread of the fold changes differs between the species. If you check the volcano plots, they all look quite different.
My objective was to find the core regulon of this transcription factor across the different species. I identified it by grouping orthologous genes across my species into orthogroups and checking which orthogroups were enriched among my upregulated/downregulated genes. This gives me a list of orthogroups that are part of the core regulon of this transcription factor within these related species. This was all quite straightforward and worked well; what we expected to find was there.
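To make the split between "core" and "everything else" concrete, here is a minimal sketch of the bookkeeping. It is not my actual pipeline: the species names, orthogroup IDs and the `de_calls` layout are made up for illustration, and I am assuming here that "core" means an orthogroup is differentially expressed in the same direction in every species.

```python
# Hypothetical input: for each species, orthogroup -> direction of change
# ("up" or "down") for orthogroups containing a significant DEG.
de_calls = {
    "species_A": {"OG0001": "up", "OG0002": "down", "OG0003": "up"},
    "species_B": {"OG0001": "up", "OG0002": "down", "OG0003": "down"},
    "species_C": {"OG0001": "up", "OG0002": "down"},
}


def split_orthogroups(de_calls):
    """Split orthogroups into a core set and a variable set.

    Assumption: an orthogroup is 'core' if it is DE in every species
    and always in the same direction; anything else is 'variable'.
    """
    all_ogs = set().union(*(calls.keys() for calls in de_calls.values()))
    core, variable = {}, set()
    for og in sorted(all_ogs):
        # None marks a species in which this orthogroup was not DE.
        directions = [calls.get(og) for calls in de_calls.values()]
        if None not in directions and len(set(directions)) == 1:
            core[og] = directions[0]
        else:
            variable.add(og)
    return core, variable


core, variable = split_orthogroups(de_calls)
print(core)      # {'OG0001': 'up', 'OG0002': 'down'}
print(variable)  # {'OG0003'}
```

The `variable` set is the group of orthogroups my question below is about.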
Now, there are also some orthogroups which are differentially regulated in these bacteria but not consistently: up in some species, down in others. I want to know if there is a relationship between these "orthogroups which show differential expression patterns across my different species" and the phylogeny of these species. What I thought of doing is using k-means clustering, treating the data as if it were "time course data". Instead of using time as a variable, I want to use the position in the phylogeny as a variable. An approach like that would give me something like this, where sampling time would be my phylogenetic tree: (this is just a figure I pulled from the web)
I was hoping that doing something like that could be informative for me.
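Roughly, what I have in mind looks like the sketch below. Everything in it is an assumption for illustration: the six species, the four orthogroups and their log2 fold changes are invented, the columns are assumed to be in the tip order of the tree, and the tiny k-means with a deterministic farthest-point start is only there to keep the example self-contained (in practice I would use an existing implementation).

```python
# Hypothetical matrix: rows are orthogroups, columns are species in the
# tip order of the phylogeny (sp1..sp3 = one clade, sp4..sp6 = another).
species_order = ["sp1", "sp2", "sp3", "sp4", "sp5", "sp6"]
lfc = {
    "OG_a": [2.1, 1.8, 2.3, -1.9, -2.2, -2.0],
    "OG_b": [1.7, 2.2, 1.9, -2.1, -1.8, -2.4],
    "OG_c": [-2.0, -1.9, -2.2, 2.0, 2.3, 1.8],
    "OG_d": [-1.8, -2.1, -2.0, 2.2, 1.9, 2.1],
}


def sq_dist(u, v):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(rows, k, n_iter=100):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        # Next centroid: the row farthest from all centroids chosen so far.
        farthest = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(row, centroids[c])) for row in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


names = list(lfc)
labels, centroids = kmeans([lfc[n] for n in names], k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

One thing to keep in mind with this setup: Euclidean k-means does not use the column order itself, so the tree order only affects how the cluster profiles are drawn, not which orthogroups end up together.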
I am struggling to decide what data I should use to do such a clustering. I imagine the results would be very similar but I was thinking I would just ask in case some of you might have a strong opinion on it.
Should I use the fold change of the orthogroups across my species, or should I use some sort of normalized read count for each orthogroup?
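To show what I mean by the fold-change option, here is a small sketch with made-up numbers. The function name, the pseudocount of 1 and the replicate values are all assumptions for illustration; in a real analysis the log2 fold changes would come from the DE tool rather than from this naive ratio of means.

```python
import math


def log2_fold_change(mutant_reps, wildtype_reps, pseudocount=1.0):
    """Naive log2 fold change from normalized counts of replicate samples.

    The pseudocount (an assumption, not a recommendation) avoids
    division by zero for orthogroups with no reads in one condition.
    """
    mean_mut = sum(mutant_reps) / len(mutant_reps)
    mean_wt = sum(wildtype_reps) / len(wildtype_reps)
    return math.log2((mean_mut + pseudocount) / (mean_wt + pseudocount))


# Same orthogroup in two hypothetical species with very different
# baseline expression: the counts differ a lot, the fold change does not.
low_expressed = log2_fold_change([7, 7, 7], [1, 1, 1])
high_expressed = log2_fold_change([511, 511, 511], [127, 127, 127])
print(low_expressed, high_expressed)  # 2.0 2.0
```

The point of the example is that the fold change is already relative to each species' own wild type, whereas normalized counts would still carry the species-specific baseline and would need some per-species scaling before clustering.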
The info in short again: 16 bacterial species, each transformed and non-transformed, 3 replicates per sample. Not all bacteria show the same number of DEGs, and the expression of the DEGs differs between the different samples.
Thank you all for reading this, Thomas