I am working on a project to compare gene expression between four tree species: A, B, C and D. They belong to the same genus, but the phylogeny suggests that B and C are from the same clade, while A and D are from two other, different clades. Species A has a reference genome (ca. 90,000 scaffolds), but de novo transcriptome assembly is needed for B, C and D. For every species there are 2 conditions (control, treatment) with 3 biological replicates each.
I used Trinity to assemble B, C and D individually, and the mapping rates with Salmon are good (>95%) for all species. I am wondering whether I should also try a genome-guided assembly using the genome of species A. Given that the species are from different clades, do you think this would be a problem?
I would also like to assess orthology between transcripts from the four species. One option I can think of is to use edgeR (exact test) to call differentially expressed genes (DEGs) for each species individually, and then use OrthoFinder to find orthogroups so that the DEGs can be compared across all species. The other option is to cluster all transcripts/genes from all species first and run a single DEG analysis on the clusters, although I do not know how difficult it will be to cluster the 600,000+ transcripts generated by Trinity, or how complicated the analysis will be in R. At this point I am leaning toward the first option, but I am inexperienced in this kind of analysis. Could you please give me some direction on how to implement this appropriately?
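To make the first option more concrete, here is a minimal Python sketch of the comparison step I have in mind, and I would welcome corrections. It assumes the orthogroup table is tab-separated with an "Orthogroup" column followed by one column per species, with gene IDs separated by commas (which is how I understand OrthoFinder's Orthogroups.tsv to look), and that the DEG IDs from edgeR use the same gene IDs. The function names, gene IDs and toy data are all made up for illustration.

```python
import csv
import io


def parse_orthogroups(tsv_text):
    """Parse an orthogroup table into {orthogroup: {species: set(gene_ids)}}.

    Assumed layout: tab-separated, first column 'Orthogroup', then one
    column per species holding comma-separated gene IDs (may be empty).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = {}
    for row in reader:
        groups[row["Orthogroup"]] = {
            species: {g.strip() for g in (cell or "").split(",") if g.strip()}
            for species, cell in row.items()
            if species != "Orthogroup"
        }
    return groups


def de_species_per_orthogroup(groups, degs):
    """For each orthogroup, list the species with at least one DEG in it."""
    return {
        og: sorted(
            sp for sp, genes in members.items() if genes & degs.get(sp, set())
        )
        for og, members in groups.items()
    }


# Toy data only: hypothetical gene IDs, not real results.
toy_tsv = (
    "Orthogroup\tA\tB\tC\tD\n"
    "OG0000001\tA_g1, A_g2\tB_g1\tC_g1\tD_g1\n"
    "OG0000002\tA_g3\tB_g2\t\tD_g2\n"
)
# Hypothetical per-species DEG sets, e.g. exported from edgeR.
toy_degs = {"A": {"A_g1"}, "B": {"B_g1", "B_g2"}, "C": {"C_g1"}, "D": set()}

groups = parse_orthogroups(toy_tsv)
summary = de_species_per_orthogroup(groups, toy_degs)
for og, species in summary.items():
    print(og, species)
```

This only records whether an orthogroup contains a DEG in each species; it ignores the direction of change and the size of the orthogroup, so I assume I would need to add log fold changes and think about how to handle orthogroups with several paralogs per species.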
Thanks and looking forward to hearing your suggestions.