I've set up a pipeline to test for differential gene expression across different species (corals).
Cross-species testing is not trivial, and my strategy is to take the raw gene counts per each species (alignment to each genome done with STAR and assembly with Stringtie) and normalize by gene length (I would extract this info from the stringtie merged.gtf). I would then find orthologs for all genes (using Orthofinder and not NCBI Homologene, as corals are not well represented there) and use the tool xSpecies described in this paper, which allows to incorporate gene homology relationships to align expression profiles between species, rather than relying solely on sequence similarity or orthology predictions (several papers instead find orthologs and just take the ones with a 1-to-1 correspondence, but this might not always be the case for all species). I would then use the normalized counts with the homology info for differential expression using limma, which can take already normalized data (Deseq does not).
My question is, does this approach for cross-species differential gene expression analysis look appropriate?
Another thing that I would like to do is compare which genes are differentially expressed and which pathways are enriched in the cross-species comparison (all corals at a control condition) with those found in within-species comparisons (each species is also exposed to a treatment) - e.g., stress gene A is upregulated in species Y compared to the other 2 species at the control condition (cross-species) and it's also upregulated in species Y with increasing temperature (within species). But would this comparison be correct? As I would use different methods to analyze the data, xSpecies+limma for the cross-comparison and Deseq for the within-comparison.
Thanks a lot!! Any feedback would be super appreciated!
I don't know that much about this area, but you might consider OMA instead or in addition to Orthofinder.
I am also interested in these questions. I hope some of the more experienced and/or knowledgeable members here respond.
How many of these are you generating the data for yourself and how many are you going to try and use public data for.
Hi GenoMax, I'm not sure I got your question right. I'm using genome and gtf info that are publicly available to map and assemble the transcripts for all 3 species.
Are you assembling the transcripts from RNAseq data or just extracting the sequence from genome based on GTF?
In case mentioned below by cfos4698 only a part of the transcriptome data was from public sources so the private data used was likely generated in a controlled way (e.g. same library protocol, sequencing etc). If you are going to use only public data then you should keep an eye out for batch effects that may have nothing to do with biology.
I'm doing a genome-guided assembly (using publicly available genomes info) of the transcripts, we generated RNAseq data for all the species, which were all treated the same way (same protocols for extraction, library prep and sequencing). Thank you for the feedback!