Greetings, dear Biostars members!
I'm analysing RNA-seq data from a non-model species (a mollusc with no genetic information available except rDNA and COX sequences). We've tried several assemblers (Trinity, rnaSPAdes and Oases) for de novo transcriptome construction. Although Oases seems to be obsolete and is no longer supported, it worked best with our data, both in terms of technical metrics (median/mean transcript length, number of transcripts, etc.) and in terms of the biologically relevant BUSCO assessment. One of the project goals is to perform differential expression analysis (DEA) between two conditions (salmon, tximport, edgeR). After this step we of course annotate the differentially expressed transcripts we find (DETs; sorry, I cannot bring myself to call them DEGs) with Blastx and InterProScan.
Unlike Trinity and rnaSPAdes, the Oases assembler doesn't provide any information on gene isoforms. So when I perform DEA, I do it at the transcript level and not at the gene level, which is the preferred one (judging by a number of posts on Biostars and by papers, e.g. the tximport article). But the really awful part comes next: when I look through the annotation, I see lots of identical Blastx hits in both the Up and Down DET lists. I can have 2-3 transcripts, spread over both lists, that are all highly similar to the same protein according to Blastx. So I end up with something like differential transcript usage, which I would like to summarise to the gene level somehow. Because of experimental limitations our data are preliminary, and they are certainly not suited to an analysis as fine-grained as transcript-level differential expression.
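To make the problem concrete, here is a minimal Python sketch of how such conflicting hits can be spotted. It assumes each DET has already been reduced to its single best Blastx hit; the function name `shared_hits` and all transcript and protein IDs are made up for illustration and are not from my real data.

```python
from collections import defaultdict

def shared_hits(up, down):
    """Find Blastx hits that occur in both the Up and the Down DET list.

    up, down: dict mapping transcript ID -> best Blastx hit ID (assumed input shape).
    Returns {hit: {"up": [transcripts], "down": [transcripts]}} for hits seen in both lists.
    """
    by_hit = defaultdict(lambda: {"up": [], "down": []})
    for tx, hit in up.items():
        by_hit[hit]["up"].append(tx)
    for tx, hit in down.items():
        by_hit[hit]["down"].append(tx)
    # Keep only proteins hit by at least one Up and at least one Down transcript
    return {hit: d for hit, d in by_hit.items() if d["up"] and d["down"]}

# Toy example with hypothetical IDs
up = {"tx_1": "PROT_A", "tx_2": "PROT_B"}
down = {"tx_3": "PROT_A", "tx_4": "PROT_C"}
conflicts = shared_hits(up, down)
print(conflicts)
```

In this toy case only PROT_A is reported, because it is the best hit of one Up transcript and one Down transcript.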
Does anyone know of a way to cluster transcripts to the gene level (other than simply using another assembler)? By the way, the transcripts were initially clustered with CD-HIT (0.95). I understand that other assemblers infer "isoform" information from the assembly process and that it can only be considered "predicted", but even that level of certainty would help at the moment.
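The only naive fallback I can think of is to use the best Blastx hit as a surrogate gene ID. Below is a minimal Python sketch of that idea, not something I have validated. It assumes the Blastx results are in the standard 12-column tabular format (outfmt 6, with the bit score in the last column), and the function names and IDs are hypothetical. Transcripts without any hit simply stay as their own "gene", and paralogs hitting the same protein would be merged, which is an obvious weakness.

```python
import csv
import io

def best_hits(blast_tab):
    """Return {transcript: subject}, keeping the highest-bitscore hit per transcript.

    blast_tab: an open file (or any iterable of lines) in 12-column tabular BLAST format.
    """
    best = {}
    for row in csv.reader(blast_tab, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        tx, subject, bitscore = row[0], row[1], float(row[11])
        if tx not in best or bitscore > best[tx][1]:
            best[tx] = (subject, bitscore)
    return {tx: hit for tx, (hit, _) in best.items()}

def tx2gene(transcripts, hits):
    """Map every transcript to its best hit; unannotated transcripts map to themselves."""
    return {tx: hits.get(tx, tx) for tx in transcripts}

# Toy example with made-up IDs and scores
toy = io.StringIO(
    "tx_1\tPROT_A\t98.0\t200\t4\t0\t1\t600\t1\t200\t1e-100\t400\n"
    "tx_1\tPROT_B\t60.0\t150\t60\t0\t1\t450\t1\t150\t1e-20\t90\n"
    "tx_2\tPROT_A\t95.0\t180\t9\t0\t1\t540\t10\t190\t1e-90\t350\n"
)
mapping = tx2gene(["tx_1", "tx_2", "tx_3"], best_hits(toy))
print(mapping)
```

Written out as a two-column table (transcript ID, gene ID), a mapping like this has the shape that tximport takes as its `tx2gene` argument, so the salmon quantifications could in principle be summarised to these surrogate genes before edgeR. Whether that is sound is exactly what I'm unsure about.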
Thank you for at least reading to the end, and all the best to you!