My experimental setup contains two Arabidopsis thaliana ecotypes (Col-0 and Bur-0) that are either infected with a virus or mock-inoculated. I want to compare the response of these two genotypes to virus infection at the transcript level by RNA-seq. My problem is that although these ecotypes are closely related, they still have plenty of differences: mutations that can cause amino-acid changes in the encoded proteins, rearranged/fused genes, novel alternative transcripts, etc. This means that the appropriate reference transcriptome for the alignment differs between the two ecotypes.
Which of the following approaches do you think is the best?
1. Make a merged reference quasi-transcriptome (i.e., mix the two sequence sets with all the alternative transcripts, collapsing the exact duplicates), and perform a kallisto pseudoalignment against this mixed transcriptome. My reasoning is that the sequence variants in the two ecotypes can be treated as alternative transcripts, but this also results in an almost duplicated transcriptome, and I am not sure how that affects the models that the DE programs use.
2. Use the Col-0 reference transcriptome for the Col-0 reads and the Bur-0 transcriptome for the Bur-0 reads, perform the alignments and the differential gene expression analyses separately, and then try to compare them somehow (e.g., by GO enrichment analysis). In this case, I have to relate the genes in the two ecotypes to each other (e.g., which Col-0 gene a fused transcript in Bur-0 corresponds to).
3. Assemble a quasi-transcriptome de novo using the reads from both ecotypes (with viral reads removed), annotate it, and perform a DE analysis. It will be a pain to map the transcripts back to the genome(s) and get gff/gtf files with genomic coordinates. An alternative (maybe better) version is to assemble the two transcriptomes separately (to avoid building chimeras).
4. Use the Col-0 (and Bur-0) genome(s) to predict novel transcripts from Hisat2 alignments and then either merge them or treat them separately (see points 1 and 2).
5. Just use the Col-0 transcriptome (or genome) for both ecotypes and perform the differential gene expression analysis in one step, at the cost of missing the highly divergent transcripts (which may be the most interesting ones). Perhaps the structural variants (e.g., fused transcripts) could be predicted with dedicated tools.
6. Some combination of the above possibilities.
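To make the merged-reference idea (the first option) concrete, this is roughly the merging step I have in mind. It is only a minimal sketch under my own assumptions: the function names (`parse_fasta`, `merge_transcriptomes`, `to_fasta`), the `ecotype|id` naming scheme, and the toy sequences are all hypothetical, and only exact sequence duplicates are collapsed.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def merge_transcriptomes(sources: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each distinct sequence to the ecotype-tagged IDs that share it."""
    merged: Dict[str, List[str]] = {}
    for ecotype, lines in sources.items():
        for record_id, seq in parse_fasta(lines):
            merged.setdefault(seq, []).append(f"{ecotype}|{record_id}")
    return merged


def to_fasta(merged: Dict[str, List[str]]) -> str:
    """Write one record per distinct sequence; the header lists all source IDs."""
    records = []
    for seq, ids in merged.items():
        records.append(">" + ";".join(ids))
        records.append(seq)
    return "\n".join(records) + "\n"


# Toy input (made-up sequences): one transcript is identical in both
# ecotypes, the other differs by a single base.
col0 = [">AT1G01010.1", "ATGGAGGATC", ">AT1G01020.1", "ATGGCGGCGA"]
bur0 = [">AT1G01010.1", "ATGGAGGATC", ">AT1G01020.1", "ATGGCGGCTA"]

merged = merge_transcriptomes({"Col-0": col0, "Bur-0": bur0})
print(to_fasta(merged), end="")
```

The resulting FASTA is what I would pass to `kallisto index`. Keeping the list of source IDs per sequence should make it possible to sum the ecotype-specific variants back to gene level before the DE step, although whether that is statistically sound is exactly what I am unsure about.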
I appreciate any suggestions.