I'm trying to do DE analysis using DEseq (or DEseq2) to find species differences between RNA-seq samples (n=8) of two very closely related bird species.
In summary, I assembled two 'reference' transcriptomes out of all the Illumina reads (100bp PE) from every species using Trinity. I then blasted the two transcriptomes to obtain gene names for the assembled contig. Finally, I aligned the samples individually with their corresponding transcriptomes using BWA and obtained raw counts with eXpress. I'm now analyzing expression in R with DeSEQ.
The problem is that I sometimes have very similar sequences between both species that are have different gene names (according to a uniprot_sprot + nr blast, evalue < 1e-5, 1 result per query - I then blasted transcriptomes together to see if genes were corresponding to each other). Interestingly, it's usually only a subset of sequences (within the same "compxxxxx") that give different gene names, the other sequences have the same names. I'm not sure why this happens (alternative splicing/sequencing errors/other?) but it surely is a problem when I try to analyze gene expression. Some genes that appear to be 'overexpressed' in one species, when blasted against the other species' transcriptome, are very similar to other genes that, not surprisingly, are 'overexpressed' in the other species. It is most likely that these are in fact the same genes that are similarly expressed in both species.
I was thinking to remove all those ambiguous contigs, but it represents a lot of them (~50% of all contigs). Another idea I had was to blast a transcriptome with the other to obtain gene names instead of blasting against whole databases for every transcriptome.
What do you think would be the best approach to deal with this?