Hi, I should start by saying that I am very new to RNAseq analysis. I am analyzing an RNAseq data set from a de novo assembly and am running into a problem with duplicate annotations. The assembly was done with Trinity by someone other than myself. I am doing the differential expression analysis with DESeq2 in R, using the raw counts as input. The analysis itself goes smoothly: I extract the contigs that are upregulated in one treatment versus the other, then merge this list of differentially expressed contigs with their annotations so I know what they are.
This is where I am having issues.
I have found that some contigs in the upregulated list are annotated as the same genes as contigs in the downregulated list of the same comparison. In other words, the same gene appears in both the upregulated and the downregulated lists, and this happens with more than one gene.
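To make the problem concrete, here is a minimal sketch of how the overlap can be spotted. It is written in plain Python only for illustration (my actual analysis is in R), and the contig IDs, gene names, and fold changes are made up, not taken from my data.

```python
from collections import defaultdict

# Hypothetical significant DE results: contig ID -> log2 fold change.
de_results = {
    "contig_1": 2.1,
    "contig_2": -1.8,
    "contig_3": 3.0,
    "contig_4": -2.5,
}

# Hypothetical annotation table: contig ID -> annotated gene.
annotation = {
    "contig_1": "geneA",
    "contig_2": "geneA",
    "contig_3": "geneB",
    "contig_4": "geneC",
}

def conflicting_genes(de_results, annotation):
    """Return genes that have at least one up- and one downregulated contig."""
    directions = defaultdict(lambda: {"up": [], "down": []})
    for contig, lfc in de_results.items():
        gene = annotation.get(contig)
        if gene is None:
            continue  # skip contigs without an annotation
        if lfc > 0:
            directions[gene]["up"].append(contig)
        elif lfc < 0:
            directions[gene]["down"].append(contig)
    # Keep only genes that show up in both directions.
    return {g: d for g, d in directions.items() if d["up"] and d["down"]}

conflicts = conflicting_genes(de_results, annotation)
print(conflicts)
```

In this toy example only geneA is reported, because contig_1 is up and contig_2 is down while both carry the same annotation.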
I inherited this data set and don't know whether duplicate contigs were removed, so I am checking that now. However, I know of a few instances where two different contigs align to disjunct regions of the same gene; this is not something that removing duplicate contigs would solve, correct?
So my questions for the forum are these:
Would it be valid to add the counts for all of the contigs that map to the same gene, so that there is only one count value per treatment for that gene? Why or why not?
Or are these different isoforms of the same gene and thus should be kept separate? Is there a way to confirm this is the case?
What other solutions exist to solve this issue (multiple contigs matching the same gene)? Or is this an acceptable occurrence in RNAseq?
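To clarify what I mean by adding the counts in the first question, here is a minimal sketch, again in plain Python for illustration only. The count values, sample layout (two samples per treatment), and the `contig_to_gene` mapping are hypothetical.

```python
# Hypothetical raw counts: contig ID -> counts per sample
# (samples 1-2 = treatment A, samples 3-4 = treatment B).
raw_counts = {
    "contig_1": [10, 12, 30, 28],
    "contig_2": [40, 38, 5, 7],
    "contig_3": [3, 4, 20, 22],
}

# Hypothetical annotation: contig ID -> gene.
contig_to_gene = {
    "contig_1": "geneA",
    "contig_2": "geneA",
    "contig_3": "geneB",
}

def sum_counts_by_gene(raw_counts, contig_to_gene):
    """Collapse contig-level counts to one row of summed counts per gene."""
    gene_counts = {}
    for contig, counts in raw_counts.items():
        # Unannotated contigs keep their own ID as the row name.
        gene = contig_to_gene.get(contig, contig)
        if gene not in gene_counts:
            gene_counts[gene] = list(counts)
        else:
            gene_counts[gene] = [a + b for a, b in zip(gene_counts[gene], counts)]
    return gene_counts

gene_counts = sum_counts_by_gene(raw_counts, contig_to_gene)
print(gene_counts)
```

Note that in this toy example contig_1 and contig_2 change in opposite directions, and after summing, the geneA counts differ much less between treatments than either contig did alone. That masking effect is exactly why I am unsure whether summing is valid.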
Also, if you have references that corroborate your answer, listing those would be greatly appreciated. Thanks in advance for your help!