I am currently working on my very first RNA-seq study and have run into a dilemma where input from more experienced bioinformaticians would be greatly appreciated.
For a differential gene expression study in a non-model organism, a de novo reference transcriptome was assembled from 300 M reads with Trinity. For each of the 3 experimental conditions (1 negative control, 1 positive control and the treatment of interest), triplicate samples were sequenced to a depth of 25 M reads.
The reference transcriptome was annotated with Trinotate.
To determine differential gene expression I am using the Kallisto/Sleuth pipeline, and this is where my best-practices dilemma comes in:
A number of the Trinity transcripts could not be annotated by Trinotate (NA) and are dropped from the Sleuth analysis when using the expression "so <- sleuth_prep(s2c, ~treat, target_mapping = annotation, aggregation_column = 'gene')".
I played around with the annotation file and replaced the NAs in the gene column with the corresponding Trinity transcript IDs. With that change, some of the previously dropped transcripts came out as significantly differentially expressed.
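The NA replacement described above can be sketched roughly as follows in base R. This is only a minimal illustration on a toy data frame: the transcript IDs and the gene name are made up, and the only names taken from the actual setup are the "gene" column and the "target_id" column that sleuth expects in target_mapping.

```r
# Toy annotation table (hypothetical rows, for illustration only).
# "target_id" holds the Trinity transcript IDs, "gene" the Trinotate gene name.
annotation <- data.frame(
  target_id = c("TRINITY_DN1_c0_g1_i1",
                "TRINITY_DN1_c0_g1_i2",
                "TRINITY_DN2_c0_g1_i1"),
  gene      = c("GENE_A", "GENE_A", NA),
  stringsAsFactors = FALSE
)

# Find the transcripts without an annotation ...
missing <- is.na(annotation$gene)

# ... and use their own transcript ID as a placeholder "gene",
# so that they are no longer dropped during aggregation.
annotation$gene[missing] <- annotation$target_id[missing]

print(annotation)
```

With this placeholder each non-annotated transcript forms its own single-member group in the "gene" column, which is exactly the situation asked about below.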
What is the right thing to do?
Would you let Sleuth drop the non-annotated transcripts, even though some of them are significantly differentially expressed?
Or would you include these transcripts in the "gene" column with their corresponding Trinity transcript IDs, even though they cannot be analyzed at the gene level (their transcript isoforms cannot be collapsed in the analysis like those of the annotated genes)?