I am working on RNASeq data for differential expression (DE) analysis from a non-model organism (Artemia franciscana, a eukaryotic arthropod) whose draft genome was published last year. I should note that although the genome was assembled with a mix of Illumina and PacBio reads, it is still fragmented into 16,000 scaffolds, and only ~50% of BUSCO proteins were found complete in its annotation. The paper describing its assembly can be found here.
We used polyA selection on 20 different samples, and I've tried mapping the 2x150bp reads to the genome using hisat2. 65-75% of reads aligned to the genome (varying a bit by sample), but of those that aligned, only about 25% mapped to exonic regions, another 15% to intronic regions, and the remaining 60% or so to intergenic regions. These figures come from Qualimap with proportional assignment of multimapped reads, which make up roughly 10% of reads.
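To put those percentages on a common scale, here is a minimal sketch that converts shares of aligned reads into shares of all sequenced reads. The function name `fraction_of_total` is only illustrative, and the inputs are the approximate figures quoted above rather than exact per-sample values.

```python
def fraction_of_total(aligned_rate, share_of_aligned):
    """Convert a share of aligned reads into a share of all sequenced reads."""
    return aligned_rate * share_of_aligned


# Approximate shares of aligned reads, as reported by Qualimap above.
shares = {"exonic": 0.25, "intronic": 0.15, "intergenic": 0.60}

# Lower and upper ends of the observed genome alignment rate.
for aligned_rate in (0.65, 0.75):
    for region, share in shares.items():
        pct = 100 * fraction_of_total(aligned_rate, share)
        print(f"alignment rate {aligned_rate:.0%}, {region}: {pct:.1f}% of all reads")
```

With these inputs, exonic reads come to roughly 16-19% of all sequenced reads and intergenic reads to roughly 39-45%.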
The most common answer I've found here to questions like this is suspected DNA contamination. I suspect, however, that this may be skewed by people working with high-quality genome annotations (e.g. human, mouse). Given my genome, how likely do you think it is that this pattern can be explained by poor annotation of real mRNAs that my RNASeq data have captured?
Other relevant clues:
- The 20 samples were collected, RNA extracted, and sent for sequencing in 4 different batches, months apart from each other. Any contamination would therefore have to be systematic to the protocol used.
- I also aligned my reads to a transcriptome published in the genome paper. With both hisat2 and bowtie2, only 25-35% of my reads mapped.
- 5'-3' bias when aligned to the genome was between 0.95 and 1.2 across all samples (Qualimap).
You could try assembling your own transcriptome from the data you have; it sounds like you have a good number of samples. Then compare your transcriptome to the published one. Depending on the quality of your assembly and on what is already published, you may well end up adding "new" genes that the current annotation is missing.
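The comparison step could be sketched roughly as follows, assuming the assembled transcripts have already been placed on the genome (for example via a genome-guided assembly) so that each has scaffold coordinates. The function `find_novel_transcripts` and all IDs and coordinates below are hypothetical toy values, not taken from the Artemia annotation.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def find_novel_transcripts(assembled, annotated):
    """Return IDs of assembled transcripts that overlap no annotated gene.

    assembled: list of (transcript_id, scaffold, start, end)
    annotated: list of (gene_id, scaffold, start, end)
    """
    # Group annotated gene intervals by scaffold for quick lookup.
    genes_by_scaffold = defaultdict(list)
    for _gene_id, scaffold, start, end in annotated:
        genes_by_scaffold[scaffold].append((start, end))

    novel = []
    for tx_id, scaffold, start, end in assembled:
        genes = genes_by_scaffold.get(scaffold, [])
        if not any(overlaps(start, end, g_start, g_end) for g_start, g_end in genes):
            novel.append(tx_id)
    return novel


# Hypothetical toy data for illustration only.
annotated = [("geneA", "scaffold_1", 100, 900), ("geneB", "scaffold_2", 50, 400)]
assembled = [
    ("tx1", "scaffold_1", 150, 850),    # overlaps geneA
    ("tx2", "scaffold_1", 2000, 3500),  # intergenic on scaffold_1
    ("tx3", "scaffold_3", 10, 600),     # scaffold with no annotated genes
]

print(find_novel_transcripts(assembled, annotated))  # ['tx2', 'tx3']
```

Transcripts flagged this way are only candidates: they would still need checking (for example for splicing or coding potential) before being treated as genes missing from the annotation. The simple linear scan is fine for a toy example, but a real annotation would call for an interval tree or an existing interval-comparison tool.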