I am aligning raw RNA-seq data from mouse samples for downstream analysis of differentially expressed genes. I would like to do this entirely with UCSC data, to avoid mix-ups in the nomenclature and to be able to view bedGraph files in the UCSC Genome Browser later on. I ran the alignment with HISAT2, using default parameters and the mm10 genome index from HISAT2.
So here is my question: which genome annotation should I use for assembling/quantifying the expressed genes (StringTie)?
I have retrieved both the RefSeq (all) data and the UCSC Genes (knownGene) data as GTF files from the UCSC Table Browser, but with either annotation I get extremely high rates of ambiguous gene mappings (55-65%, based on analysis with QoRTs). I previously went through the same process using the Ensembl GRCm38 genome index from HISAT2 and the corresponding GRCm38.91 annotation from the Ensembl website, which yielded only around 10% ambiguous mappings.
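One thing I could check on my side: as far as I understand, GTF files exported from the Table Browser may carry the transcript ID in the gene_id attribute, so every transcript would look like a separate gene and overlapping isoforms would be counted as ambiguous. Below is a minimal sketch of how this could be tested, assuming standard tab-separated GTF with `gene_id "..."; transcript_id "...";` attributes. The function names and the example records are my own illustrations, not taken from any tool or real file.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attr_field):
    """Turn the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(attr_field))


def summarize_gtf(lines):
    """Count records whose gene_id equals their transcript_id.

    `lines` is any iterable of GTF lines (e.g. an open file handle).
    Returns a dict with record, gene and transcript counts.
    """
    records = 0
    same_id_records = 0
    transcripts_per_gene = defaultdict(set)

    for line in lines:
        # Skip blank lines and comment/header lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        attrs = parse_attributes(fields[8])
        gene_id = attrs.get("gene_id")
        transcript_id = attrs.get("transcript_id")
        if gene_id is None or transcript_id is None:
            continue  # e.g. gene-level rows without a transcript_id
        records += 1
        if gene_id == transcript_id:
            same_id_records += 1
        transcripts_per_gene[gene_id].add(transcript_id)

    return {
        "records": records,
        "same_id_records": same_id_records,
        "genes": len(transcripts_per_gene),
        "transcripts": sum(len(t) for t in transcripts_per_gene.values()),
    }


# Made-up example records, for illustration only
example = [
    'chr1\tsrcA\texon\t100\t200\t.\t+\t.\tgene_id "tx1"; transcript_id "tx1";\n',
    'chr1\tsrcA\texon\t150\t300\t.\t+\t.\tgene_id "tx2"; transcript_id "tx2";\n',
    'chr1\tsrcB\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "txA1";\n',
]
summary = summarize_gtf(example)
print(summary)
```

On a real file this could be called as `summarize_gtf(open("annotation.gtf"))`, where the file name is a placeholder. If nearly all records have identical gene_id and transcript_id, and the number of "genes" equals the number of transcripts, the annotation is probably not grouping isoforms into genes, which might explain the difference from the Ensembl GTF.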
Can anyone tell me which annotation people generally use for UCSC genes? Or is this an issue with the UCSC genome index from HISAT2?
I'd appreciate any feedback on this.