Question: Unexpected genes in mouse DGE (Digital Gene Expression) (gene-cell expression matrix)
Hi! I recently aligned our samples (using STAR and the Drop-seq pipeline) to the mouse genome downloaded from the pre-made metadata links found on page 4 of the Drop-seq cookbook. Link:

When I processed the digital gene expression matrix with SingleR, the run stopped because of duplicate gene names. I checked the gene names, and the duplicate was Rpl24, which existed in both upper-case and lower-case form. Like this:

RPL24 <- human gene name
Rpl24 <- mouse gene name

Checking the DGE for more of these tricky duplicates, I couldn't find any others. However, I found a few other strange gene names that don't seem to belong in the mouse genome. For example: RP23-103I12.13, HOTAIRM1_5, KCDT12, TMEM185B
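For anyone who wants to repeat the duplicate check, here is a minimal sketch of how gene names that differ only by letter case could be found. It assumes the row names of the DGE have already been read into a plain list of strings; the function name and the example list are made up for illustration.

```python
from collections import defaultdict


def case_insensitive_duplicates(gene_names):
    """Group gene names that become identical once letter case is ignored."""
    groups = defaultdict(set)
    for name in gene_names:
        groups[name.upper()].add(name)
    # Keep only the groups that contain more than one distinct spelling.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Hypothetical gene list standing in for the DGE row names.
genes = ["Rpl24", "RPL24", "Actb", "Gapdh", "TMEM185B"]
print(case_insensitive_duplicates(genes))
# prints {'RPL24': ['RPL24', 'Rpl24']}
```

This only catches symbols that collide with another symbol in the same matrix. A human-style name with no mouse counterpart in the matrix (such as TMEM185B in the example list) is not reported.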

I double-checked to make sure I aligned to the mouse genome and not the mixed genome. If these were contaminants, they should have been filtered out at the alignment step, right? Looking in the genome index, the Ensembl IDs for these genes do not exist. In fact, these genes usually don't have an Ensembl ID at all.

Has anyone else noticed this problem? Any idea how it can happen? Can I safely remove these strange genes from my DGE and move on?

Why do you have human gene symbols in there if you've aligned only to mouse? Is it possible that you've used the wrong GTF file?

That's what is so strange about it! I don't think I aligned to the mixed GTF file; my log says it's the mouse one, and in the geneinfo I can only see ENSMUS numbers. I'm very confused.

If you're using files you downloaded, check that they correspond to what they should be. I would also check all files for occurrences of human gene symbols to try and find out at which step they appear in the pipeline.
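One way to do that check on the annotation file is sketched below. It assumes a GTF whose attribute column uses the usual `gene_id "..."; gene_name "...";` layout. The function name and the example lines (including the placeholder gene IDs) are invented for illustration and are not taken from the Drop-seq metadata.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def find_symbols(gtf_lines, symbols):
    """Map each queried symbol to the gene_ids it is attached to in the GTF."""
    wanted = set(symbols)
    hits = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        name = GENE_NAME_RE.search(line)
        if name is None or name.group(1) not in wanted:
            continue
        gene_id = GENE_ID_RE.search(line)
        # Record "NA" when a line carries a gene_name but no gene_id.
        hits.setdefault(name.group(1), set()).add(gene_id.group(1) if gene_id else "NA")
    return hits


# Made-up GTF lines; the gene IDs are placeholders, not real Ensembl IDs.
example_lines = [
    "#!genome-build example",
    'chr16\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "MOUSE_ID_1"; gene_name "Rpl24";',
    'chr16\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "OTHER_ID_1"; gene_name "RPL24";',
]
print(find_symbols(example_lines, ["RPL24", "TMEM185B"]))
# prints {'RPL24': {'OTHER_ID_1'}}
```

For a real file, an open file handle can be passed in place of `example_lines`. If a suspicious symbol turns up with a gene ID that is not a mouse ID, the annotation itself is the likely source; if it never appears in the GTF, the symbol must be introduced at a later step.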

