Hi! I recently aligned our samples (using STAR and the Drop-seq pipeline) to the mouse genome downloaded from the pre-made metadata links found on page 4 of the Drop-seq Alignment Cookbook. Link: https://github.com/broadinstitute/Drop-seq/blob/master/doc/Drop-seq_Alignment_Cookbook.pdf
When I processed the digital gene expression (DGE) matrix with SingleR, the run stopped because of duplicate gene names. I checked the gene names, and the cause of the duplicate was Rpl24, which appeared in both upper case and mixed case. Like this:
RPL24 <- human gene name
Rpl24 <- mouse gene name
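For anyone who wants to run the same check on their own matrix, here is a minimal Python sketch of how names that differ only by letter case could be found. It assumes the gene names have already been read from the DGE into a plain list; the function name `case_insensitive_duplicates` is made up for illustration and is not part of SingleR or the Drop-seq tools.

```python
from collections import defaultdict


def case_insensitive_duplicates(gene_names):
    """Return groups of gene names that differ only by letter case.

    Hypothetical helper for illustration; keys are the upper-cased names,
    values are the distinct spellings found in the input.
    """
    groups = defaultdict(set)
    for name in gene_names:
        groups[name.upper()].add(name)
    # Keep only the groups that contain more than one distinct spelling.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Toy input, not real DGE content.
genes = ["Rpl24", "RPL24", "Actb", "Gapdh"]
print(case_insensitive_duplicates(genes))
```

Exact duplicates (the same spelling listed twice) are deliberately not reported here, since the question is only about names that collide once case is ignored.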
I checked the DGE for more of these tricky duplicates and couldn't find any others. However, I found a few other strange gene names that don't seem to belong in the mouse genome. For example: RP23-103I12.13, HOTAIRM1_5, KCDT12, TMEM185B
I double-checked that I aligned to the mouse genome and not the mixed genome. If these were contaminants, they should have been filtered out at the alignment step, right? I looked up these genes among the Ensembl IDs from the genome index in geneinfo.tab, and their Ensembl IDs do not exist there. In fact, these genes usually don't have an Ensembl ID at all.
Has anyone else noticed this problem? Any idea how it can happen? Can I safely remove these strange genes from my DGE and move on?
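In case it helps anyone reproduce the comparison, this is a rough Python sketch of how DGE gene names could be checked against the annotation. It assumes the annotation is a GTF whose attribute column carries a `gene_name "..."` entry (as Ensembl/GENCODE-style GTFs do); whether the Drop-seq metadata GTF looks exactly like this is an assumption on my part. The function names and the toy lines below are hypothetical and only there to make the example runnable.

```python
import re

# Matches the gene_name attribute in an Ensembl/GENCODE-style GTF line.
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def gene_names_from_gtf_lines(lines):
    """Collect the set of gene_name values from GTF lines (hypothetical helper)."""
    names = set()
    for line in lines:
        if line.startswith("#"):  # skip GTF header/comment lines
            continue
        match = GENE_NAME_RE.search(line)
        if match:
            names.add(match.group(1))
    return names


def split_by_annotation(dge_genes, annotated_names):
    """Split DGE gene names into those present in the annotation and those missing."""
    known = [g for g in dge_genes if g in annotated_names]
    unknown = [g for g in dge_genes if g not in annotated_names]
    return known, unknown


# Toy GTF lines with placeholder IDs, not taken from the real annotation.
gtf_lines = [
    "#!toy header",
    'chr1\ttoy\tgene\t1\t100\t.\t+\t.\tgene_id "GENE_ID_1"; gene_name "Rpl24";',
    'chr1\ttoy\tgene\t200\t300\t.\t-\t.\tgene_id "GENE_ID_2"; gene_name "Actb";',
]
dge_genes = ["Rpl24", "Actb", "TMEM185B"]

annotated = gene_names_from_gtf_lines(gtf_lines)
known, unknown = split_by_annotation(dge_genes, annotated)
print("not in annotation:", unknown)
```

The match is case-sensitive on purpose, so an upper-case name is reported as missing when only the mixed-case spelling is annotated. For a real file, the lines would come from something like `open(path)` on the GTF rather than the toy list.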