I am analyzing some RNA-seq data generated from a treatment/control experiment on the NG108-15 neuroblastoma-glioma hybrid cell line. The sequencing was done using Illumina SE 100 bp reads.
When aligning the reads against the reference (I used both HISAT2 and TopHat, against mm10 and GRCm38), approximately 20-30% of the reads align to multiple loci.
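For anyone wanting to reproduce the multi-mapping estimate, here is a minimal Python sketch of one way to compute it from SAM text. It assumes the aligner writes an `NH:i:` tag (number of reported alignments) on each record, counts only primary alignments so that each single-end read is counted once, and reports the fraction relative to aligned reads rather than all reads. The function name `multimapped_fraction` is made up for this illustration.

```python
def multimapped_fraction(sam_lines):
    """Return (multi, mapped, fraction) for an iterable of SAM text lines.

    Assumption: each alignment record carries an NH:i:<n> tag; records
    without the tag are treated as uniquely aligned.
    """
    mapped = 0
    multi = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800)
        # records so that every aligned read is counted exactly once.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        mapped += 1
        nh = 1
        for tag in fields[11:]:  # optional fields start at column 12
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
                break
        if nh > 1:
            multi += 1
    fraction = multi / mapped if mapped else 0.0
    return multi, mapped, fraction
```

The input can be any iterable of SAM lines, for example an open file handle on a SAM file or the text produced by `samtools view -h` on a BAM file.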
Upon further inspection, it looks like a large fraction of the multiply aligning reads hit repeats of a "targeted KO-first, conditional ready, lacZ-tagged mutant allele." I also subsampled 1M reads from my samples and aligned them with BLAST against accession JN958699.1. Counting only perfect BLAST matches, about 12% of the 1M subsampled reads align perfectly to JN958699.1.
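To make "perfect BLAST match" concrete, here is a minimal Python sketch of how such a count could be taken. It assumes BLAST tabular output (`-outfmt 6`) with the default column order (qseqid, sseqid, pident, length, mismatch, gapopen, ...), and it defines a perfect match as 100% identity over the full 100 bp read with no mismatches or gap openings. Both the definition and the function name `count_perfect_reads` are assumptions made for illustration, not necessarily the exact criteria used above.

```python
def count_perfect_reads(blast_lines, read_length=100):
    """Count distinct reads with at least one perfect, full-length hit.

    Assumption: input lines are BLAST -outfmt 6 rows with default columns.
    """
    perfect = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        qseqid = cols[0]
        pident = float(cols[2])    # percent identity
        aln_length = int(cols[3])  # alignment length
        mismatch = int(cols[4])
        gapopen = int(cols[5])
        if (pident == 100.0 and aln_length == read_length
                and mismatch == 0 and gapopen == 0):
            # A set ensures a read hitting several copies counts once.
            perfect.add(qseqid)
    return len(perfect)
```

Dividing the returned count by the number of subsampled reads (1M here) gives the fraction of reads with a perfect match.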
Both the mm10 and GRCm38 references seem to contain hundreds of paralogs of that lacZ-tagged mutant allele.
Does anyone know what the repeat "targeted KO-first, conditional ready, lacZ-tagged mutant allele" is involved in, and what would lead to its enrichment in an RNA-seq experiment? Note that the multiply aligning reads are as abundant in the control as in the treatment.
Thank you so much for any suggestions or hints you might be able to offer.