Question: How to change cellranger multimapping algorithm
Hi~ I'm currently using 10x cellranger to analyse single cell RNA-seq data. According to their algorithm, reads mapping confidently to more than one exon are discarded. However, there are paralogous genes in the genome that are largely identical, so all the reads for such genes are discarded. Therefore, I was wondering if there is a way to change the algorithm to count the first (or a random) confident alignment. Unfortunately, I wasn't able to locate the file containing the algorithm. Any hints would be appreciated. Thanks everyone!

The cellranger UMI deduplication algorithm does not handle reads that map to multiple genes, and there is no "easy" way to handle this situation. You may be interested in taking a look at our quantification tool, alevin, which we designed, in part, to help deal with these cases. In addition to having a methodology for handling reads that map to multiple genes, it is much faster than cellranger.

Thanks a lot! Those reads previously discarded by cellranger are showing up in my analysis now! It's an amazing tool.

May I ask why the Alevin output quant_mat.csv files contain numbers with decimals? Do the numbers still represent the number of transcripts detected? I plan to use Seurat for downstream data cleaning and clustering, but I'm not sure whether I should still perform all the normalization, etc., as I normally would for data generated by cellranger count. Thank you so much!

There's a tutorial on how to use alevin with Seurat here. The fractional values are not due to any normalization; they arise because it is sometimes impossible to resolve gene-ambiguous UMIs deterministically (based on parsimony). In that case, alevin resolves these UMIs probabilistically.
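To make the origin of the decimals concrete, here is a toy sketch of how a probabilistic split can produce fractional gene counts. It is not alevin's implementation: the function name `em_gene_counts`, the input format (one set of compatible genes per deduplicated UMI), and the simple EM update are all assumptions made for illustration.

```python
def em_gene_counts(umi_gene_sets, n_iter=100):
    """Toy EM (illustrative only, not alevin's code).

    Each UMI is shared among its compatible genes in proportion to the
    current abundance estimates, and the estimates are then updated.
    """
    genes = sorted({g for gene_set in umi_gene_sets for g in gene_set})
    counts = {g: 1.0 for g in genes}  # uniform starting estimates
    for _ in range(n_iter):
        new_counts = {g: 0.0 for g in genes}
        for gene_set in umi_gene_sets:
            total = sum(counts[g] for g in gene_set)
            for g in gene_set:
                # Fall back to an even split if every estimate is zero.
                share = counts[g] / total if total > 0 else 1.0 / len(gene_set)
                new_counts[g] += share
        counts = new_counts
    return counts


# Hypothetical cell: three UMIs unique to gene A, one unique to gene B,
# and one UMI compatible with both paralogs.
umis = [{"A"}, {"A"}, {"A"}, {"B"}, {"A", "B"}]
counts = em_gene_counts(umis)
print(counts)
```

In this made-up example the shared UMI ends up split roughly 0.75 / 0.25 between A and B, so the estimates are about 3.75 and 1.25. The totals still add up to the five UMIs observed, which is why the values can still be read as (estimated) molecule counts.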

I'm really sorry to bother you again, but we recently got some "paired end" sequencing data generated with a 5' capture protocol, and thus both R1 and R2 contain more than 150 bp. I was wondering if alevin can handle this? Thank you so much, Fan

thus both R1 and R2 contains more than 150 bp

What exactly does that mean? Do you have reads longer than 150 bp for both R1 and R2?

Usually with 3' capture, R1 is used only for the barcode and UMI, so we only sequence 26 bp. But with 5' capture, R1 is barcode + UMI + useful sequence...
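The R1 layout described above can be sketched as follows. The 16 bp barcode + 10 bp UMI split of the 26 bp prefix is an assumption (the thread only gives the 26 bp total), and `split_r1` is a hypothetical helper, not part of cellranger or alevin.

```python
# Assumed layout of the 26 bp prefix: 16 bp cell barcode + 10 bp UMI.
BARCODE_LEN = 16
UMI_LEN = 10


def split_r1(seq):
    """Split an R1 sequence into (barcode, umi, remainder)."""
    prefix_len = BARCODE_LEN + UMI_LEN
    if len(seq) < prefix_len:
        raise ValueError(f"R1 is shorter than {prefix_len} bp")
    barcode = seq[:BARCODE_LEN]
    umi = seq[BARCODE_LEN:prefix_len]
    remainder = seq[prefix_len:]  # empty for a 26 bp 3' capture R1
    return barcode, umi, remainder


# Made-up sequences: a 26 bp R1 (3' capture) and a longer R1 (5' capture).
r1_3prime = "A" * 16 + "C" * 10
r1_5prime = "A" * 16 + "C" * 10 + "GATTACA" * 5

print(split_r1(r1_3prime))
print(split_r1(r1_5prime))
```

For the 3' style read the remainder is empty, whereas for the 5' style read everything past the prefix is sequence that could in principle be aligned alongside R2.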

I don't think cellranger can do this, but the tool Alevin (github, bioRxiv paper) does support multi-mapping reads/UMIs, since it builds on Salmon quantification. Because it builds on Salmon, the quantifications will also be more accurate (and much faster).

