Question

How to change cellranger multimapping algorithm

1

5.3 years ago

changxu.fan ▴ 70

Hi~ I'm currently using 10x Cell Ranger to analyse single-cell RNA-seq data. According to its algorithm, reads that map confidently to more than one gene are discarded. However, some paralogous genes in the genome are largely identical, so all the reads for such genes are discarded. I was therefore wondering whether there is a way to change the algorithm so that it counts the first (or a random) confident alignment. Unfortunately, I wasn't able to locate the file containing the algorithm. Any hints would be appreciated. Thanks, everyone!

RNA-Seq • 6.6k views

ADD COMMENT • link updated 3.3 years ago by predeus ★ 1.9k • written 5.3 years ago by changxu.fan ▴ 70

0

There's another option now if you do not want to use the pseudo-alignment algorithms used by salmon/alevin and kallisto. STAR has a workflow named STARsolo that gives results that correlate very well with Cell Ranger, but correctly accounts for multimappers.

ADD REPLY • link 3.3 years ago by predeus ★ 1.9k

1

Is there any documentation on what STARsolo does with multimappers? The manual does not seem to cover this specifically for STARsolo.

ADD REPLY • link 3.3 years ago by ATpoint 81k

1

As far as I am aware, neither Cell Ranger nor STARsolo "handles" gene-ambiguous reads. Such reads are discarded by those pipelines, because the UMI resolution algorithm assumes that related UMIs (UMIs that will be deduplicated) align to the same gene.
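
To make the consequence for paralogs concrete, here is a minimal toy sketch in Python of "discard gene-ambiguous reads, then deduplicate UMIs". It is not the actual Cell Ranger or STARsolo code; the function name, the record layout, and the gene names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def count_unique_gene_umis(reads):
    """Toy UMI counting that drops gene-ambiguous reads.

    reads: iterable of (cell_barcode, umi, genes) tuples, where genes is a
    tuple of every gene the read is compatible with (hypothetical layout).
    Returns (counts, discarded_reads), with counts keyed by (cell, gene).
    """
    seen = set()               # (cell, umi, gene) triples already counted
    counts = defaultdict(int)  # (cell, gene) -> number of distinct UMIs
    discarded_reads = 0
    for cell, umi, genes in reads:
        if len(genes) != 1:
            # The read is compatible with several genes (or none): drop it.
            discarded_reads += 1
            continue
        key = (cell, umi, genes[0])
        if key in seen:
            continue           # duplicate of a UMI that was already counted
        seen.add(key)
        counts[(cell, genes[0])] += 1
    return dict(counts), discarded_reads

reads = [
    ("CELL1", "AAAC", ("GeneA",)),
    ("CELL1", "AAAC", ("GeneA",)),                # PCR duplicate, same UMI
    ("CELL1", "GGTT", ("ParalogX", "ParalogY")),  # gene-ambiguous read
    ("CELL1", "CCTA", ("GeneB",)),
]
counts, discarded_reads = count_unique_gene_umis(reads)
print(counts, discarded_reads)
```

In this toy input, the read shared by ParalogX and ParalogY never reaches the count matrix, which is the behaviour the original question describes.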

ADD REPLY • link 3.3 years ago by Rob 6.5k

0

OK, perhaps I am confused: are STARsolo and STARsolo-Quant the same thing?

I've looked through this presentation, and it describes what sounds like a typical EM-based approach that can account for multimappers, similar to what RSEM/kallisto/salmon use:

https://f1000research.com/slides/8-1897

ADD REPLY • link 3.3 years ago by predeus ★ 1.9k

2

STARsolo is not STARsolo-Quant. STARsolo is the single-cell mode of STAR; it is actively developed, maintained, and improved. It is usable today as a much more efficient, near drop-in replacement for Cell Ranger, and it uses a UMI resolution algorithm specifically designed to be very similar to the one used in Cell Ranger. STARsolo-Quant, on the other hand, is the protocol discussed in the slides you link, about which there was a talk at Genome Informatics in 2019. It was (or is) a research project, but I don't know of any official documentation on how to run or use the protocol. It also works differently from STARsolo itself (and from alevin, Cell Ranger, and kallisto): all available documentation suggests that it performs multimapping resolution (a) at the transcript level and (b) only at the cluster level. By contrast, alevin performs gene-level multimapping resolution, and does so at the cell level. The other methods (including STARsolo) discard gene-multimapping UMIs at the cell level, so those UMIs are not counted in the gene x cell count matrix that these tools output.

ADD REPLY • link 3.3 years ago by Rob 6.5k

1

Thank you, this is very helpful.

ADD REPLY • link 3.3 years ago by predeus ★ 1.9k

1

5.3 years ago

Kristoffer Vitting-Seerup ★ 4.0k

I don't think cellranger can do this, but the tool alevin (GitHub, bioRxiv paper) does support multi-mapping reads/UMIs since it builds on Salmon quantification. Because it builds on Salmon, the quantifications will also be more accurate (and much faster).

ADD COMMENT • link 5.3 years ago by Kristoffer Vitting-Seerup ★ 4.0k

score 5 · Accepted Answer · 2019-01-06

5

5.3 years ago

Rob 6.5k

The cellranger UMI deduplication algorithm does not handle reads that map to multiple genes, and there is no "easy" way to handle this situation. You may be interested in taking a look at our quantification tool, alevin, which we've designed, in part, to help deal with these cases. In addition to having a methodology for handling reads that map to multiple genes, it is much faster than cellranger.

ADD COMMENT • link 5.3 years ago by Rob 6.5k

2

Thanks a lot! The reads that cellranger previously discarded are showing up in my analysis now! It's an amazing tool.

ADD REPLY • link 5.3 years ago by changxu.fan ▴ 70

0

May I ask why the alevin output file quant_mat.csv contains numbers with decimals? Do the numbers still represent the number of transcripts detected? I plan to use Seurat for downstream data cleaning and clustering, but I'm not sure whether I should still perform all the normalization, etc., that I would normally do for data generated by cellranger count. Thank you so much!

ADD REPLY • link 5.2 years ago by changxu.fan ▴ 70

2

There's a tutorial on how to use alevin with Seurat here. The fractional values are not due to any normalization; they arise because it is sometimes impossible to resolve gene-ambiguous UMIs deterministically (based on parsimony). In those cases, alevin resolves the UMIs probabilistically.
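
As a rough illustration of how probabilistic resolution produces fractional counts, the sketch below runs a generic expectation-maximization (EM) loop over the UMIs of a single cell. This is a textbook-style toy and not alevin's actual implementation; the function name, the input layout, and the gene names are assumptions made for the example.

```python
def resolve_umis_em(umi_gene_sets, n_iter=50):
    """Toy EM that splits gene-ambiguous UMIs across compatible genes.

    umi_gene_sets: list of sets, one per deduplicated UMI in a cell, each
    holding the genes that UMI is compatible with (hypothetical layout).
    Returns a dict gene -> expected (possibly fractional) UMI count.
    """
    genes = sorted(set().union(*umi_gene_sets))
    # Start from a uniform guess of relative gene abundance.
    abundance = {g: 1.0 / len(genes) for g in genes}
    counts = {g: 0.0 for g in genes}
    for _ in range(n_iter):
        # E-step: share each UMI among its genes in proportion to abundance.
        counts = {g: 0.0 for g in genes}
        for gene_set in umi_gene_sets:
            total = sum(abundance[g] for g in gene_set)
            for g in gene_set:
                counts[g] += abundance[g] / total
        # M-step: re-estimate abundance from the expected counts.
        n = sum(counts.values())
        abundance = {g: c / n for g, c in counts.items()}
    return counts

umis = [{"ParalogX"}, {"ParalogX"}, {"ParalogY"}, {"ParalogX", "ParalogY"}]
expected = resolve_umis_em(umis)
print(expected)
```

Here the single ambiguous UMI is split between the two paralogs in proportion to their uniquely supported counts, so each gene ends up with a non-integer value while the total still equals the number of UMIs.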

ADD REPLY • link 5.2 years ago by Rob 6.5k

0

I'm really sorry to bother you again, but we recently got some "paired-end" sequencing data generated with a 5' capture protocol, so both R1 and R2 contain more than 150 bp. I was wondering whether alevin can handle this? Thank you so much, Fan

ADD REPLY • link 5.2 years ago by changxu.fan ▴ 70

0

thus both R1 and R2 contains more than 150 bp

What exactly does that mean? Do you have reads longer than 150 bp for both R1 and R2?

ADD REPLY • link 5.2 years ago by GenoMax 141k

0

Usually, with 3' capture, R1 is used only for the barcode and UMI, so we only sequence 26 bp. But with 5' capture, R1 is barcode + UMI + useful sequence....
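
The R1 layout described above can be sketched as a simple split. The lengths used here (16 bp barcode, 10 bp UMI) are an assumption chosen to add up to the 26 bp mentioned above; the function name is hypothetical, and real 5' libraries may carry additional adapter sequence after the UMI that would need trimming.

```python
def split_r1(seq, barcode_len=16, umi_len=10):
    """Split an R1 read into (barcode, umi, remaining_sequence).

    barcode_len and umi_len are assumed defaults, not values confirmed
    for any particular chemistry; adjust them to the kit actually used.
    """
    prefix_len = barcode_len + umi_len
    if len(seq) < prefix_len:
        raise ValueError("read is shorter than barcode + UMI")
    barcode = seq[:barcode_len]
    umi = seq[barcode_len:prefix_len]
    remainder = seq[prefix_len:]  # empty for a 26 bp 3'-style R1
    return barcode, umi, remainder

r1_3prime = "A" * 16 + "C" * 10               # barcode + UMI only
r1_5prime = "A" * 16 + "C" * 10 + "GATTACA"   # barcode + UMI + cDNA
print(split_r1(r1_3prime))
print(split_r1(r1_5prime))
```

With a 26 bp read the remainder is empty, whereas a longer 5' R1 leaves sequence that could in principle be aligned alongside R2.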

ADD REPLY • link 5.2 years ago by changxu.fan ▴ 70