Multiple samples and species with kallisto bus(tools)
10 weeks ago

Hello Biostars,

I'm working with a single-nucleus RNA-seq dataset generated with 10X. I have 8 samples, which I would usually run individually through cellranger and merge with Seurat. However, in this case we expect that at least some of the samples will contain cells from a human source in addition to mouse cells. Some searching has led me to kallisto bustools for mixed-species single cell, in particular this tutorial that I've been attempting to follow. I've run it and the output seems to make sense in terms of cell numbers, but the matrix output doesn't let me distinguish which droplet came from which sample, which is vital for this study.

The mixed species tutorial suggests that kallisto bus can handle multiple samples (coincidentally they also use 8 samples in the tutorial), but the manual doesn't list an option to keep the sample ID data. This separate bustools tutorial suggests that it might be a case of running kallisto bus for each sample, then aggregating the count matrices afterwards. Does anyone know if that is indeed the case, or am I missing something?

Thanks in advance.

kallisto scrnaseq bustools • 325 views
10 weeks ago
dsull ★ 4.2k


There are two questions here: 1) Mixed species, and 2) Multiple samples

Let's start with question 1:

You don't know in advance which cells belong to which species: each cell is just a barcode, so you have to work out the species yourself. Map the reads against a combined index of all species, load the result into Seurat or scanpy, and filter for cells with sufficient UMIs. If the vast majority of a cell's UMIs are assigned to genes of species A, that cell probably belongs to species A.
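To make the species call concrete, here is a minimal pure-Python sketch. It assumes the combined-index counts have already been loaded into a nested dict `{barcode: {gene_id: umi_count}}`, that gene IDs are Ensembl-style (`ENSG...` for human, `ENSMUSG...` for mouse), and that the `min_umis` and `purity` cutoffs are placeholders you would tune on your own data. The function names are made up for illustration; in practice you would do the same arithmetic on the sparse matrix in Seurat or scanpy.

```python
from collections import defaultdict


def species_of_gene(gene_id):
    """Guess species from an Ensembl-style gene ID prefix (an assumption)."""
    if gene_id.startswith("ENSMUSG"):
        return "mouse"
    if gene_id.startswith("ENSG"):
        return "human"
    return None


def assign_species(counts, min_umis=500, purity=0.9):
    """Call a species per barcode from combined-index counts.

    counts: {barcode: {gene_id: umi_count}}
    Returns {barcode: "human" | "mouse" | "mixed"}; barcodes with fewer
    than min_umis species-assignable UMIs are dropped.
    """
    calls = {}
    for barcode, genes in counts.items():
        per_species = defaultdict(int)
        for gene_id, n in genes.items():
            species = species_of_gene(gene_id)
            if species is not None:
                per_species[species] += n
        total = sum(per_species.values())
        if total == 0 or total < min_umis:
            continue  # too few UMIs to make a call
        top, top_n = max(per_species.items(), key=lambda kv: kv[1])
        # Barcodes without a clear majority are likely doublets or ambient RNA.
        calls[barcode] = top if top_n / total >= purity else "mixed"
    return calls
```

Barcodes called "mixed" are worth inspecting rather than forcing into one species.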

Now that you have your cell_barcode:species assignment, run kallisto using individual species indices on your data. You'll get one count matrix per species, and then you can filter each count matrix for the barcodes belonging to a particular species.
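The filtering step is simple once you have the assignment. A rough sketch, assuming the single-species counts are held as `{barcode: {gene_id: umi_count}}` and the assignment as `{barcode: species}` (both layouts and the function name are hypothetical, chosen only for illustration):

```python
def filter_by_species(counts, calls, species):
    """Keep only barcodes that were assigned to the given species.

    counts: {barcode: {gene_id: umi_count}} from a single-species index run
    calls:  {barcode: species_label} from the combined-index run
    """
    keep = {barcode for barcode, label in calls.items() if label == species}
    return {barcode: genes for barcode, genes in counts.items() if barcode in keep}
```

You would call this once per species, e.g. with `"mouse"` on the mouse-index matrix and `"human"` on the human-index matrix.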

You probably don't want to aggregate all the human+mouse count matrices together because human and mouse have different genes -- you'd probably want to analyze them separately.

Now let's move on to question 2:

For this, you could pool all the samples and analyze them in a single run. However, I would caution against this: barcodes might clash between different samples, and you lose the information about which cell comes from which sample.

The best approach to this would be to analyze them separately and then aggregate the 8 different count matrices (in this way, you still retain your sample information).
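As a sketch of what aggregating while retaining sample information means, the idea is to prefix every barcode with its sample ID before combining, so identical barcodes from different samples stay distinct. The data layout `{sample_id: {barcode: {gene_id: umi_count}}}` and the function name are assumptions for illustration only. Seurat's `merge()` (via `add.cell.ids`) and `anndata.concat()` (via `keys` and `index_unique`) do the equivalent on real objects.

```python
def merge_samples(per_sample):
    """Combine per-sample count tables into one, keeping sample identity.

    per_sample: {sample_id: {barcode: {gene_id: umi_count}}}
    Returns (merged_counts, sample_of) where merged barcodes look like
    "<sample_id>_<barcode>" and sample_of maps each one back to its sample.
    """
    merged = {}
    sample_of = {}
    for sample_id, counts in per_sample.items():
        for barcode, genes in counts.items():
            cell_id = f"{sample_id}_{barcode}"
            merged[cell_id] = dict(genes)  # copy so the inputs are not mutated
            sample_of[cell_id] = sample_id
    return merged, sample_of
```

With this, the same barcode sequence appearing in two samples ends up as two separate cells rather than being summed together.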


Thanks @dsull. Inevitably just after I posted, I found that cellranger can also take a custom reference constructed from multiple species, so that adds another option into the mix.

I think the approach you suggest sounds reasonable though:

  1. run each sample on a combined index, then assign barcodes:species
  2. run each sample separately on species-specific index and filter for the relevant barcodes, then merge/aggregate downstream, retaining the sample ID

I will see how this goes, thank you again!

