Question

Help with kallisto-bustools: how to interpret the inspect.json output file

0

23 months ago

bompipi95 ▴ 150

Hi everyone,

I am trying to understand the entries of the inspect.json file that is output by kb-python, the wrapper tool from the Pachter lab. I believe this file is produced by bustools. I ran kb-python (pseudoalignment and quantification) on the standard PBMC_1K_V3 scRNA-seq dataset from 10x Chromium.

Here are the contents of the inspect.json file I obtained from running kb-python on the PBMC_1K_V3 dataset:

{
"numRecords": 18768851,
"numReads": 39738410,
"numBarcodes": 518890,
"medianReadsPerBarcode": 3.000000,
"meanReadsPerBarcode": 76.583496,
"numUMIs": 8050936,
"numBarcodeUMIs": 13514564,
"medianUMIsPerBarcode": 1.000000,
"meanUMIsPerBarcode": 26.045143,
"gtRecords": 7713525,
"numBarcodesOnWhitelist": 239530,
"percentageBarcodesOnWhitelist": 46.162000,
"numReadsOnWhitelist": 38105469,
"percentageReadsOnWhitelist": 95.890774
}

My questions are:

1. What do the various keys mean? For example, what do numRecords and numBarcodes mean? I suppose numRecords is the number of entries in the BUS file, but why is numBarcodes so high? The FASTQ files analysed correspond to approximately 1000 cells, so why does the number of barcodes (I assume cell barcodes) exceed the number of cells by this much? I am assuming that the number of corrected barcodes equals the number of cells.
2. How important is this file for checking the performance of kb-python? I can understand the output of the kb_info.json and run_info.json files as they are self-explanatory, but the contents of inspect.json baffle me and I have yet to find any documentation of them online.

For your information, I am also sharing the Cell Ranger summary output (cellRanger_summary) and a screenshot of run_info.json.

[Image: Cell Ranger summary and run_info.json screenshot]

Thank you for reading and I appreciate your time.

scRNAseq bustools kallisto • 1.1k views

updated 10 months ago by Ram 43k • written 23 months ago by bompipi95 ▴ 150

score 1 · Answer 1 · 2022-05-20

1

23 months ago

bompipi95 ▴ 150

On further thought and reading, my assumption that the number of barcodes equals the number of cells is incorrect for droplet-based sequencing technologies. GEMs (Gel Beads-in-emulsion) are droplets that each hold a gel bead carrying oligos with a cell barcode, a UMI, and an oligo-dT sequence. The majority of them (more than 90%, see Zheng et al., 2017) do not contain a cell at all and capture only ambient RNA, for example RNA released from lysed cells (see Zijian et al., 2020). This is why cell-calling methods such as emptyDrops (Lun et al., 2019) and CB2 are needed to distinguish real cells from background. Hope this helps.
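
As a rough sanity check, several of the summary fields in the inspect.json from the question can be reproduced from the raw counts. The sketch below uses only the values posted above; the relationships it tests (for example, that meanReadsPerBarcode is numReads divided by numBarcodes) are inferred from the arithmetic and are an assumption, not something taken from the bustools documentation.

```python
import json

# Values copied verbatim from the inspect.json posted in the question.
inspect = json.loads("""
{
"numRecords": 18768851,
"numReads": 39738410,
"numBarcodes": 518890,
"medianReadsPerBarcode": 3.000000,
"meanReadsPerBarcode": 76.583496,
"numUMIs": 8050936,
"numBarcodeUMIs": 13514564,
"medianUMIsPerBarcode": 1.000000,
"meanUMIsPerBarcode": 26.045143,
"gtRecords": 7713525,
"numBarcodesOnWhitelist": 239530,
"percentageBarcodesOnWhitelist": 46.162000,
"numReadsOnWhitelist": 38105469,
"percentageReadsOnWhitelist": 95.890774
}
""")

# Assumed relationships (inferred from the numbers, not from documentation).
mean_reads = inspect["numReads"] / inspect["numBarcodes"]
mean_umis = inspect["numBarcodeUMIs"] / inspect["numBarcodes"]
pct_barcodes_wl = 100.0 * inspect["numBarcodesOnWhitelist"] / inspect["numBarcodes"]
pct_reads_wl = 100.0 * inspect["numReadsOnWhitelist"] / inspect["numReads"]

print(f"mean reads per barcode:    {mean_reads:.4f}")
print(f"mean UMIs per barcode:     {mean_umis:.4f}")
print(f"% barcodes on whitelist:   {pct_barcodes_wl:.3f}")
print(f"% reads on whitelist:      {pct_reads_wl:.3f}")
```

The computed values agree with the reported ones. Note also that the median reads per barcode (3) is far below the mean (about 76.6), so most barcodes carry only a handful of reads, which fits the picture of many near-empty droplets alongside a small number of real cells.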

1

Correct, which is why you should load your output into R or Python and filter your cells (i.e. make a "knee plot" and pick a cutoff from it). See the kallisto | bustools tutorials for more details.

You don't really need to use that file to check performance/QC (especially since your output is unfiltered); you're better off loading your output into R or Python, filtering your cells, and doing QC from there.
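
For anyone unfamiliar with the idea, here is a minimal sketch of the knee-plot filtering step in plain Python. It runs on synthetic per-barcode UMI totals, not on real kb-python output; in practice those totals would come from summing the count matrix per barcode. The `min_umis` cutoff and the synthetic count ranges are hypothetical values picked purely for illustration.

```python
import random

random.seed(0)

# Synthetic stand-in for per-barcode UMI totals (NOT real data):
# ~1,000 "cells" with thousands of UMIs, plus many near-empty droplets.
cells = [random.randint(2000, 10000) for _ in range(1000)]
empty = [random.randint(1, 10) for _ in range(50000)]
umis_per_barcode = cells + empty

# Knee-plot coordinates: barcodes ranked by descending UMI total.
ranked = sorted(umis_per_barcode, reverse=True)
ranks = list(range(1, len(ranked) + 1))

# Hypothetical cutoff read off the knee by eye; tune it per dataset.
min_umis = 500
kept = [n for n in umis_per_barcode if n >= min_umis]

print(f"{len(kept)} of {len(umis_per_barcode)} barcodes pass the cutoff")
# Plotting `ranked` against `ranks` on log-log axes gives the knee plot;
# the sharp drop marks the transition from cells to background.
```

On real data the drop is less clean than in this toy example, which is why dedicated methods such as emptyDrops exist.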

23 months ago by dsull ★ 5.8k