DropletQC has been an alternative way to filter empty droplets, using droplet nuclear fraction summarized from the CellRanger-specific RE:A bam tags, plotted against log-UMI counts, to make the empty droplets stand out. As far as I'm concerned, DropletQC results in equal or better cell calling than DropletUtils::emptyDrops, especially on the high-UMI, low-nuclear fraction end.
Here is an output from identify_empty_drops using CellRanger bam as a source. Alternatively, if you have Velocyto matrices from an RNA velocity analysis, you can use them too, to derive the nuclear fraction number.
I have run the same sample through STARSolo, and here is what you may find obstructive when using the Velocyto output from STARSolo. If you'd like to extend the UMI threshold lower like I do, you will find the lowermost rows are discretized. This makes DropletQC's density estimation algorithm generate a lot of ripples and become trapped at a local minimum.
In trying to fix the issue (github), I discovered a pretty neat relationship.
You see, in the scatter plots, there seems to be a curve that can delineate the two groups, from upper left to lower right. Fortunately, when I switch the plot to log-log axes, it becomes clear that a slanted line runs through the gap between the two.
By fixing the slope to -1, I can write this linear formula
log(y) = -log(x) + intercept
which rearranges to
xy = constant.
And since x, the nuclear fraction, equals Unspliced count over UMI count, and y is the UMI count, we find the gap line can be expressed simply as
Unspliced = constant.
And so, we can plot log(Unspliced count) to log(UMI count), and get a vertical gap instead.
This suggests that our empty droplets can be delineated by a threshold on the unspliced count. Isn't that neat? I hope we can continue to find more cases where this holds.
But one caveat. If you work from CellRanger bam tags, there's a chance that this won't hold. In DropletQC's nuclear_fraction_tags and nuclear_fraction_annotation functions, they forgot to account for duplicated molecules, so the nuclear fraction calculated is based on all the reads, not Unique Molecular counts. I believe this has to be fixed, for a UMI-UMI plot makes more sense than if one side is not UMI. I'm also showing the upper bound of Unspliced counts for each UMI level on the previous plot and the next, which is exceeded in the next one because it uses the number of sequencing reads rather than unique molecular counts. (For this I have to multiply nuclear fractions by read counts from molecule_info.h5, which has removed some reads from resolving UMI clashes; thus the recovered unspliced counts are hardly whole numbers.)
Update
The unspliced threshold isn't necessarily constant, or a vertical line. For example, in the 10X13_4 sample (E14.5 mouse dorsal forebrain, La Manno et al. 2021), a slope-2 line fits through the gap.
Here is a code snippet that makes the plot from a list of Velocyto matrices.
library(Matrix)
DropletQC <- function(velocyto, lower = 30, count.ambiguous = TRUE) {
## velocyto: list(spliced, unspliced, ambiguous)
umi.table <- sapply(velocyto, colSums)
umi.count <- rowSums(umi.table)
unspliced <- umi.table[, 2] + if (count.ambiguous) umi.table[, 3] else 0
cbind(unspliced, umi.count)[umi.count > lower, ]
}
## usage
x <- DropletQC(vel.list)
plot(x, log = "xy")
abline(a = 0, b = 1)
Say you find a line that goes through the gap,
y = bx - a
Then the passing condition is
Unspliced^b > UMI * 10^a.