Hey everyone,
I was hoping to get some community insight into a confusing situation we're facing with our single-cell data and could use some suggestions.
Our lab works with non-model organisms (mainly pig tissues) and recently started using Fluentbio's Pipseq for our scRNA-seq experiments. Fluent provided a standalone program, PipSeeker, for generating the indices for downstream analysis. Illumina then acquired Fluent and decided to kill PipSeeker and push DRAGEN.
We recently sequenced several pig organ samples and analysed the FASTQs using the original PipSeeker pipeline. Here are some stats: Reads Mapped with PipSeeker: ~75%; Cells Detected with PipSeeker: ~5,000.
We sent the same files to the Illumina support team for troubleshooting. They re-analysed our data using their new, proprietary DRAGEN platform, which has effectively replaced PipSeeker. Their report showed drastically different numbers: Reads Mapped: >90%; Cells Detected: ~15,000. That's a big difference between the two programs.
When we asked for a technical explanation for this massive difference, support was vague. They just said that "DRAGEN uses a new and improved algorithm" and encouraged us to subscribe to the paid service after our 30-day trial ends.
This feels like a black box. We can't tell if the ~10,000 extra cells are real, high-quality cells that pipseeker missed, or if they are low-quality droplets, artifacts, or doublets that DRAGEN's new algorithm is failing to filter out. It's become a trust issue because we can't validate the output or understand the fundamental change in results.
Some details and some more questions
I'm trying to build a more transparent, open-source pipeline to understand what's going on, but the Pipseq barcode structure is quite complex: P(1-3bp) + Tier1(8bp) + ATG(3bp) + Tier2(6bp) + GAG(3bp) + Tier3(6bp) + TCGAG(5bp) + Tier4(8bp) + BinningIndex(3bp)
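To make the layout above concrete, here is a minimal Python sketch that splits a barcode read into its tiers by anchoring on the fixed linkers (ATG, GAG, TCGAG). It only illustrates the structure as I wrote it out; the function name and the returned field names are made up for illustration, it requires exact linker matches, and it does no whitelist-based error correction (I am not assuming anything about the whitelists here). Treating the four tiers joined together as the cell barcode is also an assumption.

```python
import re

# Layout from the post:
# P(1-3bp) + Tier1(8bp) + ATG + Tier2(6bp) + GAG + Tier3(6bp) + TCGAG + Tier4(8bp) + BinningIndex(3bp)
PIPSEQ_LAYOUT = re.compile(
    r"^(?P<phase>[ACGTN]{1,3})"   # variable-length phase block
    r"(?P<tier1>[ACGTN]{8})ATG"   # tier 1, then fixed linker
    r"(?P<tier2>[ACGTN]{6})GAG"   # tier 2, then fixed linker
    r"(?P<tier3>[ACGTN]{6})TCGAG" # tier 3, then fixed linker
    r"(?P<tier4>[ACGTN]{8})"      # tier 4
    r"(?P<bin_index>[ACGTN]{3})"  # binning index
)

def parse_pipseq_barcode(read_seq):
    """Return the barcode parts as a dict, or None if the linkers are not found.

    The regex tries the longest phase block first and backtracks to shorter
    ones, so a read whose linkers happen to line up at more than one offset
    is resolved in favour of the longest phase. Exact matches only.
    """
    m = PIPSEQ_LAYOUT.match(read_seq.upper())
    if m is None:
        return None
    parts = m.groupdict()
    # Assumption: the cell barcode is the four tiers joined together (28 bp).
    parts["cell_barcode"] = (
        parts["tier1"] + parts["tier2"] + parts["tier3"] + parts["tier4"]
    )
    return parts

if __name__ == "__main__":
    # Synthetic read built to follow the layout, with a 2 bp phase block.
    example = "AC" + "AAAACCCC" + "ATG" + "GGGTTT" + "GAG" + "CCCAAA" + "TCGAG" + "TTTTGGGG" + "CAT"
    print(parse_pipseq_barcode(example))
```

Once the tiers are pulled out into one contiguous string, the reads could in principle be rewritten into a fixed-position barcode format that generic tools accept, but I have not verified that end to end.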
I'd be grateful for any advice on the following:
Has anyone else using Pipseq seen such a huge jump in performance when moving from PipSeeker to DRAGEN?
Does a 3x increase in cell detection from a software update alone seem plausible, or does this raise red flags for you, too?
What specific QC metrics should we examine (e.g., comparing knee plots, UMI counts, or gene distributions) to determine if these additional cells from DRAGEN are legitimate?
Do you know of any open-source tools (STARsolo, Kallisto/bustools, etc.) that can be configured to handle this kind of complex, tiered barcode structure?
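For the QC question above, one simple first check would be to split the called barcodes into "called by both", "DRAGEN only" and "PipSeeker only", and compare UMI depth per group. The sketch below uses only the Python standard library; the function name and the inputs (two collections of called barcodes plus a barcode-to-UMI-count mapping) are hypothetical, and it assumes both tools report barcodes in the same string format, which would need checking first.

```python
from statistics import median

def compare_cell_calls(pipseeker_cells, dragen_cells, umi_counts):
    """Summarise cell calls shared between, or unique to, the two tools.

    pipseeker_cells, dragen_cells: iterables of called cell barcodes.
    umi_counts: dict mapping barcode -> total UMI count (from either tool's
    raw matrix; which one to use is a choice left to the analyst).
    """
    pip, dra = set(pipseeker_cells), set(dragen_cells)
    groups = {
        "shared": pip & dra,
        "dragen_only": dra - pip,
        "pipseeker_only": pip - dra,
    }
    summary = {}
    for name, barcodes in groups.items():
        # Barcodes missing from umi_counts are skipped for the median only.
        counts = [umi_counts[bc] for bc in barcodes if bc in umi_counts]
        summary[name] = {
            "n_cells": len(barcodes),
            "median_umis": median(counts) if counts else None,
        }
    return summary

if __name__ == "__main__":
    # Toy numbers, purely illustrative.
    umis = {"A": 5000, "B": 4000, "C": 300, "D": 200}
    print(compare_cell_calls({"A", "B"}, {"A", "B", "C", "D"}, umis))
```

If the DRAGEN-only group turns out to have a much lower median UMI count than the shared group, that would suggest those barcodes sit below the threshold PipSeeker used on the knee plot. On its own that does not show whether they are real cells or empty droplets; genes per cell and mitochondrial fraction for the same groups would be the obvious next comparison.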
We feel stuck between a free tool that might be underperforming and an expensive, opaque tool that gives us numbers that seem almost too good to be true.
Thanks in advance for any help or suggestions!
Is this within the stated spec for the kit? If not, you have every reason to be skeptical.
It is not the platform per se but the software/algorithm that is running on it. I assume this software is not available to download/use outside DRAGEN.
This structure looks like SPLITseq; there is a prior thread about it (Using Kallisto with unsupported single tech (Edited)). If there is no current support for this, a solution may eventually come along.
Was PipSeeker an open source package? It looks like it has already been deprecated, so it is not something you could plan to keep using in the future. You are having to replace one closed source package with another. Not ideal, but there is not much you can do if no alternative open source solution is available.
Pipseeker wasn't an open source project; it was a standalone software like cellranger. But now with the acquisition, Illumina wants everybody to use DRAGEN and use the cloud and pay for it all the same.
DRAGEN is a suite of tools, so yeah, that's that.
I have been in correspondence with Illumina tech support, and so far they haven't been able to explain the discrepancy, other than saying that DRAGEN uses a newer method employing hash tables and does not use STAR, unlike PipSeeker. But in my opinion the % reads mapped shouldn't vary so much. Anyway, if I do manage to get to the bottom of it, I will post an update.
"DRAGEN uses a newer method employing hashtables" you mean the reference builder?
I think for mapping to the reference genome
Are you using the DRAGEN 4.4 release for your workflow?
Your best bet would be to write to Illumina tech support, who might give you a better answer about this discrepancy.
Cross-posted to: https://www.reddit.com/r/bioinformatics/comments/1mue6ht/huge_discrepancy_between_pipseeker_dragen_for/
If you get a satisfactory answer elsewhere on other forums, please come back and post that here to provide closure to this thread.
Sadly, I have yet to get a satisfactory answer.