Hi all -
I am working with a dataset of FASTQ files generated from standard 10x scRNA-seq, where each FASTQ file contains a pool of 2-3 different samples and ~5-10k cells. What I would like to do is demultiplex the original FASTQ files into files that each contain only one sample. The layout of the reads is:
I1: Pool barcode (8bp)
R1: Cell barcode (10bp) + UMI (16bp)
R2: Template (100bp)
I have a sample tracking key that links each cell barcode to a specific sample, but unfortunately there is no single barcode shared by all cells of a given sample. So what I need is a workflow that can demultiplex and condense reads based on a group of barcodes (i.e. all cell barcodes that belong to a given sample).
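To make the goal concrete, here is a minimal Python sketch of the routing I have in mind, not a tested pipeline. It assumes the tracking key is a two-column TSV (cell barcode, sample), that the cell barcode is the first 10 bases of R1 as in the layout above, and that barcodes match the key exactly (no error correction). The function names and the output naming scheme are hypothetical, and I1 is not handled.

```python
import csv
import gzip
import os
from contextlib import ExitStack


def open_maybe_gzip(path, mode="rt"):
    """Open a plain or gzipped text file, chosen by file extension."""
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)


def load_key(key_path):
    """Read a two-column TSV (cell_barcode, sample) into a dict (assumed format)."""
    with open(key_path, newline="") as fh:
        return {row[0]: row[1] for row in csv.reader(fh, delimiter="\t") if row}


def read_fastq(handle):
    """Yield FASTQ records as tuples of four raw lines."""
    while True:
        record = tuple(handle.readline() for _ in range(4))
        if not record[0]:
            return
        yield record


def demux_by_sample(r1_path, r2_path, barcode_to_sample, out_dir, cb_len=10):
    """Write each R1/R2 pair to its sample's files; return read counts per sample."""
    os.makedirs(out_dir, exist_ok=True)
    counts = {}
    with ExitStack() as stack:
        r1 = stack.enter_context(open_maybe_gzip(r1_path))
        r2 = stack.enter_context(open_maybe_gzip(r2_path))
        # One R1/R2 output pair per sample, so only a handful of files are open.
        outs = {}
        for sample in set(barcode_to_sample.values()):
            outs[sample] = tuple(
                stack.enter_context(
                    open(os.path.join(out_dir, f"{sample}_{read}.fastq"), "w")
                )
                for read in ("R1", "R2")
            )
        # R1 and R2 are assumed to list the same reads in the same order.
        for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
            cell_barcode = rec1[1][:cb_len]
            sample = barcode_to_sample.get(cell_barcode)
            if sample is None:
                continue  # barcode not in the key: read is dropped
            outs[sample][0].writelines(rec1)
            outs[sample][1].writelines(rec2)
            counts[sample] = counts.get(sample, 0) + 1
    return counts
```

Something like `demux_by_sample("pool_R1.fastq.gz", "pool_R2.fastq.gz", load_key("key.tsv"), "by_sample")` would then write one R1/R2 pair per sample instead of one per cell (the file names here are placeholders).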
I've had some success with fgbio, which lets me demultiplex into individual per-cell files, name each file according to the sample it belongs to, and then concatenate based on naming patterns. However, with this approach a single pool FASTQ generates thousands of cell-specific demultiplexed FASTQs, which creates significant I/O issues.
Has anyone encountered alternative tools that can accomplish this in a more sophisticated way, short of writing something up from scratch?