Hi all -
I am working with a dataset of fastq files generated from standard 10x scRNA-seq, where each fastq file contains a pool of 2-3 different samples and ~5-10k cells. What I would like to do is demultiplex the original fastq files into files that contain only one sample each. The layout of the reads is:
I1: Pool barcode (8bp)
R1: Cell barcode (10bp) + UMI (16bp)
R2: Template (100bp)
I have a sample tracking key that links each cell barcode to a specific sample, but unfortunately there is no single barcode that corresponds to all cells of a given sample. So what I need is a workflow that can demultiplex and condense reads based on a group of barcodes (i.e. all cell barcodes that belong to a given sample).
I've had some success with DemuxFastqs in fgbio, which allows me to demultiplex into individual per-cell files, which I can name according to the sample they belong to and then concatenate based on naming patterns. However, with this approach a single pool fastq generates thousands of cell-specific demultiplexed fastqs, which creates significant I/O issues.
Has anyone encountered alternative tools that can accomplish this in a more sophisticated way, short of writing something up from scratch?
Thanks!
You could separate the samples based on the Illumina index (there will be 4 per sample; you can find the combinations on the 10x support site), which you can find in the I1 file. At that point you can use cellranger. Or just use alevin to do the entire processing. It is part of salmon.
I may be misunderstanding, but in this case, I'm not sure that will work. The "sample" index barcodes were used to correspond to sample pools, and the initial fastq demultiplexing broke up the fastq files based on those pools. So while each fastq corresponds to a pool of multiple samples, each sample in a pool has the same sample index (i.e. I1 for each set of fastq files only has one unique index sequence in it).
Your data files have already been demultiplexed using cellranger mkfastq but still contain 2-3 samples? Have you made some modification to the standard 10x protocol?
Yes, I believe that is correct. Unfortunately, it's not data that I generated myself.
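For the grouping step described in the question (many cell barcodes mapping to one sample), here is a minimal sketch in plain Python rather than an existing tool. It streams R1/R2 in lockstep and keeps only one pair of output files open per sample, so a pool produces 2-3 output pairs instead of thousands of per-cell files. Everything here is an assumption for illustration: the function names, the two-column TSV key (cell barcode, then sample, with no "-1" style suffix), the output naming, and that the cell barcode is the first 10 bases of R1 as listed in the read layout above. It only does exact barcode matches, with no error correction.

```python
import csv
import gzip
from contextlib import ExitStack

CB_LEN = 10  # assumed cell barcode length at the start of R1 (from the question)


def load_barcode_map(path):
    """Read a two-column TSV (cell_barcode<TAB>sample) into a dict."""
    with open(path, newline="") as handle:
        return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t") if row}


def read_fastq(handle):
    """Yield 4-line FASTQ records as tuples of strings."""
    while True:
        record = tuple(handle.readline() for _ in range(4))
        if not record[0]:  # empty string means end of file
            return
        yield record


def split_by_sample(r1_path, r2_path, barcode_to_sample, out_prefix):
    """Write one gzipped R1/R2 pair per sample.

    Returns (per-sample read-pair counts, number of unassigned read pairs).
    """
    samples = sorted(set(barcode_to_sample.values()))
    counts = {sample: 0 for sample in samples}
    unassigned = 0
    with ExitStack() as stack:
        r1 = stack.enter_context(gzip.open(r1_path, "rt"))
        r2 = stack.enter_context(gzip.open(r2_path, "rt"))
        # Only two output handles per sample stay open for the whole run.
        outputs = {
            sample: (
                stack.enter_context(gzip.open(f"{out_prefix}_{sample}_R1.fastq.gz", "wt")),
                stack.enter_context(gzip.open(f"{out_prefix}_{sample}_R2.fastq.gz", "wt")),
            )
            for sample in samples
        }
        for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
            # Exact lookup of the cell barcode taken from the start of R1.
            sample = barcode_to_sample.get(rec1[1][:CB_LEN])
            if sample is None:
                unassigned += 1
                continue
            out1, out2 = outputs[sample]
            out1.writelines(rec1)
            out2.writelines(rec2)
            counts[sample] += 1
    return counts, unassigned
```

With hypothetical file names, a call would look like `split_by_sample("pool1_R1.fastq.gz", "pool1_R2.fastq.gz", load_barcode_map("key.tsv"), "pool1")`. The unassigned count is worth checking: a high value could mean the key uses a different barcode format than the reads.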