Question: Demultiplexing 10x fastq files by GROUPS of barcodes
bds1217 wrote (6 weeks ago):

Hi all -

I am working with a dataset of fastq files generated from standard 10x scRNA-seq, where each fastq file contains a pool of 2-3 different samples and ~5-10k cells. What I would like to do is demultiplex the original fastq files into files that contain only 1 sample each. The layout of the reads is:

I1: Pool barcode (8bp)
R1: Cell barcode (10bp) + UMI (16bp)
R2: Template (100bp)

I have a sample tracking key that links each cell barcode to a specific sample, but unfortunately there is no single barcode that corresponds to all cells of a given sample. So what I need is a workflow that can demultiplex and condense reads based on a group of barcodes (i.e. all cell barcodes that belong to a given sample).

I've had some success with DemuxFastqs in fgbio, which lets me demultiplex into individual per-cell files, name them according to the sample they belong to, and then concatenate them based on naming patterns. However, with this approach a single pool fastq generates thousands of cell-specific demultiplexed fastqs, which creates significant I/O issues.

Has anyone encountered alternative tools that can accomplish this in a more sophisticated way, short of writing something up from scratch?

Thanks!
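
For reference, the grouping described in the question can be sketched as a single streaming pass that writes one R1/R2 pair of files per sample instead of one per cell. This is only a minimal sketch under stated assumptions, not a tested pipeline: the key is assumed to be a tab-separated file of cell barcode and sample name, the cell barcode is assumed to be the first `barcode_len` bases of R1 (10 here, following the layout above), matching is exact with no error correction, and the function names and output naming are made up for illustration.

```python
import gzip
import itertools


def load_key(path):
    """Read a tab-separated key (cell_barcode<TAB>sample) into a dict."""
    barcode_to_sample = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            barcode, sample = line.split("\t")[:2]
            barcode_to_sample[barcode] = sample
    return barcode_to_sample


def fastq_records(handle):
    """Yield FASTQ records as lists of four lines."""
    while True:
        record = list(itertools.islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_by_sample(r1_path, r2_path, barcode_to_sample, out_prefix, barcode_len=10):
    """Route each R1/R2 pair to its sample's files; return read-pair counts."""
    outputs = {}
    counts = {}
    try:
        # One pair of output handles per sample (2-3 samples, not thousands of cells).
        for sample in sorted(set(barcode_to_sample.values())):
            outputs[sample] = (
                gzip.open(f"{out_prefix}{sample}_R1.fastq.gz", "wt"),
                gzip.open(f"{out_prefix}{sample}_R2.fastq.gz", "wt"),
            )
            counts[sample] = 0
        with gzip.open(r1_path, "rt") as r1, gzip.open(r2_path, "rt") as r2:
            for rec1, rec2 in zip(fastq_records(r1), fastq_records(r2)):
                # Assumption: the cell barcode is the start of the R1 sequence line.
                sample = barcode_to_sample.get(rec1[1][:barcode_len])
                if sample is None:
                    continue  # barcode not in the key: read pair is dropped
                out1, out2 = outputs[sample]
                out1.writelines(rec1)
                out2.writelines(rec2)
                counts[sample] += 1
    finally:
        for out1, out2 in outputs.values():
            out1.close()
            out2.close()
    return counts
```

Calling `split_by_sample("pool_R1.fastq.gz", "pool_R2.fastq.gz", load_key("key.tsv"), "demux_")` (hypothetical file names) would keep only a handful of files open at once, which sidesteps the per-cell I/O problem. Reads whose barcodes carry sequencing errors are silently discarded by this exact-match approach.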


You could separate the samples based on the Illumina index in the I1 file (there will be 4 per sample; you can find the combinations on the 10x support site). At that point you can use cellranger. Alternatively, use alevin, which is part of salmon, to do the entire processing.

modified 6 weeks ago • written 6 weeks ago by GenoMax

I may be misunderstanding, but in this case, I'm not sure that will work. The "sample" index barcodes were used to correspond to sample pools and the initial fastq demultiplexing broke up fastq files based on those pools. So while each fastq corresponds to a pool of multiple samples, each sample in a pool has the same sample index (i.e. I1 for each set of fastq files only has one unique index sequence in it).

written 6 weeks ago by bds1217

"each fastq file has a pool of 2-3 different samples and ~5-10k cells"

Your data files have already been demultiplexed using cellranger mkfastq but still contain 2-3 samples? Has some modification been made to the standard 10x protocol?

modified 6 weeks ago • written 6 weeks ago by GenoMax

Yes, I believe that is correct. Unfortunately, it's not data that I generated myself.

written 6 weeks ago by bds1217
GenoMax wrote (6 weeks ago):

I am not sure how this modification was done, since it sounds a bit mysterious. Anyway, you can take a look at umi-tools (LINK). I am not completely sure it will work.


Thanks, I'll check it out. I appreciate your time thinking about this.

written 6 weeks ago by bds1217

umi-tools actually worked great and was way easier than I was expecting. I thought I would need to use the regex function and build a super long query with all my barcodes, but thanks to the --whitelist argument I could simply supply a list of those cell barcodes as a text file instead. Running it once per sample, with that sample's barcode list, gives an output fastq whose reads are all specific to one of the samples in the pooled fastq, exactly as desired. Thanks!

written 6 weeks ago by bds1217
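
For anyone landing here later, the whitelist approach can be sketched roughly as follows. This is an illustrative sketch rather than the exact commands used in the thread: the key file is assumed to be tab-separated (cell barcode, then sample name), the helper names and output file names are hypothetical, and the barcode pattern assumes the 10 bp barcode + 16 bp UMI layout given in the question. The `umi_tools extract` options shown (`--bc-pattern`, `--stdin`, `--stdout`, `--read2-in`, `--read2-out`, `--whitelist`) are real, but it is worth checking `umi_tools extract --help` for your version, since some releases also expect `--filter-cell-barcode` for the whitelist to act as a filter.

```python
from collections import defaultdict
from pathlib import Path


def write_whitelists(key_path, out_dir):
    """Write one barcode-per-line whitelist per sample; return {sample: path}."""
    groups = defaultdict(list)
    with open(key_path) as handle:
        for line in handle:
            fields = line.strip().split("\t")
            if len(fields) >= 2 and not fields[0].startswith("#"):
                groups[fields[1]].append(fields[0])
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for sample, barcodes in groups.items():
        path = out_dir / f"{sample}_whitelist.txt"
        # De-duplicate and sort so the file is stable between runs.
        path.write_text("\n".join(sorted(set(barcodes))) + "\n")
        paths[sample] = path
    return paths


def extract_command(r1, r2, whitelist, out_prefix, barcode_len=10, umi_len=16):
    """Build (but do not run) a umi_tools extract call for one sample."""
    # C marks cell-barcode positions and N marks UMI positions in --bc-pattern.
    pattern = "C" * barcode_len + "N" * umi_len
    return [
        "umi_tools", "extract",
        f"--bc-pattern={pattern}",
        "--stdin", str(r1),
        "--stdout", f"{out_prefix}_R1.fastq.gz",
        "--read2-in", str(r2),
        "--read2-out", f"{out_prefix}_R2.fastq.gz",
        "--whitelist", str(whitelist),
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, once per sample in the pool. Note that `umi_tools extract` moves the barcode and UMI from the read sequence into the read name, so the output R1 is not a byte-for-byte subset of the input.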