Demultiplexing 10x fastq files by GROUPS of barcodes
3.4 years ago
bds1217 • 0

Hi all -

I am working with a dataset of fastq files generated from standard 10x scRNA-seq, where each fastq file contains a pool of 2-3 different samples and ~5-10k cells. What I would like to do is demultiplex the original fastq files into files that each contain only one sample. The layout of the reads is:

I1: Pool barcode (8bp)
R1: Cell barcode (10bp) + UMI (16bp)
R2: Template (100bp)

I have a sample tracking key that links each cell barcode to a specific sample, but unfortunately there is no single barcode shared by all cells of a given sample. So what I need is a workflow that can demultiplex and condense reads based on a group of barcodes (i.e. all cell barcodes that belong to a given sample).
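To make the grouping concrete, here is a minimal Python sketch of what such a workflow would do, not a tested pipeline. It assumes the tracking key is a two-column tab-separated file (cell barcode, sample name), that the cell barcode is the first 10 bases of R1 as in the layout above, and that barcodes match exactly (no error correction). The function names and the output naming scheme are hypothetical.

```python
import csv
import gzip
from itertools import islice


def load_barcode_map(key_path):
    """Read a two-column TSV (cell barcode, sample name) into a dict."""
    with open(key_path, newline="") as handle:
        return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t") if row}


def read_fastq(path):
    """Yield FASTQ records from a gzipped file as lists of four lines."""
    with gzip.open(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))
            if len(record) < 4:
                break
            yield record


def demux_by_group(r1_path, r2_path, barcode_to_sample, out_prefix, bc_len=10):
    """Write one R1/R2 pair per sample; return read-pair counts per sample."""
    outputs = {}  # sample name -> (R1 handle, R2 handle)
    counts = {}
    try:
        # R1 and R2 are assumed to list the same reads in the same order.
        for rec1, rec2 in zip(read_fastq(r1_path), read_fastq(r2_path)):
            barcode = rec1[1][:bc_len]  # cell barcode = start of the R1 sequence
            sample = barcode_to_sample.get(barcode)
            if sample is None:
                continue  # barcode not in the tracking key: drop the pair
            if sample not in outputs:
                outputs[sample] = (
                    gzip.open(f"{out_prefix}{sample}_R1.fastq.gz", "wt"),
                    gzip.open(f"{out_prefix}{sample}_R2.fastq.gz", "wt"),
                )
            out1, out2 = outputs[sample]
            out1.writelines(rec1)
            out2.writelines(rec2)
            counts[sample] = counts.get(sample, 0) + 1
    finally:
        for out1, out2 in outputs.values():
            out1.close()
            out2.close()
    return counts
```

Because output files are opened per sample rather than per cell, a pool of 2-3 samples only ever has a handful of files open at once.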

I've had some success with DemuxFastqs in fgbio, which lets me demultiplex into individual per-cell files, name each file according to the sample it belongs to, and then concatenate the files based on naming patterns. However, with this approach a single pool fastq generates thousands of cell-specific demultiplexed fastqs, which creates significant I/O issues.

Has anyone encountered alternative tools that can accomplish this in a more sophisticated way, short of writing something up from scratch?

Thanks!

RNA-Seq sequence assembly • 2.8k views

You could separate the samples based on the Illumina sample index, which is in the I1 file (there will be 4 index sequences per sample; you can find the combinations on the 10x support site). At that point you can use cellranger. Alternatively, just use alevin, which is part of salmon, to do the entire processing.


I may be misunderstanding, but in this case, I'm not sure that will work. The "sample" index barcodes were used to correspond to sample pools and the initial fastq demultiplexing broke up fastq files based on those pools. So while each fastq corresponds to a pool of multiple samples, each sample in a pool has the same sample index (i.e. I1 for each set of fastq files only has one unique index sequence in it).


each fastq file has a pool of 2-3 different samples and ~5-10k cells

Your data files have already been demultiplexed using cellranger mkfastq but still contain 2-3 samples? Have you made some modification to the standard 10x protocol?


Yes, I believe that is correct. Unfortunately, it's not data that I generated myself.

3.4 years ago
GenoMax 141k

I am not sure how this modification was done, since it sounds a bit mysterious. Anyway, you can take a look at umi-tools (LINK). I am not completely sure it will work.


Thanks, I'll check it out. I appreciate your time thinking about this.


umi-tools actually worked great and was way easier than I was expecting it to be. I thought I would need to use the regex function and build a super long query with all my barcodes, but because of the --whitelist argument, I could easily supply a list of those cell-barcodes as a text file instead. The reads in the output fastq would all be specific to one of the samples in the pooled fastq, exactly as desired. Thanks!
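For anyone trying the same thing, the whitelist files can be generated from the tracking key with a few lines of Python. This is only a sketch under assumptions: the key is already loaded as a dict of cell barcode to sample name, a plain one-barcode-per-line text file is accepted as a whitelist, and the barcode pattern follows the R1 layout from the question (10bp cell barcode, then 16bp UMI). The helper names and output file names are hypothetical. The second function only builds the command as a list of arguments and does not run it; the flags are written as I remember them from the umi_tools documentation, so check them against `umi_tools extract --help` for your installed version, since some versions may need an extra option to switch on barcode filtering.

```python
from collections import defaultdict
from pathlib import Path


def write_sample_whitelists(barcode_to_sample, out_dir):
    """Write one whitelist file per sample, one cell barcode per line."""
    groups = defaultdict(list)
    for barcode, sample in barcode_to_sample.items():
        groups[sample].append(barcode)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for sample, barcodes in groups.items():
        path = out_dir / f"{sample}_whitelist.txt"
        path.write_text("\n".join(sorted(barcodes)) + "\n")
        paths[sample] = path
    return paths


def extract_command(r1, r2, whitelist, sample, bc_len=10, umi_len=16):
    """Build (not run) a umi_tools extract command for one sample.

    Assumption: flags as recalled from the umi_tools docs; verify locally.
    """
    # C marks cell-barcode positions and N marks UMI positions in R1.
    pattern = "C" * bc_len + "N" * umi_len
    return [
        "umi_tools", "extract",
        f"--bc-pattern={pattern}",
        "--stdin", str(r1),
        "--stdout", f"{sample}_R1.fastq.gz",
        "--read2-in", str(r2),
        f"--read2-out={sample}_R2.fastq.gz",
        f"--whitelist={whitelist}",
    ]
```

Each sample then needs one pass over the pooled fastq pair with its own whitelist, so a pool of 2-3 samples means 2-3 runs.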
