Question: Demultiplexing fastq file on Identifier line allowing mismatch
5.0 years ago, pat2402 wrote:


I have a multiplexed fastq file that contains reads like the following:

@HISEQ:55:H76W4HIWA:1:1101:3414:2138 1:N:0:BC1:BC2:BC3
@HISEQ:55:H76W4HIWA:1:1101:6230:2144 1:N:0:BC1:BC2:BC3

I have quasi paired-end sequencing data, but the second read contains only two barcodes (BC2 and BC3). I therefore transferred BC2 and BC3 from read 2 to the header of read 1 (together with BC1, which is part of the read 1 sequence). I want to demultiplex this file by the barcodes (e.g. "BC1:BC2") in the identifier line. The barcodes are known, but I need to demultiplex the fastq file while allowing one mismatch for BC1 and BC2. I tried fastq-grep, but unfortunately it's not possible to allow a mismatch. Do you have any suggestions?

I would be grateful for any kind of help. Thank you.

P.S. I can also change the delimiters between the barcodes.


You can demultiplex FASTQ files while allowing mismatches in the barcodes with the tool TagDust 2, but by design it will not let you control the exact number of mismatches. (This is why I am posting this as a comment rather than as an answer.) You can find a benchmark comparing it with other tools in its publication.

modified 4 months ago by RamRS • written 5.0 years ago by Charles Plessy

Doesn't having the barcodes in the ID line also mean that the data has already been demultiplexed and the barcode information is not actually present in the data? When the Casava pipeline (which produced this data) is run, you have the choice of specifying the number of mismatches.

written 5.0 years ago by Istvan Albert

The reads are multiplexed. I will edit my post to make my problem a little clearer.

written 5.0 years ago by pat2402

If you give some details about your experiment, it would be easier to tell whether you have demultiplexed data or not. Usually, if it's Illumina data, the Casava pipeline will already have been run on your data. Confirm with your sequencing facility.

written 5.0 years ago by geek_y

5.0 years ago, RamRS (Houston, TX) wrote:

Disclaimer: I know nothing about multiplexing; I'm addressing this as a string manipulation problem.

This might be addressed by framing a regex for fastq-grep that allows for one mismatch and one mismatch alone. I'm assuming you're looking at one possible mismatch for each of the two barcodes.

I'll address dealing with one, and then we can look at combining two of these.

Let's say your barcode is ATCACG. You wish to allow one mismatch. The possible barcodes then are:


and the cumulative expression is:


And if the two barcodes are separated by a :, you can make it work by joining two such expressions with a [:].
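As a rough illustration of the approach above, here is a minimal Python sketch that builds the one-mismatch variants for a barcode, combines them into a single expression, and joins two such expressions with a [:]. The function names, the [ACGTN] character class, the second barcode CGATGT, and the assumption that the header ends in BC1:BC2:BC3 are illustrative choices, not part of the original answer. Whether fastq-grep accepts the resulting expression, and how to make it search the ID line, should be checked against its documentation.

```python
import re

# Assumption: a mismatched position may hold any base call, including N.
BASES = "[ACGTN]"


def one_mismatch_variants(barcode):
    """One pattern per position, with that position replaced by a wildcard class."""
    return [barcode[:i] + BASES + barcode[i + 1:] for i in range(len(barcode))]


def one_mismatch_pattern(barcode):
    """Alternation of all variants: matches the barcode with at most one substitution."""
    return "(" + "|".join(one_mismatch_variants(barcode)) + ")"


def header_pattern(bc1, bc2):
    """Expression for a header assumed to end in ':BC1:BC2:BC3'."""
    return ":" + one_mismatch_pattern(bc1) + "[:]" + one_mismatch_pattern(bc2) + ":[^:]+$"


if __name__ == "__main__":
    for variant in one_mismatch_variants("ATCACG"):
        print(variant)
    print(header_pattern("ATCACG", "CGATGT"))
```

Because each variant's character class also contains the original base, the exact barcode matches as well, so the expression accepts zero or one mismatch per barcode.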

Let me know if any of my assumptions is mistaken.


I think this will work, but it would be interesting to know whether it is performant at the scale of large fastq files. Regular expressions can exhibit big variations in performance: different patterns with identical effects can run at very different speeds.

I think tools such as cutadapt and Trimmomatic could also be used, run separately for each adapter. Mothur also has an adapter splitting and filtering command. There are also some dedicated tools for this (although lately, since Casava performs the job as well, these tools have fallen off the radar).
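On the performance point, one alternative to a regular expression is to compare the barcode fields of each header directly by Hamming distance. The sketch below is only an illustration under stated assumptions: the header layout is taken from the question (barcodes in the 4th and 5th colon-separated fields after the space), records are assumed to be exactly four lines each, and all function names are hypothetical. It has not been benchmarked against the regex approach.

```python
def hamming(a, b):
    """Number of differing positions; unequal lengths count as fully different."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def closest(observed, known, max_mismatch=1):
    """Return the single known barcode within max_mismatch, or None if none or ambiguous."""
    hits = [k for k in known if hamming(observed, k) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None


def assign(header, bc1_known, bc2_known, max_mismatch=1):
    """Map a header like '@... 1:N:0:BC1:BC2:BC3' to 'BC1:BC2' using known barcodes."""
    fields = header.rstrip().split(" ")[-1].split(":")
    bc1 = closest(fields[3], bc1_known, max_mismatch)
    bc2 = closest(fields[4], bc2_known, max_mismatch)
    if bc1 is None or bc2 is None:
        return None
    return bc1 + ":" + bc2


def demultiplex(lines, bc1_known, bc2_known, max_mismatch=1):
    """Yield (sample_key, record) pairs; record is the list of 4 FASTQ lines."""
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            yield assign(record[0], bc1_known, bc2_known, max_mismatch), record
            record = []
```

A caller could iterate over an open file handle and write each record to an output file chosen by its key, sending records with a key of None to an "undetermined" file.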

written 5.0 years ago by Istvan Albert

Thank you very much for your reply. I will try this cumulative expression in combination with fastq-grep. I'm also curious to see how this combination will perform.

written 5.0 years ago by pat2402