Question: what is the fastest way to extract reads of a specific barcode from a fastq
15 months ago by
United States
b10hazard30 wrote:

Illumina's bcl2fastq tool generates fastqs for barcodes that were not specified in the sample sheet. The files are named:

  • Undetermined_S0_L001_R1_001.fastq.gz
  • Undetermined_S0_L001_R2_001.fastq.gz
  • Undetermined_S0_L001_I1_001.fastq.gz

So there is a fastq for read1, read2, and the barcode read (index1), and all three files list their reads in the same order. My question is: what is the fastest way to pull the reads for one specific barcode out of these files? The best thing I can come up with is to iterate through them in Python and check the index fastq for the barcode I want. In Python, something like:

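import gzip
def gzipreader(path):
    # Yield each FASTQ record as a tuple: (header, sequence, '+', quality)
    with gzip.open(path, 'rt') as fh:
        yield from zip(*[map(str.rstrip, fh)] * 4)
barcode_of_interest = 'AGAGAGAG'
reads_of_interest = list()
for read1, read2, index1 in zip(gzipreader('Undetermined_S0_L001_R1_001.fastq.gz'), gzipreader('Undetermined_S0_L001_R2_001.fastq.gz'), gzipreader('Undetermined_S0_L001_I1_001.fastq.gz')):
    if index1[1] == barcode_of_interest:  # index1[1] is the barcode sequence line
        reads_of_interest.append((read1, read2))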

This could work, but what if I wanted to do this faster? Is there any way to index the read1 and read2 files in advance and use the positions in the index fastq to make extracting specified barcodes faster? Does faidx do this? Or is there any other tool out there that can do this faster than Python?


There is also a previously posted solution here that uses the deML program: A: Demultiplexing Illumina data

written 15 months ago by genomax90k
15 months ago by
United States
genomax90k wrote: demuxbyname.sh from the BBMap suite.


Written by Brian Bushnell
Last modified May 1, 2019

Description:  Demultiplexes sequences into multiple files based on their names,
substrings of their names, or prefixes or suffixes of their names.
Opposite of muxbyname.
This will crash if the number of open file handles is too high (typically over 200 or so, depending on the system).
In that case, please use which is slightly slower but only writes to 1 file at a time.

Usage: demuxbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string...>

Something along the lines of:

$ demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT,TAAGGCGA,...
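
For anyone who wants to see what name-based matching amounts to, here is a minimal Python sketch of the same idea. It is not how the BBMap tool is implemented, and it will be slower. It assumes the read headers end with ":<index sequence>" (as in "1:N:0:AGAGAGAG", the form bcl2fastq normally writes for single-index runs), so the separate I1 file is not needed. The function names fastq_records and extract_by_header_barcode are made up for this example.

```python
import gzip
from itertools import islice


def fastq_records(handle):
    """Yield FASTQ records as lists of four raw lines (newlines kept)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def extract_by_header_barcode(r1_path, r2_path, out1_path, out2_path, barcode):
    """Write read pairs whose R1 header ends with ':<barcode>'; return the pair count.

    Assumption: single-index headers such as '@name 1:N:0:AGAGAGAG'.
    Dual-index headers end with 'I1+I2' and would need a different test.
    """
    suffix = ":" + barcode
    kept = 0
    with gzip.open(r1_path, "rt") as r1, gzip.open(r2_path, "rt") as r2, \
         gzip.open(out1_path, "wt") as o1, gzip.open(out2_path, "wt") as o2:
        # R1 and R2 are read in lockstep, so the pairing is preserved
        for rec1, rec2 in zip(fastq_records(r1), fastq_records(r2)):
            if rec1[0].rstrip().endswith(suffix):
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept
```

Calling extract_by_header_barcode('Undetermined_S0_L001_R1_001.fastq.gz', 'Undetermined_S0_L001_R2_001.fastq.gz', 'AGAGAGAG_R1.fastq.gz', 'AGAGAGAG_R2.fastq.gz', 'AGAGAGAG') would stream through both files once and write matches as it goes, so memory use stays flat regardless of file size. The output file names here are placeholders.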