Question: What is the fastest way to extract reads with a specific barcode from a FASTQ file?
b10hazard wrote:

Illumina's bcl2fastq tool generates fastqs for barcodes that were not specified in the sample sheet. The files are named:

  • Undetermined_S0_L001_R1_001.fastq.gz
  • Undetermined_S0_L001_R2_001.fastq.gz
  • Undetermined_S0_L001_I1_001.fastq.gz

So there is a fastq for read1, read2, and the barcode read (index1), and all three are in the same order. My question is: what is the fastest way to get the reads for a specific barcode out of these files? The best thing I can come up with is to iterate through them in Python and check the index fastq for the barcode I want. A sketch would be something like...

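import gzip

barcode_of_interest = 'AGAGAGAG'
reads_of_interest = list()
paths = ['Undetermined_S0_L001_%s_001.fastq.gz' % r for r in ('R1', 'R2', 'I1')]
r1, r2, i1 = (gzip.open(p, 'rt') for p in paths)
# A FASTQ record is 4 lines; zip(*[f] * 4) groups each file's lines into records
for read1, read2, index1 in zip(zip(*[r1] * 4), zip(*[r2] * 4), zip(*[i1] * 4)):
    if index1[1].strip() == barcode_of_interest:  # line 2 of a record is the sequence
        reads_of_interest.append((read1, read2))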

This could work, but what if I wanted to do this faster? Is there any way to index the read1 and read2 files in advance and use the positions in the index fastq to make extracting specified barcodes faster? Does faidx do this? Or is there any other tool out there that can do this faster than Python?
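For reference, the loop above could be written as a streaming filter that writes matching pairs straight to disk instead of collecting them in a list, so memory use stays flat on large runs. This is only a minimal sketch: the function names `fastq_records` and `extract_barcode` and the output paths are made up for illustration, and it assumes the three files hold the same records in the same order and that the barcode must match exactly (no mismatches allowed).

```python
import gzip
from itertools import islice


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (header, sequence, '+', quality)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:  # end of file (or a truncated trailing record)
            return
        yield record


def extract_barcode(r1_path, r2_path, i1_path, barcode, out1_path, out2_path):
    """Write read pairs whose index read equals `barcode`; return the pair count."""
    kept = 0
    with gzip.open(r1_path, "rt") as r1, \
            gzip.open(r2_path, "rt") as r2, \
            gzip.open(i1_path, "rt") as i1, \
            gzip.open(out1_path, "wt") as out1, \
            gzip.open(out2_path, "wt") as out2:
        records = zip(fastq_records(r1), fastq_records(r2), fastq_records(i1))
        for rec1, rec2, idx in records:
            # idx[1] is the sequence line of the index read
            if idx[1].strip() == barcode:
                out1.writelines(rec1)
                out2.writelines(rec2)
                kept += 1
    return kept
```

It would be called as `extract_barcode(r1, r2, i1, 'AGAGAGAG', 'AGAGAGAG_R1.fastq.gz', 'AGAGAGAG_R2.fastq.gz')`. It still decompresses and scans all three files once, so it does not answer the indexing part of the question.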

modified 4 weeks ago • written 4 weeks ago by b10hazard

Try demuxbyname.sh from the BBMap suite.

$ demuxbyname.sh

Written by Brian Bushnell
Last modified May 1, 2019

Description:  Demultiplexes sequences into multiple files based on their names,
substrings of their names, or prefixes or suffixes of their names.
Opposite of muxbyname.
This will crash if the number of open file handles is too high (typically over 200 or so, depending on the system).
In that case, please use demuxbyname2.sh which is slightly slower but only writes to 1 file at a time.

Usage:
demuxbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string...>

Something along the lines of:

$ demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT,TAAGGCGA,... outu=filename
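To illustrate what matching on read names means here: assuming the headers written by bcl2fastq end with the index sequence after the last colon (for example `... 1:N:0:AGAGAGAG`), a name-based filter only needs the header line and never has to open the I1 file. The sketch below is plain Python, not a reimplementation of demuxbyname.sh, and the function names `header_barcode` and `filter_by_header` are hypothetical.

```python
import gzip
from itertools import islice


def header_barcode(header):
    """Return the text after the last ':' of a FASTQ header line."""
    return header.rstrip().rsplit(":", 1)[-1]


def filter_by_header(in_path, out_path, barcode):
    """Copy records whose header ends with `barcode`; return how many were kept."""
    kept = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        while True:
            record = list(islice(src, 4))  # one FASTQ record is 4 lines
            if len(record) < 4:
                break
            if header_barcode(record[0]) == barcode:
                dst.writelines(record)
                kept += 1
    return kept
```

Running it once on the R1 file and once on the R2 file with the same barcode should keep the pairs in sync, provided both files carry the same index sequence in their headers.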
modified 4 weeks ago • written 4 weeks ago by genomax

There is also a previously posted solution that uses the deML program: "A: Demultiplexing Illumina data".

written 4 weeks ago by genomax