Question

Demultiplex Illumina run using custom index configuration

4.2 years ago

alba.rodriguezmeira • 0

Hi all,

I am trying to demultiplex an Illumina run in which I have introduced barcode sequences in a custom configuration:

R1 - 6 bp (barcode 1) + 144 bp
R2 - 6 bp (barcode 2) + 144 bp

index read - i7 (8 bp)

I have used bcl2fastq to introduce the sequences of each barcode (barcode1+barcode2+i7) in the header of the read.

Bcl2fastq options:

Read1StartFromCycle,7,,,,,,
Read2StartFromCycle,7,,,,,,
Read1UMILength,6,,,,,,
Read2UMILength,6,,,,,,
Read1UMIStartFromCycle,1,,,,,,
Read2UMIStartFromCycle,1,,,,,,
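
As an illustration of what these settings ask for (a sketch of the intended effect, not bcl2fastq's actual implementation), the first 6 cycles of each read are treated as the inline barcode and the remaining 144 cycles as the read itself. The function name `split_inline_barcode` and the example read are hypothetical.

```python
from typing import Tuple


def split_inline_barcode(raw_read: str, barcode_len: int = 6) -> Tuple[str, str]:
    """Return (barcode, insert) for a read that starts with an inline barcode."""
    if len(raw_read) < barcode_len:
        raise ValueError("read is shorter than the barcode")
    # Cycles 1..barcode_len are the barcode; the read proper starts at the next cycle.
    return raw_read[:barcode_len], raw_read[barcode_len:]


# Hypothetical 150-cycle read: 6 barcode bases followed by 144 insert bases.
raw_r1 = "AACGGT" + "C" * 144
barcode1, insert = split_inline_barcode(raw_r1)
print(barcode1, len(insert))
```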

Fastq line example:

@M01913:344:000000000-CGVBP:1:1101:17206:1578:AACGGT+TCCTTA 1:N:0:CTAAGTCATG
CTTAACCCCTCCTCCCAGAGACCCCAGTTGCAAACCAGACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGAAAAGCGTTTCTGTCATCCAAATACTCCACACGCAAATTTCCTTCCACTCGGATAAGATGCTGAGGAGG
+
CCFFFFGGGGGGGGGGHHGHGHGHGGGHHHHHHHHGHHHGHHHGHHHGGGGGHHHHHHHGGGHHGHHHGHHHHHHHHGGGHHGGGGHHHHHHHHHHHHHHHHHHHHHGGGGGGHHHHHHHHHHHHHGGGGGGHHHHHHHFHGFFG

However, I can't find any suitable tool to further demultiplex these reads into individual fastq files corresponding to each unique barcode combination. Ideally, I would provide a sample sheet containing a sampleID and unique barcode combination (barcode1+barcode2+i7), and get individual fastq files named with the sampleID provided.

Any help/comments would be highly appreciated!

next-gen sequencing • 2.3k views

ADD COMMENT • link updated 4.2 years ago by GenoMax 141k • written 4.2 years ago by alba.rodriguezmeira • 0

You can try demuxbyname.sh from the BBMap suite. Run the program without any options to see the in-line help, give it a try, and see if you can figure this out. Otherwise I will do some more testing later when I have time.

$ demuxbyname.sh

Written by Brian Bushnell
Last modified Jan 7, 2020

Description:  Demultiplexes sequences into multiple files based on their names,
substrings of their names, or prefixes or suffixes of their names.
Allows unlimited output files while maintaining only a small number of open file handles.

Usage:
demuxbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string...>

Alternately:
demuxbyname.sh in=<file> out=<outfile> delimiter=whitespace prefixmode=f
This will demultiplex by the substring after the last whitespace.

demuxbyname.sh in=<file> out=<outfile> length=8 prefixmode=t
This will demultiplex by the first 8 characters of read names.

demuxbyname.sh in=<file> out=<outfile> delimiter=: prefixmode=f
This will split on colons, and use the last substring as the name; useful for
demuxing by barcode for Illumina headers in this format:
@A00178:73:HH7H3DSXX:4:1101:13666:1047 1:N:0:ACGTTGGT+TGACGCAT

ADD REPLY • link 4.2 years ago by GenoMax 141k

A second suggestion is to skip moving the inline barcodes into the fastq headers, by removing the bcl2fastq options you listed above.

Then use sabre (https://github.com/najoshi/sabre) to demultiplex the data.

This will definitely work.

ADD REPLY • link 4.2 years ago by GenoMax 141k

Hi!

Thanks so much for the response. I'll give demuxbyname.sh a try, but my feeling is that I'll first have to reformat the headers to get all the barcodes in the right position, rather than where they are at the moment:

@M01913:344:000000000-CGVBP:1:1101:17206:1578:AACGGT+TCCTTA 1:N:0:CTAAGTCATG

Reformatting to something like this might work:

@M01913:344:000000000-CGVBP:1:1101:17206:1578 1:N:0:CTAAGTCATG+AACGGT+TCCTTA

I am not super familiar with awk/sed, so I wouldn't know how to easily reformat the headers in that way. Any comments would be super welcome!
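
For reference, a minimal Python sketch of that header rewrite, as an alternative to awk/sed. It assumes every header looks like the example above (read name ending in `:barcode1+barcode2`, one space, then the `1:N:0:i7` comment); headers that do not match are passed through unchanged. The names `HEADER_RE`, `move_barcodes_to_end` and `reformat_fastq` are made up for this example.

```python
import re

# "<read name>:<barcode1>+<barcode2> <comment ending in the i7 sequence>"
HEADER_RE = re.compile(r"^(@\S+):([ACGTN]+)\+([ACGTN]+) (\S+)$")


def move_barcodes_to_end(header: str) -> str:
    """Move the inline barcodes from the read name to the end of the comment."""
    match = HEADER_RE.match(header)
    if match is None:
        return header  # leave unexpected headers untouched
    name, bc1, bc2, comment = match.groups()
    return f"{name} {comment}+{bc1}+{bc2}"


def reformat_fastq(lines):
    """Yield FASTQ lines, rewriting only the header (every fourth line)."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield move_barcodes_to_end(line.rstrip("\n")) + "\n"
        else:
            yield line


old = "@M01913:344:000000000-CGVBP:1:1101:17206:1578:AACGGT+TCCTTA 1:N:0:CTAAGTCATG"
print(move_barcodes_to_end(old))
# @M01913:344:000000000-CGVBP:1:1101:17206:1578 1:N:0:CTAAGTCATG+AACGGT+TCCTTA
```

To convert a whole file, the generator can be fed an open input file and its output written with `writelines`, for example `out.writelines(reformat_fastq(inp))`. With the barcodes at the end of the header, the `delimiter=: prefixmode=f` mode described in the help text above (last colon-separated substring) should then see the full three-part combination.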

Unfortunately, sabre only supports the same barcode in forward and reverse reads for paired-end sequencing so that wouldn't work in this case (my R1 and R2 barcodes are always different).

Thanks!

ADD REPLY • link 4.2 years ago by alba.rodriguezmeira • 0

score 2 · Accepted Answer · 2020-02-22

You should be able to use demuxbyname.sh this way. I made a small dummy file.

$ more dem.fq
@HISEQ:267:CAAV9ANXX:4:1101:10050:2218:AACGGT+TCCTTA 1:N:0:AGTCAA
GTGCGGTCGATATTTTGTATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+
BBBBB/B<F<B/<F/<BF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFBFFFFFFFFFFFFFBFBF<FFFFFFFFFFFFFF<BFFF/BF/FBF<B<<F/7BFBBFFFF/B/BF
@HISEQ:267:CAAV9ANXX:4:1101:10050:2219:AATTGT+TCGGTA 1:N:0:AGTCAA
GTGCGGTCGATATTTTGTATCTTGGGGCCCCAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+
BBBBB/B<F<B/<F/<BF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFBFFFFFFFFFFFFFBFBF<FFFFFFFFFFFFFF<BFFF/BF/FBF<B<<F/7BFBBFFFF/B/BF
@HISEQ:267:CAAV9ANXX:4:1101:10050:2220:TTCGGT+GGCTTA 1:N:0:AGTCAA
GTGCGGTCGATATTTTGTATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+
BBBBB/B<F<B/<F/<BF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFBFFFFFFFFFFFFFBFBF<FFFFFFFFFFFFFF<BFFF/BF/FBF<B<<F/7BFBBFFFF/B/BF

$ demuxbyname.sh -Xmx5g in=dem.fq out=out_%.fq delimiter=: column=8

$ ls out*
out_AACGGT+TCCTTA 1.fq  out_AATTGT+TCGGTA 1.fq  out_TTCGGT+GGCTTA 1.fq

This method has the unfortunate effect of introducing a space in the file names: with delimiter=: the eighth column (where your UMIs are) runs up to the next colon, so it also picks up the space and the read number that follow the barcode pair (for example "AACGGT+TCCTTA 1").
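
The origin of that space can be seen by splitting one of the dummy headers on colons. Here Python's `str.split` stands in for the column selection; it only illustrates the idea and is not how demuxbyname.sh is implemented.

```python
header = "@HISEQ:267:CAAV9ANXX:4:1101:10050:2218:AACGGT+TCCTTA 1:N:0:AGTCAA"

fields = header.split(":")
column_8 = fields[7]  # column=8 is 1-based, so index 7 here
print(repr(column_8))
# 'AACGGT+TCCTTA 1'

# Dropping everything from the first space onwards leaves just the barcode pair.
barcode_pair = column_8.split(" ")[0]
print(barcode_pair)
# AACGGT+TCCTTA
```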

You can fix that afterwards with this find command, which replaces the spaces in the file names with underscores:

$ find . -type f -name "* *.fq" -exec bash -c 'mv "$0" "${0// /_}"' {} \;
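
Note that column 8 only covers barcode1+barcode2, not the i7 index. If the goal is one file per sampleID keyed on all three barcodes, as in the original question, a small Python sketch along these lines may be a starting point. Everything here is an assumption for illustration: the sample sheet is taken to be a CSV with the columns `sample_id,barcode1,barcode2,i7`, matching is exact (no mismatches allowed), reads without a match go to `unassigned.fq`, and one output handle per sample is kept open, which is only reasonable for a modest number of samples. The function names are hypothetical.

```python
import csv
import re
from pathlib import Path

# "<read name>:<barcode1>+<barcode2> <read>:<filter>:<control>:<index>"
HEADER_RE = re.compile(r"^@\S+:([ACGTN]+)\+([ACGTN]+) \S*:([ACGTN+]+)$")


def load_sample_sheet(path):
    """Map (barcode1, barcode2, i7) -> sample ID from a CSV with a header row."""
    with open(path, newline="") as handle:
        return {
            (row["barcode1"], row["barcode2"], row["i7"]): row["sample_id"]
            for row in csv.DictReader(handle)
        }


def demultiplex(fastq_path, sheet, out_dir):
    """Write each 4-line record to <out_dir>/<sample_id>.fq; return read counts."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles, counts = {}, {}
    try:
        with open(fastq_path) as fastq:
            while True:
                record = [fastq.readline() for _ in range(4)]
                if not record[0]:  # end of file
                    break
                match = HEADER_RE.match(record[0].rstrip("\n"))
                key = match.groups() if match else None
                sample = sheet.get(key, "unassigned")
                if sample not in handles:
                    handles[sample] = open(out_dir / f"{sample}.fq", "w")
                handles[sample].writelines(record)
                counts[sample] = counts.get(sample, 0) + 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

A call such as `demultiplex("R1.fq", load_sample_sheet("samples.csv"), "demux_R1")` (hypothetical file names) would process one read file; for paired-end data the same call can be repeated for R2 with a different output directory so the file names do not collide.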