Question: Demultiplex Illumina run using custom index configuration
Asked 5 weeks ago by alba.rodriguezmeira (Oxford):

Hi all,

I am trying to demultiplex an Illumina run in which I have introduced barcode sequences in a custom configuration:

R1 - 6 bp (barcode 1) + 144 bp
R2 - 6 bp (barcode 2) + 144 bp

index read - i7 (8 bp)

I have used bcl2fastq to introduce the sequences of each barcode (barcode1+barcode2+i7) in the header of the read.

Bcl2fastq options:

Read1StartFromCycle,7,,,,,,
Read2StartFromCycle,7,,,,,,
Read1UMILength,6,,,,,,
Read2UMILength,6,,,,,,
Read1UMIStartFromCycle,1,,,,,,
Read2UMIStartFromCycle,1,,,,,,

Fastq line example:

@M01913:344:000000000-CGVBP:1:1101:17206:1578:AACGGT+TCCTTA 1:N:0:CTAAGTCATG
CTTAACCCCTCCTCCCAGAGACCCCAGTTGCAAACCAGACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGAAAAGCGTTTCTGTCATCCAAATACTCCACACGCAAATTTCCTTCCACTCGGATAAGATGCTGAGGAGG
+
CCFFFFGGGGGGGGGGHHGHGHGHGGGHHHHHHHHGHHHGHHHGHHHGGGGGHHHHHHHGGGHHGHHHGHHHHHHHHGGGHHGGGGHHHHHHHHHHHHHHHHHHHHHGGGGGGHHHHHHHHHHHHHGGGGGGHHHHHHHFHGFFG

However, I can't find any suitable tool to further demultiplex these reads into individual fastq files corresponding to each unique barcode combination. Ideally, I would provide a sample sheet containing a sampleID and unique barcode combination (barcode1+barcode2+i7), and get individual fastq files named with the sampleID provided.

Any help/comments would be highly appreciated!
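
For illustration, the kind of tool described above could be sketched as a small awk filter. This is only a sketch under several assumptions: the sample sheet is a hypothetical tab-separated file with the columns sampleID, barcode1, barcode2 and i7 (no header line, at least one row); the FASTQ is uncompressed with exactly four lines per record; and the headers look like the example above, with barcode1+barcode2 after the last colon of the read ID and the i7 after the last colon of the comment. The function name `demux_by_sheet` and the `unassigned` bucket are made up here.

```shell
# Sketch: split one FASTQ into per-sample files using a barcode sample sheet.
# Usage: demux_by_sheet SHEET.tsv READS.fastq OUTDIR [SUFFIX]
# Assumes OUTDIR already exists and SHEET.tsv is not empty.
demux_by_sheet() {
  sheet=$1; fastq=$2; outdir=$3; suffix=${4:-.fastq}
  awk -F'\t' -v outdir="$outdir" -v suffix="$suffix" '
    # First file (sample sheet): map "barcode1+barcode2+i7" to the sample ID.
    NR == FNR { sample[$2 "+" $3 "+" $4] = $1; next }
    # Header line of each 4-line record: rebuild the barcode key.
    FNR % 4 == 1 {
      split($0, part, " ")             # part[1] = read ID, part[2] = comment
      k = match(part[1], /:[^:]*$/)    # position of the last colon in the ID
      inline = substr(part[1], k + 1)  # barcode1+barcode2
      n = split(part[2], c, ":")       # c[n] = i7 index sequence
      key = inline "+" c[n]
      name = (key in sample) ? sample[key] : "unassigned"
      out = outdir "/" name suffix
    }
    # Every line of the record goes to the file chosen at its header.
    { print > out }
  ' "$sheet" "$fastq"
}
```

A call such as `demux_by_sheet samples.tsv R1.fastq demux _R1.fastq` (placeholder file names) would write `demux/<sampleID>_R1.fastq` plus `demux/unassigned_R1.fastq`. If the R2 headers carry the same barcode string, the same function could be run a second time on R2 with a different suffix. Only exact barcode matches are assigned, and awk keeps every output file open, so this suits a modest number of samples.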

Tags: sequencing, next-gen
modified 4 weeks ago by genomax • written 5 weeks ago by alba.rodriguezmeira

You can try demuxbyname.sh from the BBMap suite. Run the program without any options to see its in-line help, and see whether you can work it out from there. Otherwise I will do some more testing later when I have time.

$ demuxbyname.sh

Written by Brian Bushnell
Last modified Jan 7, 2020

Description:  Demultiplexes sequences into multiple files based on their names,
substrings of their names, or prefixes or suffixes of their names.
Allows unlimited output files while maintaining only a small number of open file handles.

Usage:
demuxbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string...>

Alternately:
demuxbyname.sh in=<file> out=<outfile> delimiter=whitespace prefixmode=f
This will demultiplex by the substring after the last whitespace.

demuxbyname.sh in=<file> out=<outfile> length=8 prefixmode=t
This will demultiplex by the first 8 characters of read names.

demuxbyname.sh in=<file> out=<outfile> delimiter=: prefixmode=f
This will split on colons, and use the last substring as the name; useful for
demuxing by barcode for Illumina headers in this format:
@A00178:73:HH7H3DSXX:4:1101:13666:1047 1:N:0:ACGTTGGT+TGACGCAT
modified 5 weeks ago • written 5 weeks ago by genomax

A second suggestion is to skip moving the inline barcodes into the FASTQ headers by removing the bcl2fastq options you listed above.

Then use sabre (https://github.com/najoshi/sabre) to demultiplex the data.

This will definitely work.

modified 5 weeks ago • written 5 weeks ago by genomax

Hi!

Thanks so much for the response. I'll give demuxbyname.sh a try, but my feeling is that I'll first have to reformat the headers to get all the barcodes in the right position, rather than where they are at the moment:

@M01913:344:000000000-CGVBP:1:1101:17206:1578:AACGGT+TCCTTA 1:N:0:CTAAGTCATG

Reformatting to something like this might work:

@M01913:344:000000000-CGVBP:1:1101:17206:1578 1:N:0:CTAAGTCATG+AACGGT+TCCTTA

I am not super familiar with awk/sed, so I wouldn't know how to easily reformat the headers that way. Any comments would be super welcome!
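
A header rewrite of this kind could be sketched with awk roughly as follows. It assumes an uncompressed FASTQ with exactly four lines per record and headers made of a read ID and a comment separated by one space, with the inline barcodes after the last colon of the read ID. The function name `move_inline_barcodes` is invented for this example.

```shell
# Sketch: move the inline barcodes from the end of the read ID
# to the end of the comment. Reads FASTQ on stdin, writes FASTQ on stdout.
move_inline_barcodes() {
  awk '
    NR % 4 == 1 {                      # header line of each 4-line record
      split($0, part, " ")             # part[1] = read ID, part[2] = comment
      k = match(part[1], /:[^:]*$/)    # position of the last colon in the ID
      inline = substr(part[1], k + 1)  # e.g. AACGGT+TCCTTA
      id = substr(part[1], 1, k - 1)   # read ID without the inline barcodes
      print id " " part[2] "+" inline
      next
    }
    { print }                          # sequence, "+", and quality lines unchanged
  '
}
```

It would be used as `move_inline_barcodes < in.fastq > out.fastq`, or on compressed data as `gzip -dc in.fastq.gz | move_inline_barcodes | gzip > out.fastq.gz` (file names are placeholders). Going by the help text quoted earlier, `delimiter=: prefixmode=f` should then pick up the whole `i7+barcode1+barcode2` string as the last colon-separated substring, though that combination was not tested here.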

Unfortunately, sabre only supports the same barcode in forward and reverse reads for paired-end sequencing so that wouldn't work in this case (my R1 and R2 barcodes are always different).

Thanks!

written 4 weeks ago by alba.rodriguezmeira

genomax (United States) wrote, 4 weeks ago:

You should be able to use demuxbyname.sh this way. I made a small dummy file.

$ more dem.fq
@HISEQ:267:CAAV9ANXX:4:1101:10050:2218:AACGGT+TCCTTA 1:N:0:AGTCAA
GTGCGGTCGATATTTTGTATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+
BBBBB/B<F<B/<F/<BF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFBFFFFFFFFFFFFFBFBF<FFFFFFFFFFFFFF<BFFF/BF/FBF<B<<F/7BFBBFFFF/B/BF
@HISEQ:267:CAAV9ANXX:4:1101:10050:2219:AATTGT+TCGGTA 1:N:0:AGTCAA
GTGCGGTCGATATTTTGTATCTTGGGGCCCCAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+
BBBBB/B<F<B/<F/<BF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFBFFFFFFFFFFFFFBFBF<FFFFFFFFFFFFFF<BFFF/BF/FBF<B<<F/7BFBBFFFF/B/BF
@HISEQ:267:CAAV9ANXX:4:1101:10050:2220:TTCGGT+GGCTTA 1:N:0:AGTCAA
GTGCGGTCGATATTTTGTATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+
BBBBB/B<F<B/<F/<BF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFBFFFFFFFFFFFFFBFBF<FFFFFFFFFFFFFF<BFFF/BF/FBF<B<<F/7BFBBFFFF/B/BF

$ demuxbyname.sh -Xmx5g in=dem.fq out=out_%.fq delimiter=: column=8

$ ls out*
out_AACGGT+TCCTTA 1.fq  out_AATTGT+TCGGTA 1.fq  out_TTCGGT+GGCTTA 1.fq

This method has the unfortunate side effect of putting a space in the file names: column 8 (where your inline barcodes are) runs up to the next colon, so it also picks up the space and read number (" 1") that follow the barcode sequences.
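
To see what a given column contains before demultiplexing, the headers can be split on colons with a one-line awk helper. This is a minimal sketch, assuming four-line FASTQ records; `header_column` is a made-up name, and it only mimics the colon splitting, not demuxbyname.sh itself.

```shell
# Sketch: print one colon-separated column (1-based) of every FASTQ header on stdin.
header_column() {
  awk -F: -v col="$1" 'NR % 4 == 1 { print $col }'
}
```

On the dummy file above, `header_column 8 < dem.fq` prints `AACGGT+TCCTTA 1`, `AATTGT+TCGGTA 1` and `TTCGGT+GGCTTA 1`, which are exactly the strings that end up in the output file names, while `header_column 11 < dem.fq` prints the index `AGTCAA` for each read.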

You can fix that afterwards with this find command, which replaces the spaces in the file names with underscores:

$ find . -type f -name "* *.fq" -exec bash -c 'mv "$0" "${0// /_}"' {} \;
modified 4 weeks ago • written 4 weeks ago by genomax