Demultiplex Illumina run using custom index configuration
4.2 years ago

Hi all,

I am trying to demultiplex an Illumina run in which I have introduced barcode sequences in a custom configuration:

R1 - 6 bp (barcode 1) + 144 bp
R2 - 6 bp (barcode 2) + 144 bp

index read - i7 (8 bp)

I have used bcl2fastq to introduce the sequences of each barcode (barcode1+barcode2+i7) in the header of the read.

Bcl2fastq options:

Read1StartFromCycle,7,,,,,,
Read2StartFromCycle,7,,,,,,
Read1UMILength,6,,,,,,
Read2UMILength,6,,,,,,
Read1UMIStartFromCycle,1,,,,,,
Read2UMIStartFromCycle,1,,,,,,

Fastq line example:

@M01913:344:000000000-CGVBP:1:1101:17206:1578:**AACGGT**+**TCCTTA** 1:N:0:**CTAAGTCATG**

CTTAACCCCTCCTCCCAGAGACCCCAGTTGCAAACCAGACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGAAAAGCGTTTCTGTCATCCAAATACTCCACACGCAAATTTCCTTCCACTCGGATAAGATGCTGAGGAGG

+

CCFFFFGGGGGGGGGGHHGHGHGHGGGHHHHHHHHGHHHGHHHGHHHGGGGGHHHHHHHGGGHHGHHHGHHHHHHHHGGGHHGGGGHHHHHHHHHHHHHHHHHHHHHGGGGGGHHHHHHHHHHHHHGGGGGGHHHHHHHFHGFFG

However, I can't find any suitable tool to further demultiplex these reads into individual fastq files corresponding to each unique barcode combination. Ideally, I would provide a sample sheet containing a sampleID and unique barcode combination (barcode1+barcode2+i7), and get individual fastq files named with the sampleID provided.

Any help/comments would be highly appreciated!

next-gen sequencing • 2.3k views

You can try demuxbyname.sh from the BBMap suite. Run the program without any options and read the in-line help. Give it a try and see if you can figure this out. Otherwise, I will do some more testing later when I have time.

$ demuxbyname.sh

Written by Brian Bushnell
Last modified Jan 7, 2020

Description:  Demultiplexes sequences into multiple files based on their names,
substrings of their names, or prefixes or suffixes of their names.
Allows unlimited output files while maintaining only a small number of open file handles.

Usage:
demuxbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string...>

Alternately:
demuxbyname.sh in=<file> out=<outfile> delimiter=whitespace prefixmode=f
This will demultiplex by the substring after the last whitespace.

demuxbyname.sh in=<file> out=<outfile> length=8 prefixmode=t
This will demultiplex by the first 8 characters of read names.

demuxbyname.sh in=<file> out=<outfile> delimiter=: prefixmode=f
This will split on colons, and use the last substring as the name; useful for
demuxing by barcode for Illumina headers in this format:
@A00178:73:HH7H3DSXX:4:1101:13666:1047 1:N:0:ACGTTGGT+TGACGCAT

A second suggestion is to skip moving the inline barcodes into the FASTQ headers by removing the bcl2fastq options you listed above.

Then use sabre (https://github.com/najoshi/sabre ) to demultiplex the data.

This will definitely work.


Hi!

Thanks so much for the response. I'll give demuxbyname.sh a try, but my feeling is that I'll first have to reformat the headers to get all the barcodes in the right position, rather than where they are at the moment:

@M01913:344:000000000-CGVBP:1:1101:17206:1578:AACGGT+TCCTTA 1:N:0:CTAAGTCATG

Reformatting to something like this might work:

@M01913:344:000000000-CGVBP:1:1101:17206:1578 1:N:0:CTAAGTCATG+AACGGT+TCCTTA

I am not super familiar with awk/sed so I wouldn't know how to easily reformat the header in that sense. Any comments would be super welcome!
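
One way to do that reformatting, as a minimal sketch rather than a tested pipeline: the awk function below touches only the header lines (every fourth line, starting with the first) and moves the last colon-separated field of the read ID to the end of the comment. The function name `reformat_header` is made up for illustration, and the sketch assumes four-line FASTQ records with headers laid out exactly like the example above.

```shell
# Hypothetical helper: move the inline barcodes from the end of the read ID
# to the end of the comment, so that
#   @...:1578:AACGGT+TCCTTA 1:N:0:CTAAGTCATG
# becomes
#   @...:1578 1:N:0:CTAAGTCATG+AACGGT+TCCTTA
# Assumes four-line FASTQ records (no wrapped sequence lines).
reformat_header() {
  awk 'NR % 4 == 1 {
         sp = index($0, " ")              # position of the first space
         if (sp == 0) { print; next }     # no comment part: leave header as-is
         id = substr($0, 1, sp - 1)       # read ID, including the barcodes
         comment = substr($0, sp + 1)     # e.g. 1:N:0:CTAAGTCATG
         n = split(id, f, ":")
         bc = f[n]                        # last field, e.g. AACGGT+TCCTTA
         id = substr(id, 1, length(id) - length(bc) - 1)   # drop ":<bc>"
         print id " " comment "+" bc
         next
       }
       { print }' "$@"
}
```

It reads from the files given as arguments or from standard input, for example `reformat_header reads_R1.fastq > reads_R1.reformatted.fastq` (the file names are placeholders). According to the in-line help quoted above, `delimiter=: prefixmode=f` should then pick up the combined `CTAAGTCATG+AACGGT+TCCTTA` string as the name, though I have not verified that on real data.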

Unfortunately, sabre only supports the same barcode in forward and reverse reads for paired-end sequencing so that wouldn't work in this case (my R1 and R2 barcodes are always different).

Thanks!

4.2 years ago
GenoMax 141k

You should be able to use demuxbyname.sh this way. I made a small dummy file.

$ more dem.fq
@HISEQ:267:CAAV9ANXX:4:1101:10050:2218:AACGGT+TCCTTA 1:N:0:AGTCAA
GTGCGGTCGATATTTTGTATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+
BBBBB/B<F<B/<F/<BF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFBFFFFFFFFFFFFFBFBF<FFFFFFFFFFFFFF<BFFF/BF/FBF<B<<F/7BFBBFFFF/B/BF
@HISEQ:267:CAAV9ANXX:4:1101:10050:2219:AATTGT+TCGGTA 1:N:0:AGTCAA
GTGCGGTCGATATTTTGTATCTTGGGGCCCCAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+
BBBBB/B<F<B/<F/<BF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFBFFFFFFFFFFFFFBFBF<FFFFFFFFFFFFFF<BFFF/BF/FBF<B<<F/7BFBBFFFF/B/BF
@HISEQ:267:CAAV9ANXX:4:1101:10050:2220:TTCGGT+GGCTTA 1:N:0:AGTCAA
GTGCGGTCGATATTTTGTATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+
BBBBB/B<F<B/<F/<BF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFBFFFFFFFFFFFFFBFBF<FFFFFFFFFFFFFF<BFFF/BF/FBF<B<<F/7BFBBFFFF/B/BF

$ demuxbyname.sh -Xmx5g in=dem.fq out=out_%.fq delimiter=: column=8

$ ls out*
out_AACGGT+TCCTTA 1.fq  out_AATTGT+TCGGTA 1.fq  out_TTCGGT+GGCTTA 1.fq

This method has the unfortunate side effect of introducing a space in the file names, because we are splitting on colons and using column 8 (where your UMIs are), which also contains the space and read number that follow the barcode sequences.

You can take care of that afterwards with this find command, which replaces the spaces in the file names with underscores:

$ find . -type f -name "* *.fq" -exec bash -c 'mv "$0" "${0// /_}"' {} \;
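
To get files named after sample IDs, as asked in the original question, the demultiplexed files could then be renamed from a sample sheet. The following is only a sketch under stated assumptions: the sample sheet is a hypothetical two-column, tab-separated file (`sampleID<TAB>barcode1+barcode2`), the outputs are named `out_<barcodes>_1.fq` as they are after the space replacement, and `rename_by_samplesheet` is an illustrative name, not part of BBMap.

```shell
# Hypothetical helper: rename out_<barcodes>_1.fq to <sampleID>.fq.
# Usage: rename_by_samplesheet SHEET [DIR]
# SHEET is assumed to be tab-separated: sampleID<TAB>barcode1+barcode2
rename_by_samplesheet() {
  sheet=$1
  dir=${2:-.}
  while IFS="$(printf '\t')" read -r sample barcodes || [ -n "$sample" ]; do
    # skip blank lines and lines starting with '#'
    case $sample in ''|'#'*) continue ;; esac
    src="$dir/out_${barcodes}_1.fq"
    if [ -f "$src" ]; then
      mv -- "$src" "$dir/${sample}.fq"
    else
      # report barcode combinations that produced no output file
      printf 'no file for %s (%s)\n' "$sample" "$barcodes" >&2
    fi
  done < "$sheet"
}
```

For example, `rename_by_samplesheet samples.tsv .` would turn `out_AACGGT+TCCTTA_1.fq` into `sampleA.fq` if `samples.tsv` contains the line `sampleA<TAB>AACGGT+TCCTTA`. Files from R2 would presumably end in `_2.fq` instead, so the pattern would need adjusting for those.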