Can Illumina bcl2fastq use only one index for demultiplexing dual index sequencing data?
4
0
4.8 years ago
chen ★ 2.4k

Hi,

For Illumina sequencing data with dual indexes (151 read1 + 8 index1 + 8 index2 + 151 read2), the conventional demultiplexing method is to set both index1 and index2 for each sample.

However, for some data (e.g. a UMI in index2), only index1 is fixed and index2 is random, so there is no way to set both index1 and index2 in the sample sheet.

In such a case, is it possible to demultiplex using only index1? It seems bcl2fastq doesn't support such a setting. Does anyone have any experience with this?

bcl2fastq demultiplexing index • 8.8k views
2
4.8 years ago
GenoMax 120k

bcl2fastq handles UMIs that are part of Read 1/2. I am not sure how you are getting them in index 2.

A couple of possibilities come to mind.

1. You could set a bases mask such as --use-bases-mask Y*,I8,n*,Y*. This would demux the data based on index 1 but still retain the sequence of index 2 in read headers. You can then parse the index sequences in the header and create a new SampleSheet.csv to re-demux the original data, or use something else to do a second round of demux with the data from round 1.

2. You could leave the data non-demultiplexed, creating separate files for the index reads. Then demux the data afterwards using reads 2 and 3.

Will random indexes be shared by more than one index 1?

1

Yes, different samples with different index 1 can have the same random index 2.

Currently I demultiplex all data to Undetermined and split the FASTQ file by its index 1. But it's time-consuming.
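For anyone wanting to try the same workaround, here is a minimal sketch of splitting an Undetermined FASTQ by index 1. It assumes the header ends in the usual `...:<index1>+<index2>` barcode field and matches index 1 exactly (no mismatch tolerance). The function names and the index-to-sample mapping are hypothetical, not anything from bcl2fastq.

```python
import gzip
from contextlib import ExitStack
from pathlib import Path


def index1_of(header: str) -> str:
    """Return index 1 from a header whose last ':' field is '<index1>+<index2>'."""
    barcode = header.rstrip().rsplit(":", 1)[-1]
    return barcode.split("+", 1)[0]


def split_by_index1(fastq_gz, sample_by_index1, out_dir):
    """Write each 4-line record to <out_dir>/<sample>.fastq.gz by exact index 1 match.

    Records whose index 1 is not in sample_by_index1 are skipped.
    Returns a dict of sample name -> number of records written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    with ExitStack() as stack:
        reads = stack.enter_context(gzip.open(fastq_gz, "rt"))
        handles = {}  # one output handle per sample, opened on first use
        while True:
            record = [reads.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            sample = sample_by_index1.get(index1_of(record[0]))
            if sample is None:
                continue
            if sample not in handles:
                out_path = out_dir / f"{sample}.fastq.gz"
                handles[sample] = stack.enter_context(gzip.open(out_path, "wt"))
            handles[sample].writelines(record)
            counts[sample] = counts.get(sample, 0) + 1
    return counts
```

A call would look like `split_by_index1("Undetermined_S0_R1_001.fastq.gz", {"ATCACGAT": "sampleA"}, "split")`, where the file name and barcode are placeholders. Being single-threaded Python over gzip, this will still be slow on NovaSeq-sized files.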

I may try to alter bcl2fastq source code to support index 1 based demultiplexing for dual index data.

0

How many random indexes are generally expected in index 2 (tens, hundreds or more)? Doing #1 in my comment above may be faster if the number of index 2 sequences is manageable.

0

Thousands or even more.

0

I think doing #1 is probably going to be the fastest option. One can easily collect index combinations from the resulting files from round 1 of demultiplexing. Since you work with NovaSeq the data files must be huge.

0

Biologically speaking, how are you even getting the UMI in index read 2?

1

Maybe it is something like this:

0

Yes, with customized primers.

0

Ah, that'll definitely break Illumina's software.

1
4.6 years ago
Gabriel R. ★ 2.8k

You could simply use deML: https://grenaud.github.io/deML/

It is a maximum-likelihood demultiplexer designed to deal with incomplete or noisy data.

Hope this helps.

0

This is not noisy data but an unusual modification where the UMI is in the second index read.

0

It was a general statement rather than a comment about the nature of OP's data :-) Just do not demultiplex with the second index and simply use the first one. That will give you demultiplexing based only on the information provided by the first index.

0
4.8 years ago
h.mon 34k

I am not sure this will work, but you can try bcl2fastq with the parameters --create-fastq-for-index-reads and --use-bases-mask Y151,I8,n8,Y151.

Worst case, you will have to use --create-fastq-for-index-reads and --use-bases-mask Y151,I8,I8,Y151, then join all reads from the same index1 and use index2 as the UMI.
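As a rough sketch of the "use index2 as UMI" step, the snippet below walks a read FASTQ and the matching index 2 FASTQ in lockstep and appends the index 2 sequence to each read name. The `_<UMI>` suffix on the read name is an assumed convention (some UMI tools read it from there), and the function names are hypothetical. It relies on both files listing the same reads in the same order.

```python
import gzip
from itertools import islice


def fastq_records(handle):
    """Yield (header, sequence, plus, quality) tuples from an open FASTQ handle."""
    while True:
        record = tuple(line.rstrip("\n") for line in islice(handle, 4))
        if len(record) < 4:  # end of file (or a truncated final record)
            return
        yield record


def tag_with_umi(read_path, index2_path, out_path):
    """Append the index 2 sequence to each read name as '_<UMI>'.

    Returns the number of records written.
    """
    written = 0
    with gzip.open(read_path, "rt") as reads, \
         gzip.open(index2_path, "rt") as umis, \
         gzip.open(out_path, "wt") as out:
        for read, umi_rec in zip(fastq_records(reads), fastq_records(umis)):
            header, seq, plus, qual = read
            name, _, comment = header.partition(" ")
            # Guard against the two files being out of sync.
            if name != umi_rec[0].partition(" ")[0]:
                raise ValueError(f"read names differ: {name} vs {umi_rec[0]}")
            new_header = f"{name}_{umi_rec[1]}"
            if comment:
                new_header += f" {comment}"
            out.write(f"{new_header}\n{seq}\n{plus}\n{qual}\n")
            written += 1
    return written
```

The same function can be run on Read 1 and Read 2 separately so both mates carry the UMI.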

0
4.6 years ago
petervangalen ▴ 170

You can specify which reads should be used for demultiplexing in RunInfo.xml, which may be more convenient than --use-bases-mask. I had a run with i7 (first index, Read#2) and i5 (second index, Read#3) but I only wanted to use i7 for demultiplexing.

1. Make a backup copy of RunInfo.xml, which is in the run folder with the SampleSheet etc.

2. Open RunInfo.xml and change the following:

<Read Number="3" NumCycles="8" IsIndexedRead="Y"/>

to

<Read Number="3" NumCycles="8" IsIndexedRead="N"/>

3. Update SampleSheet.csv so it has only one barcode column

4. Run bcl2fastq as you normally would

5. The output was demultiplexed by i7 (first index, Read#2) and contained fastq files for three reads:

…_R1_…fastq.gz for Read#1

…_R2_…fastq.gz for i5 (second index, Read#3) that I didn't want to use for indexing

…_R3_…fastq.gz for Read#4 (it was a paired-end run)
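If you would rather script the backup and the RunInfo.xml change (steps 1 and 2) than edit by hand, something like the sketch below should do it. It is a plain text substitution on the `<Read ...>` tag so the rest of the file stays byte-for-byte the same. The function name and the `.bak` suffix are my own choices, and it assumes the attributes are written with double quotes as in the snippet above.

```python
import re
import shutil
from pathlib import Path

# Matches a single <Read ...> tag but not the enclosing <Reads> element.
READ_TAG = re.compile(r"<Read\b[^>]*>")


def unset_index_read(run_info_path, read_number=3):
    """Flip IsIndexedRead from "Y" to "N" on the given <Read> in RunInfo.xml.

    A backup (<name>.bak) is made once, before the first change.
    Returns True if the file was modified.
    """
    path = Path(run_info_path)
    backup = path.with_name(path.name + ".bak")
    if not backup.exists():
        shutil.copy2(path, backup)

    text = path.read_text()

    def patch(match):
        tag = match.group(0)
        if f'Number="{read_number}"' in tag:
            return tag.replace('IsIndexedRead="Y"', 'IsIndexedRead="N"')
        return tag

    patched = READ_TAG.sub(patch, text)
    if patched != text:
        path.write_text(patched)
    return patched != text
```

Call it as `unset_index_read("RunInfo.xml")` from the run folder, then trim SampleSheet.csv and run bcl2fastq as described in the remaining steps.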

0

Did anyone try this using the i5 for demultiplexing?

0

It will work if you want to ignore the first index. What is the specific use case? Do you have an identical i7 index for all samples, or was that read bad?

1

You are correct, I also had to remove the unnecessary columns from the Sample_Sheet. In this case, i7 is the "UMI" (for a subset of multiplexed samples), but you actually solved the problem in another thread. Since the sequence was so short, it was masked when treated as a read, and thus the added parameter --mask-short-adapter-reads 0 was needed.