Illumina bcl2fastq collisions and archives
6.2 years ago
blawney ▴ 10

TL;DR: Looking for a way to disable errors caused by possible index collisions in Illumina's bcl2fastq v2.20... OR... does anyone know where older versions (e.g. 2.15) are archived?

While upgrading our internal demultiplex pipeline, I decided to upgrade to Illumina's latest bcl2fastq v2.20 software. When running through a recent flowcell, it failed, reporting:

2018-01-26 14:46:34,158:INFO:b'std::exception::what: Barcode collision for barcodes: GTGAAAC, GTGATCC\n'
2018-01-26 14:46:34,158:INFO:b"By default, bcl2fastq allows 1 mismatch in each barcode. Barcodes with too few mismatches are ambiguous ( less than 2 times the number of mismatches plus 1). To reduce the number of allowed mismatches, use the command line option: '--barcode-mismatches'. Note that particularly for barcodes with only 1 mismatch, there is the danger that some reads will be written to the wrong sample due to errors in the barcode sequence.\n"

Due to high levels of multiplexing in some amplicon settings (as well as libraries prepared externally), it's very likely we will encounter situations (like the one above) where two indexes have an edit distance of 2 (not ideal, but that's just reality). In prior versions (we were using 2.15), demultiplexing such a run with a mismatch tolerance of 1 was apparently permitted (I don't even see any warnings).
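For reference, the rule quoted in the error message can be checked by hand. This is a minimal sketch, assuming the collision test is a plain Hamming-distance comparison against 2 x mismatches + 1, as the message describes; the function names are illustrative and not part of bcl2fastq.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def collides(a: str, b: str, mismatches: int = 1) -> bool:
    """True if the distance is below 2 * mismatches + 1 (rule from the error message)."""
    return hamming(a, b) < 2 * mismatches + 1


print(hamming("GTGAAAC", "GTGATCC"))                 # 2
print(collides("GTGAAAC", "GTGATCC", mismatches=1))  # True
print(collides("GTGAAAC", "GTGATCC", mismatches=0))  # False
```

With one mismatch allowed, an observed index such as GTGATAC is a single substitution away from both barcodes, which is the ambiguity the check guards against.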

For 2.20, I cannot find any option to WARN rather than completely fail the demux process. Setting index mismatches to zero seems a bit too strict, and I'm comfortable with allowing 1 mismatch AND edit distances of >=2.

The alternative (while sticking with illumina software) is to simply fall back to 2.15/16, but I cannot find any downloads or archives. Anyone have any idea where those might be kept?

Thanks!

sequencing next-gen • 3.7k views

Why don't you just specify the mismatch allowance with the --barcode-mismatches option, as suggested in the error message? All versions of bcl2fastq since at least 2.0 have this behavior.


That's not what the poster wants. He wants the index AGTAAAC to be recognized as a one-off of GTGAAAC; setting mismatches to zero will prevent that.


Ah, I totally misunderstood that, thanks.

6.2 years ago

"I'm comfortable with allowing 1 mismatch AND edit distances of >=2."

Unfortunately, there doesn't seem to be a way to make the software comfortable with this. And, yes, it's dumb. It should be able to correctly assign "AGTAAAC", and just throw away reads with indices like "GTGATAC".

The only workaround I can see is running the pipeline twice, omitting one offending index the first time and the other the second time, then manually removing all the reads with ambiguous indices.


Or run the pipeline only once without providing any indexes, and then separate the reads from the Undetermined file.
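A rough sketch of what that post-hoc separation could look like. It assumes single-index reads whose observed index is the last colon-separated field of each FASTQ header line, and a hypothetical `barcodes` mapping of sample name to index sequence; neither the function names nor the mapping come from bcl2fastq. A read is assigned only when exactly one barcode lies within the mismatch limit, so ambiguous reads are dropped rather than guessed.

```python
import gzip


def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign(observed, barcodes, max_mismatches=1):
    """Return the single sample whose barcode is within max_mismatches, else None."""
    hits = [name for name, bc in barcodes.items()
            if len(bc) == len(observed) and hamming(bc, observed) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


def iter_assigned(fastq_gz, barcodes, max_mismatches=1):
    """Yield (sample, record) for each unambiguously assignable FASTQ record."""
    with gzip.open(fastq_gz, "rt") as handle:
        while True:
            record = [handle.readline() for _ in range(4)]  # FASTQ = 4 lines per read
            if not record[0]:
                break  # end of file
            # Assumption: the header ends with ":<index sequence>"
            observed = record[0].strip().rsplit(":", 1)[-1]
            sample = assign(observed, barcodes, max_mismatches)
            if sample is not None:
                yield sample, "".join(record)
```

Each yielded record could then be appended to a per-sample output file. Dual-index runs would need the two index reads split and compared separately, which this sketch does not attempt.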


Sure, but that involves re-inventing the wheel with regard to demultiplexing and dealing with all the potential one-offs. Someone else has already done that, and optimized it to be fast. It might be easier to use the software they already have to do the heavy lifting, and then trim away what isn't wanted, assuming that the number of index clashes is modest.


Hmm, I'll think about whether there's a way I can massage that idea into a robust process; thanks for the suggestion. It might be a little tricky since it's run as a cron job (the sequencer automatically loads to a local server) and I don't know beforehand if someone made a library that might cause this error. (Sure, I could write a script to parse the Samplesheet.csv and do all this... but that seems like a lot of extra/unnecessary work!)
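For what it's worth, the sample-sheet check itself is fairly small. A minimal sketch, assuming a sheet with a `[Data]` section whose columns include `Sample_ID` and `index` (plus an optional `Lane`), and assuming collisions only matter between samples in the same lane; the function names are made up for illustration.

```python
import csv
from itertools import combinations


def read_indexes(sample_sheet_path):
    """Return (lane, sample_id, index) tuples from the [Data] section."""
    with open(sample_sheet_path, newline="") as handle:
        rows = list(csv.reader(handle))
    # Locate the [Data] marker; the row after it holds the column names.
    start = next(i for i, row in enumerate(rows) if row and row[0].strip() == "[Data]")
    header = [name.strip() for name in rows[start + 1]]
    entries = []
    for row in rows[start + 2:]:
        if not row or not any(cell.strip() for cell in row):
            continue  # skip blank lines
        if row[0].startswith("["):
            break  # a new section begins
        rec = dict(zip(header, row))
        entries.append((rec.get("Lane", ""), rec["Sample_ID"], rec["index"].strip().upper()))
    return entries


def find_collisions(entries, mismatches=1):
    """List same-lane pairs whose indexes differ at fewer than 2*mismatches+1 positions."""
    collisions = []
    for (lane_a, id_a, idx_a), (lane_b, id_b, idx_b) in combinations(entries, 2):
        if lane_a != lane_b or len(idx_a) != len(idx_b):
            continue
        distance = sum(x != y for x, y in zip(idx_a, idx_b))
        if distance < 2 * mismatches + 1:
            collisions.append((lane_a, id_a, id_b, distance))
    return collisions
```

A cron wrapper could call `find_collisions(read_indexes(path))` before launching the demux and, if the list is non-empty, choose a fallback such as splitting the run or lowering the mismatch setting.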

It does seem strange that they locked down that option. Their earlier versions didn't warn of collisions at all. I suppose I could modify the cpp file where it throws this exception, but that seems... ummm, risky.

At this point, falling back to v2.15 is looking pretty attractive (if I can find the source), especially since we do not have newer machines that require the newer software.


I can see why Illumina did this. We just started working with S4 flowcells, and each lane generates ~2.5 billion clusters; if you are not using the XP workflow, that is ~10 billion clusters per FC. When you have that much sequence (and the possibility of index sequence swaps with patterned FCs), being very conservative with index assignments is prudent.

While you don't want to depend on third-party software for a production pipeline, you may also want to take a look at deML, available here, as a possible option.
