Question

Tool: Introducing CrossBlock, a BBTool for removing cross-contamination

7.2 years ago

Brian Bushnell 20k

Illumina reads typically have short barcodes of around 8bp. This is fine when you are sequencing a couple of people with unamplified WGS on a full flowcell. However, Illumina platforms have a non-negligible rate of misassigned barcodes. The reason for this is still not clear; I suspect that some of it is sequencing error, some is impure reagents, and some is adapters breaking off, floating around, and ligating to reads from the wrong library. Regardless, crosstalk rates differ between platforms. HiSeq 2500 seems to have much higher crosstalk than NextSeq, but this is difficult to validate because different runs give different results. Currently, JGI is operating under the assumption that NextSeq gives the lowest crosstalk of all Illumina platforms, and JGI sequences crosstalk-sensitive things on NextSeq despite its much lower quality compared to HiSeq 2500.

JGI does a lot of single-cell sequencing. These cells are lysed and MDA-amplified prior to sequencing; the result is an exponential range of coverage, which is very spiky. If you are just sequencing a single organism in a run, it doesn't matter. But JGI sequences 92 individual single cells on a 96-well plate, all multiplexed together. If there is no crosstalk, that's fine; you get 92 kind of bad assemblies (hopefully 60% genome recovery for each well). But there is a significant amount of crosstalk. This causes huge problems with assembly: even a 0.01% rate of crosstalk can result in 50% or more of non-target genome in your assembly, due to MDA's spikiness.

0.01% crosstalk is not important when you multiplex 10 humans and only care about heterozygous or homozygous calls (though of course it is still crucial when looking for low-allele-fraction variants). But for single-cell sequencing, it is deal-breaking. The current best single-cell assembler (for Illumina reads) is SPAdes. It can handle MDA bias, which will yield 1x coverage in some places and 100,000x coverage in other places. That means that 0.01% crosstalk will carry 10x coverage of a 100,000x region from one multiplexed sample into every other sample. As a result, they will all assemble the same contig, which was derived from some other organism. So, you get false results.
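To make the arithmetic concrete, here is a minimal sketch. The function and variable names are made up for illustration; the numbers are the ones from the paragraph above.

```python
def crosstalk_coverage(source_coverage, crosstalk_rate):
    """Coverage leaked into another library from one region of the source library."""
    return source_coverage * crosstalk_rate

peak_coverage = 100_000   # coverage at an MDA spike in the source library
crosstalk_rate = 0.0001   # 0.01% of reads end up under the wrong barcode

leaked = crosstalk_coverage(peak_coverage, crosstalk_rate)
print(f"{leaked:.1f}x")   # prints 10.0x

# Parts of the recipient's own genome may sit at only ~1x because of MDA bias,
# so the leaked region can be covered more deeply than the real target.
own_low_coverage = 1
print(leaked > own_low_coverage)   # prints True
```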

This is a fundamental limitation of current technology. Reagents are impure (meaning that not 100% of your adapters carry the barcodes you expect), sequencing platforms are inaccurate (Illumina base-calling is very sensitive to leading and trailing bases; with an 8-bp barcode, you basically get 6 "decent" bases), and, as far as I can tell, adapters do in fact break off and ligate to something else.

There is no overall solution to this. However! If you are doing multiplexed single-cell sequencing on Illumina platforms, I can recommend this:

1) Allow zero barcode mismatches when demultiplexing. This is absolutely crucial. You will, of course, end up with far more unbinned reads, but that's just the price of correctness.

2) Use NextSeq. In our tests, it has yielded the lowest crosstalk rate of NextSeq/HiSeq2500/MiSeq. The error rate is vastly higher than HiSeq2500, of course, but in this situation crosstalk is more important.

3) Run CrossBlock. In synthetic tests, it eliminates 100% of contaminant contigs, with a false-positive removal rate of 0.03% (ignoring contigs under 500bp). This assumes that you multiplexed different organisms; with identical organisms, the false-positive rate will increase. Still, it can usually deal with 2-3 copies of an organism with no false positive removals. More than that is dicey. It will remove some contigs, but they will still be present somewhere. In practice, I have found that CrossBlock retains contigs somewhere (meaning, at least one copy of a sequence exists) even when there are 20 copies of the same organism.
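As a rough illustration of recommendation 1, the sketch below shows what "zero barcode mismatches" means in practice. The barcodes and function names are hypothetical, and this is not how any particular demultiplexer is implemented; use the mismatch setting of whatever tool you actually run.

```python
def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_barcode(observed, expected, max_mismatches=0):
    """Return the single expected barcode within max_mismatches, else None (unbinned)."""
    hits = [bc for bc in expected
            if len(bc) == len(observed) and hamming(observed, bc) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None

expected = ["ACGTACGT", "TTGCAAGC"]   # hypothetical 8-bp barcodes

print(assign_barcode("ACGTACGT", expected))                    # exact match: binned
print(assign_barcode("ACGTACGA", expected))                    # one error: unbinned (None)
print(assign_barcode("ACGTACGA", expected, max_mismatches=1))  # tolerated: binned anyway
```

With `max_mismatches=0`, a read whose barcode has even one wrong base goes to the unbinned pile instead of being assigned to a library on a guess.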

What does CrossBlock do?

It compares coverage of contigs from the library that generated them, to coverage from all other libraries. If the coverage from other libraries is dramatically higher, a contig is considered a contaminant. It's quite simple.
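The idea can be sketched in a few lines. This is only an illustration of the comparison described above, not CrossBlock's actual code: the `min_ratio` cutoff, the data layout, and all names are assumptions made up for the example.

```python
def is_contaminant(home_coverage, other_coverages, min_ratio=10.0):
    """True if some other library covers the contig far more deeply than its own library.

    min_ratio is an assumed cutoff for "dramatically higher", not a CrossBlock default.
    """
    if not other_coverages:
        return False
    return max(other_coverages) > min_ratio * home_coverage

# contig name -> (library it was assembled from, coverage of that contig per library)
contigs = {
    "A_contig1": ("A", {"A": 850.0, "B": 0.1, "C": 0.0}),
    "B_contig7": ("B", {"A": 90000.0, "B": 9.0, "C": 8.5}),
}

flagged = []
for name, (home, cov) in contigs.items():
    others = [c for lib, c in cov.items() if lib != home]
    if is_contaminant(cov[home], others):
        flagged.append(name)

print(flagged)   # prints ['B_contig7']
```

Here `B_contig7` was assembled from library B at about 9x, but library A covers it at 90,000x, so it most likely originated in A and leaked into B.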

When should you use CrossBlock?

You should always use CrossBlock when dealing with different organisms, multiplexed together, where there is spiky coverage (single-cell, but possibly other situations).

When should you not use CrossBlock?

Most of the time. CrossBlock is only relevant to assembling novel genomes. If you are not doing assembly, don't use it. If you are not multiplexing different organisms, don't use it. Particularly, if you are multiplexing lots of things that might be the same organism... Don't use it; it can yield a lot of false-positive removals in that case. It's actually pretty good when you have 2-5 members of the same species on a plate. But if you already know you have a plate of 96 cells that are all different strains of the same species, don't use CrossBlock.

crossblock Contamination mda bbmap • 3.7k views

updated 10 months ago by Ram 43k • written 7.2 years ago by Brian Bushnell 20k

score 0 · Answer 1 · 2017-11-10

Hi Brian,

I'm trying to use CrossBlock to eliminate contaminant contigs from some transcriptome data that I have, and I'm running into some problems. I can't find a ton of troubleshooting info, so I thought I'd ask you directly:

I set up a run for a test set of three libraries as you explain in this post: http://seqanswers.com/forums/archive/index.php/t-50414.html . The program runs, but it stops before it produces a set of clean contigs for each library. It creates an output directory, and it writes a file with the suffix covstats0.txt for each submitted library. Apart from that, and the log file that I created, there is no other discernible output.

Do you have any ideas about what could be going awry? I can provide the log file if that would be useful, but I scoured it for errors and didn't see any.

Thanks very much for your time,

Andrew Wood