Question

Estimating Index Hopping

0

Entering edit mode

6 months ago

Eliveri ▴ 350

I would like to calculate/estimate the rate of Index hopping for my sequencing runs with combinatorial dual indexes.

I have some output files bcl2fastq demultiplexing such as Stats.json

I am currently calculating rate of index hopping as the:

(Unknown_Barcode_Hit_Counts)/(Known_Barcode_Read_Counts + Undetermined_Read_Counts) *100

Where the unknown barcodes look like:

"UnknownBarcodes" : [
  {
   "Lane" : 1,
   "Barcodes" : {
    "GGGGGGGGGGGG+AGATCTCGGTGG" : 5157040,
    "GGGGGGGGGGGG+AGCTCTCGGTGG" : 53080,
    "GGGGGGGGGGGG+AGATCTCGGGGG" : 23560,
    "GGGGGGGGGGGG+GGGGGGGGGGGG" : 25540,
    "TGCTAGGCCATA+TGCTATTGGCTG" : 12000,

...

However I am not sure if this approach is correct for calculating index hopping. And whether I should be filtering out unknown barcodes containing GGGGGGGGGGGG or only count unknown barcodes which are both known indexes just paired incorrectly.

I know bclconvert outputs a Index_Hopping_Counts.csv but I currently do not have such a file and I am unsure if the Index_Hopping_Counts.csv gives you a rate of index hopping.

I have looked at some papers including paper but I could not quite grasp how index hopping is calculated.

bcl2fastq bclconvert • 611 views

ADD COMMENT • link updated 6 months ago by Brian Bushnell 20k • written 6 months ago by Eliveri ▴ 350

score 2 · Answer 1 · 2023-10-16

There's a program in BBTools that might be helpful here. It's designed to be run on the entire lane or just unknown bin of reads which have the barcodes in their headers, and gives results like this:

countbarcodes2.sh in=UNKNOWN.fastq.gz expected=expected.txt out=counts.txt

Barcodes        1000000         (87207 unique)
-codesPerRead   2
-delimiter      +
-expected       61511           (8 unique)
-unexpected     938489          (87199 unique)
-codesWithNs    3899            (804 unique)
-homopolymers   873             (1 unique)
--polyG 873
-badPair        10858           (15 unique)     (0.0109)%
-singleMatch    358815          (12443 unique)  (0.3588)%
-neitherMatch   568816          (74741 unique)  (0.5688)%

Barcode1        1000000         (16974 unique)
-expected       280341          (7 unique)
-unexpected     719659          (16967 unique)
-codesWithNs    2587            (298 unique)
-homopolymers   103110          (2 unique)
--polyA 4
--polyG 103106
-hDist 1,2,3+   29282   79196   505484

Barcode2        1000000         (11695 unique)
-expected       223212          (4 unique)
-unexpected     776788          (11691 unique)
-codesWithNs    3899            (339 unique)
-homopolymers   102092          (1 unique)
--polyG 102092
-hDist 1,2,3+   13976   40726   616095

Here, badPair is pairs where both barcodes are valid both the pair is invalid, and hDist is hamming distance from the closest valid barcode. So badPair is the rate of barcode hopping. Of course, you can't determine where a pair is bad if all possible pairs of valid indexes are valid pairs; it only works when at least SOME combinations are not valid (or not present in that run).

The input files look like this:

zcat UNKNOWN.fastq.gz | head -4
@A00178:519:HMC5WDSX7:3:1101:1362:1000 1:N:0:AATAAAACAT+CATAAACCGT
ANGAGATTACGACCATCCTGATCCAGATACTCCACGACAACAGCTTCATGCCGAACATCTTCACACACATAAACGCGAGCCACCTCCCCAGCGACATCGCCACCAGAAGCAAAGTTAAATCGCACAACATGTATAAGAACCACCTGATCAT
+
:#:FF,F,F:FF:F:F,,:F,FFFFF,F,,FF:,:FF,,F,,,,FF,FF,F:,,,,,F:F,,:,,:::FF,FFFF,,,,:,F:,:F,,,FFFF:,,FFFFF,::,F:,F:,,,F,FF,::,,,:::,FF:,F,:::FF:,:,,,,,FF::,

head expected.txt
GCGTATTATC+GCGTCATTGT
GGTCATTATC+GCGTCATTGT
AAGGACACAT+AAGACACGGG
GCGTCATTAT+CGTCATTGGG
CTGGAGTAAT+GCGTCATTGT
GATACTGGAT+GCGTCATTGT
GCGTCATTAT+AAGGACACGT
TCTTGACGAT+GCGTCATTGT

You don't have to have an expected file, but otherwise you won't get complete results.

You can then run AnalyzeBarcodes on the output "counts.txt" file and get this:

cat leftStats.txt

#Name   Total   BadPair
GCGTCATTAT      75579   20
TCTTGACGAT      70377   3280
AAGGACACAT      54411   5399
GATACTGGAT      47489   2061
GCGTATTATC      14490   45
GGTCATTATC      11728   35
CTGGAGTAAT      6267    18

This is the number of times the barcode appeared total, and the number of times in a "bad pair" configuration, explained above. AnalyzeBarcodes is harder to run though because there's no shellscript for it (since I never though anyone else would use it), but basically, look at the java command emitted for countbarcodes2.sh and replace "CountBarcodes2" with "AnalyzeBarcodes". For example, my command was:

java -ea -Xmx2g -cp /global/cfs/cdirs/bbtools/jgi-bbtools/current/ barcode.CountBarcodes2 in=UNKNOWN.fastq.gz expected=expected.txt out=counts.txt

...so my "analyze" command would be:

java -ea -Xmx2g -cp /global/cfs/cdirs/bbtools/jgi-bbtools/current/ barcode.AnalyzeBarcodes in=counts.txt expected=expected.txt

However, for AnalyzeBarcodes, you don't need to use a file generated by CountBarcodes2. It just needs to look like this:

head counts2.txt

#Barcodes       1000000
#Unique 87207
GGGGGGGGGG+GCGTCATTGT   34320
TCTTGACGAT+GGGGGGGGGG   19747
GCGTCATTAT+GGGGGGGGGG   18082
GGGGGGGGGG+TCTTGACGGT   16366
GCGTATTATC+GCGTCATTGT   14124
AAGGACACAT+GGGGGGGGGG   12734
GATACTGGAT+GGGGGGGGGG   12509
GGTCATTATC+GCGTCATTGT   11404

...so you can make it yourself from the data you already have. The first two lines with # symbols don't need to be present.

score 1 · Answer 2 · 2023-10-16

poly-G barcodes are not contributing any useful information. So unless you have a mix of 1D and 2D barcodes in the same library those should be excluded. They represent no signal.

You could run bclconvert on this flowcell. It is backwards compatible with all sequencers. Index_Hopping_Counts.csv gives you read numbers where indexes appear to have hopped so you can calculate a rate from there.