Question

Processing Barcoded Hiseq Data

0

13.2 years ago

bbio ▴ 90

I recently received some data from an Illumina HiSeq lane in which every read is supposed to start with one of ten different 6 bp barcodes. But when I try to split the data by barcode, more than half of the reads don't match any of the barcodes. I imagine that this is due to the relatively high error rate of HiSeq during the first few cycles.

How do you usually deal with this? I assume one option would be some sort of fuzzy matching of the read sequence to the barcodes (e.g. allowing one mismatched nucleotide), although I'm afraid this might introduce a whole new kind of bias. Is there any software for this out there?

illumina hiseq barcode • 4.2k views

Updated 13.2 years ago by Madelaine Gogol 5.3k • written 13.2 years ago by bbio ▴ 90

score 1 · Answer 1 · 2012-05-15

1

13.2 years ago

Arun 2.4k

Just to be clear, on the HiSeq 2000 the barcode is recorded in each FASTQ read's header line, not as part of the sequence as it used to be on the GAII. Are you extracting the barcodes from the header? This document might help: http://biowulf.nih.gov/apps/CASAVA1_8_Changes.pdf
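As a rough illustration of pulling the barcode out of the header, here is a minimal Python sketch. It assumes CASAVA 1.8-style headers, where the text after the space has the form read:filtered:control:index. The function name index_from_header and both example headers are illustrative only, not taken from the poster's data.

```python
def index_from_header(header):
    """Return the index field of a CASAVA 1.8-style FASTQ header, or None.

    Assumed layout: "@<instrument>:...:<y> <read>:<filtered>:<control>:<index>".
    """
    parts = header.rstrip().split(" ", 1)
    if len(parts) != 2:
        return None  # no second field, e.g. an older header style
    fields = parts[1].split(":")
    if len(fields) != 4 or not fields[3]:
        return None  # second field does not look like read:filtered:control:index
    return fields[3]


# Illustrative headers (not real data from this thread).
new_style = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
old_style = "@HWUSI-EAS100R:6:73:941:1973#0/1"

print(index_from_header(new_style))
print(index_from_header(old_style))
```

If the function returns None for every read, the headers probably carry no index, and the barcodes have to be read from the start of the sequence instead.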

If you are already doing this correctly, have you tried searching for the barcodes while allowing 1 mismatch? Barcodes are supposed to differ from each other by at least 2 mismatches, so you might be able to recover some reads.
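Whether a barcode set really keeps that minimum separation can be checked directly. The sketch below is a hypothetical helper in Python, not part of any tool mentioned here, and the barcode list is made up. It reports the smallest pairwise Hamming distance and the pairs that reach it.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def closest_pairs(barcodes):
    """Return (minimum distance, list of barcode pairs at that distance)."""
    best = None
    pairs = []
    for a, b in combinations(barcodes, 2):
        d = hamming(a, b)
        if best is None or d < best:
            best, pairs = d, [(a, b)]
        elif d == best:
            pairs.append((a, b))
    return best, pairs


# Hypothetical 6 bp barcodes, for illustration only.
barcodes = ["ATCACG", "CGATGT", "TTAGGC", "TGACCA"]
min_dist, pairs = closest_pairs(barcodes)
print(min_dist, pairs)
```

If the minimum distance is only 2, a read with a single error can sit one mismatch away from two different barcodes, so such reads are best treated as ambiguous. A minimum distance of 3 or more avoids that situation.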

0

None of the headers appear to contain a barcode, so I don't think this is the problem. I haven't tried allowing mismatches yet; that is what I meant by fuzzy matching above.

Reply • 13.2 years ago by bbio ▴ 90

0

Which version of CASAVA was used for your FASTQ? Rather, what is your sequencer? Even better, could you paste 1 whole read including header, sequence and quality here?

You don't need full fuzzy matching, as the maximum number of mismatches with which you can safely identify a barcode is just 1. For example, in Perl you could write a subroutine:

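# Hamming distance between two equal-length strings.
sub hd {
    # XOR yields "\0" wherever the characters agree; tr counts those positions.
    length( $_[ 0 ] ) - ( ( $_[ 0 ] ^ $_[ 1 ] ) =~ tr[\0][\0] )
}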

and then call hd($str1, $str2) on two strings of equal length. The return value is the number of mismatches, and I believe you can safely allow up to 1 mismatch. If not, I guess someone will correct me.
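To show how such a distance function could be used for the actual assignment, here is a minimal Python sketch. It assumes the barcode sits at the very start of the read, and the name assign_barcode, the barcode list and the reads are all hypothetical. A read is assigned only when exactly one barcode lies within the allowed number of mismatches, so ambiguous reads are dropped rather than guessed.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, barcodes, max_mismatches=1):
    """Return the single barcode matching the read's prefix, else None."""
    hits = []
    for bc in barcodes:
        prefix = read[:len(bc)]
        # Skip reads shorter than the barcode, then compare position by position.
        if len(prefix) == len(bc) and hamming(prefix, bc) <= max_mismatches:
            hits.append(bc)
    # Zero hits means unassigned; more than one hit means ambiguous.
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcodes and reads, for illustration only.
barcodes = ["ATCACG", "CGATGT"]
for read in ["ATCACGTTTTGGGG", "ATCTCGTTTTGGGG", "GGGGGGTTTTGGGG"]:
    print(read, assign_barcode(read, barcodes))
```

After assignment, the barcode would normally be trimmed by slicing the read and its quality string at len(barcode).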

Reply • 13.2 years ago by Arun 2.4k

0

Thank you for the suggestions. I actually just ended up using the fastx barcode splitter with the --mismatches option.

Reply • 13.2 years ago by bbio ▴ 90

score 1 · Answer 2 · 2012-05-15

1

13.2 years ago

Madelaine Gogol 5.3k

There's the fastx barcode splitter (fastx_barcode_splitter.pl in the FASTX-Toolkit).

1

I had tried this before without success, but after your post I built the latest version from source instead of using the precompiled binaries, and that finally worked. Thanks!

Reply • 13.2 years ago by bbio ▴ 90