Demultiplexing fastq files with dual barcodes
7.6 years ago
Thomas Moody ▴ 30

I have HiSeq paired-end run data for an assay that uses our own dual barcodes to identify samples. I need to demultiplex the reads before doing any sample-specific analysis.

I'm having trouble finding a tool for the job. The person who used to do this would basically take single-barcode demultiplexing software and run it multiple times, but that seems a little messy to me.

I've tried to write my own tool, which works but feels pretty slow. I'm just using my own computer, so I have no idea how long this kind of process usually takes, or whether people generally do this on a dedicated machine or cluster.

Any suggestions?

EDIT:

I think I'm being unclear. I have samples that have already been demultiplexed by index on the Illumina machine. When we designed our PCR primers, the actual primers were preceded by a Golay or Hamming barcode. The combination of forward and reverse primer barcodes uniquely identifies a sample name and assay.

So, I need to demultiplex based on the first 12 or 8 nt from the 5' end of each paired-end read.

next-gen-sequencing • 13k views
2
7.6 years ago

If you need to demultiplex based on barcodes within the primers, you can try the Checkout util (see docs). You'll need to specify a tab-delimited table with the sample id, the 5' barcode and the reverse complement of the 3' barcode. As for the running time, I believe I/O speed is the limiting factor here, so there is not much to optimize.
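For anyone preparing such a table by hand, here is a minimal Python sketch of the reverse-complement step. The column order (sample id, 5' barcode, reverse complement of the 3' barcode) is taken from the description above; the function names are made up for illustration, and the exact file format the tool expects should be checked against its docs.

```python
# Hypothetical helpers: build one tab-delimited row per sample.
# Assumes barcodes contain only A, C, G, T or N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def format_row(sample_id, barcode_5p, barcode_3p):
    """Format one row: sample id, 5' barcode, reverse complement of 3' barcode."""
    return "\t".join([sample_id, barcode_5p, reverse_complement(barcode_3p)])


if __name__ == "__main__":
    # Made-up barcodes, for illustration only.
    print(format_row("sample1", "ACGTACGT", "GGATCCAA"))
```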

0
7.6 years ago
Dan D 7.3k

CASAVA and bcl2fastq (both free downloads from Illumina) can handle dual barcodes. If you post the contents of the runParameters.xml file and your sample sheet, I can tell you what command line you should use.

EDIT:

Now that I know your updated requirements, it seems that in order to create a new FASTQ file for each unique barcode in a single script, you would need a lot of simultaneous filehandles open (or lots and lots and lots of RAM), both of which are problems on a desktop-grade machine if your distinct indices are as numerous as you claim.

Given that, I recommend iterating through the FASTQ data and dumping the data into a document/JSON storage database like CouchDB, RethinkDB, or the like. Basically, each unique barcode would have an array of FASTQ reads. Then, iterate through the indexes and dump the contents into separate FASTQ files. That will help you get around the RAM and filehandle hurdles.
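A rough sketch of that two-pass idea in Python follows. Note that it swaps in the standard library's sqlite3 module for a document database such as CouchDB or RethinkDB, purely to keep the example self-contained; the table layout and function names are assumptions for illustration. Only one output file is open at any time.

```python
import os
import sqlite3


def load_reads(conn, records):
    """Pass 1: store (barcode, fastq_record_text) pairs in a single table."""
    conn.execute("CREATE TABLE IF NOT EXISTS reads (barcode TEXT, record TEXT)")
    conn.executemany("INSERT INTO reads VALUES (?, ?)", records)
    # Index after the bulk insert so the per-barcode lookups in pass 2 are fast.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_barcode ON reads (barcode)")
    conn.commit()


def dump_by_barcode(conn, out_dir):
    """Pass 2: write one FASTQ file per barcode, one open file at a time."""
    barcodes = [row[0] for row in conn.execute("SELECT DISTINCT barcode FROM reads")]
    for barcode in barcodes:
        path = os.path.join(out_dir, barcode + ".fastq")
        with open(path, "w") as out:
            query = "SELECT record FROM reads WHERE barcode = ? ORDER BY rowid"
            for (record,) in conn.execute(query, (barcode,)):
                out.write(record)
    return barcodes
```

Pass an on-disk database path to sqlite3.connect() for real data so the reads do not have to fit in RAM. The records argument can be any iterable, including a generator that streams the FASTQ file.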

0

Just to verify, can I use these tools to demultiplex .fastq files based on our own barcodes from our primers? The reads have already been demultiplexed by the machine using the indices.

0

Oh, you're wanting to start with the FASTQ files and split them out based on the index value in the first line of each read, instead of starting from the basecall data?

If you're starting with FASTQ data, can you please post the output of zcat [fastq_file] | head (if the data are gzipped) or just head [fastq_file] (if not)?

0

I want to split them out based on the first 12 nt (or 8 nt) of each read, not on the Illumina indices. Unfortunately, all possible combinations of 12 and 8 are on the table too (i.e. 12 nt on R1 and 12 nt on R2, 12 nt on R1 and 8 nt on R2, and so on).
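To make the mixed-length case concrete, here is a minimal Python sketch of the lookup for one read pair. It assumes exact matches only (no Golay or Hamming error correction is attempted), that the R2 barcode is stored as it appears at the start of R2, and that no 8 nt barcode is also the start of a 12 nt barcode in the same set. All names are hypothetical.

```python
def match_prefix(seq, barcodes, lengths=(12, 8)):
    """Return the barcode found at the 5' end of seq, trying longer lengths first."""
    for n in lengths:
        prefix = seq[:n]
        if prefix in barcodes:
            return prefix
    return None


def assign_pair(r1_seq, r2_seq, fwd_barcodes, rev_barcodes, samples):
    """Return the sample for a read pair, or None if either barcode is unknown.

    fwd_barcodes / rev_barcodes: sets of barcode strings (8 or 12 nt).
    samples: dict mapping (forward barcode, reverse barcode) to a sample name.
    """
    fwd = match_prefix(r1_seq, fwd_barcodes)
    rev = match_prefix(r2_seq, rev_barcodes)
    if fwd is None or rev is None:
        return None
    return samples.get((fwd, rev))
```

Set and dict lookups keep the per-read cost constant regardless of how many barcode combinations are in use.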

0
7.6 years ago

BBMap has a tool for demultiplexing reads based on their barcodes, if you have them all in a fastq file with the barcode at the end of the read name. It's extremely fast.

demuxbyname.sh in=reads.fq out=%.fq suffix names=ACGTAC+GACTTG,ATATAT+CGCGCG


...where % gets replaced by the barcode, and names should list every literal barcode exactly as it appears in the read name, comma-delimited. If you have dual input files you can use in1, in2, out1, and out2. One output file (or two, for dual files) is created per barcode.

0

I've updated my question: I need to demultiplex based on the barcodes from my PCR primers (the first 12 or 8 nt of each read).

0

Oh, that's a bit more tricky. I don't have an easy solution.

0

Any ideas on what a good baseline for this kind of process would be? My Python solution takes about 8 hours on my desktop when run on two ~15 GB paired-end files (30 GB total). It works for now, but it seems like someone could do better.

Thanks for the help either way!