Question

Demultiplexing fastq.gz files

7.8 years ago

gr1 ▴ 40

Hi all,

I was recently given a data set (sequenced on an Illumina instrument in 2014) to merge with a set sequenced this year. The set I just received was not demultiplexed and is dual-barcoded. I know that Illumina's bcl2fastq can demultiplex dual-barcoded runs, but I don't have any of the BaseCall data it normally uses to do so. I only have .fastq.gz files and a mapping file to work with. Does anyone know if I can still use bcl2fastq without the BaseCall data, or if there is something that would work better?

TL;DR: I need to demultiplex dual-barcoded .fastq.gz files.

illumina demultiplexing • 19k views

ADD COMMENT • link updated 7.8 years ago by Istvan Albert 102k • written 7.8 years ago by gr1 ▴ 40

Are the barcodes included in the fastq headers?

    zcat file.fastq.gz | head

Or are they provided as separate files?

ADD REPLY • link 7.8 years ago by h.mon 35k

The barcode is included in the fastq headers:

    @BAMBINO-M01918:72:000000000-A8APG:1:1101:14359:1396 1:N:0:
TNGGGAATCTTCCGCAATGGGCGAAAGCCNNNCNGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
C#>>AABFFFFFGGGGGGGGGGGGGGGHH###B#BB######################################################################################################################################################################################################################

ADD REPLY • link 7.8 years ago by gr1 ▴ 40

Unless you have two separate files that contain the index sequences you may be out of luck. Those files will likely have I1/I2 in their names.

See the Wikipedia FASTQ entry for where the Illumina index sequence normally appears in the fastq header. The index sequence should have followed the final colon of the 1:N:0: field at the end of the header, and in your file that position is empty.
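
As a quick way to check this across a whole file rather than just the first few lines, here is a minimal Python sketch. It assumes Casava 1.8+ style headers, where the comment field after the space is read:filtered:control:index. The function names illumina_index and tally_indexes are made up for illustration and are not part of any tool mentioned in this thread.

```python
import gzip
from collections import Counter
from itertools import islice


def illumina_index(header: str) -> str:
    """Return the index field of a Casava 1.8+ style header, or "" if absent.

    Assumed layout: "@<read name> <read>:<filtered>:<control>:<index>".
    """
    parts = header.rstrip("\n").split(" ", 1)
    if len(parts) < 2:
        return ""  # no comment field at all
    fields = parts[1].split(":")
    return fields[3] if len(fields) >= 4 else ""


def tally_indexes(fastq_gz_path: str, max_reads: int = 10000) -> Counter:
    """Count index strings over the first max_reads records of a .fastq.gz file."""
    counts = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        # Each record spans 4 lines; the header is the first line of each record.
        for header in islice(handle, 0, max_reads * 4, 4):
            counts[illumina_index(header)] += 1
    return counts
```

If the tally contains only the empty string, the headers carry no index sequence and the barcodes have to come from somewhere else, such as separate index read files.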

BTW: This is 16S data from a MiSeq run?

ADD REPLY • link 7.8 years ago by GenoMax 152k

I have four files total for the same run: a read 1, read 2, read 3 and read 4, though I'm not sure that's what you mean.

Also yes, this was a 16S MiSeq.

ADD REPLY • link 7.8 years ago by gr1 ▴ 40

Here is what you likely have: File 1 = Read 1, File 2 = Index 1, File 3 = Index 2, and File 4 = Read 2. Look in all four files to confirm that the sequence lengths match the expected lengths of the reads and index sequences.

Use the extract_barcodes.py script from the QIIME package to process these files if you intend to use QIIME for the rest of your analysis.

If you just need the data demultiplexed, try fastq-multx (FastqMultx) from ea-utils.
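
To make the four-file layout concrete, here is a rough Python sketch of what demultiplexing from separate index files involves. It is not how extract_barcodes.py or fastq-multx work internally. It assumes that the four files list reads in the same order, that read names match up to the first space, that barcodes must match exactly, and that the mapping file has already been turned into a dictionary. All function names and the output file naming are made up for illustration. Whether index 2 has to be reverse-complemented depends on how the mapping file was written, so check a few reads by hand first.

```python
import gzip
import os


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from an open FASTQ text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def read_id(header):
    """Read name without the comment field, used to check the files stay in sync."""
    return header.split(" ", 1)[0]


def demultiplex(r1_path, i1_path, i2_path, r2_path, samples, out_dir):
    """Split R1/R2 by the (index 1, index 2) pair and return reads per sample.

    `samples` maps (i1_sequence, i2_sequence) -> sample name. Pairs whose
    indexes are not in the mapping are written to "unassigned".
    Note: zip() stops at the shortest file, so truncated inputs go unnoticed.
    """
    os.makedirs(out_dir, exist_ok=True)
    outputs, counts = {}, {}
    inputs = [gzip.open(p, "rt") for p in (r1_path, i1_path, i2_path, r2_path)]
    try:
        for r1, i1, i2, r2 in zip(*(read_fastq(h) for h in inputs)):
            ids = {read_id(rec[0]) for rec in (r1, i1, i2, r2)}
            if len(ids) != 1:
                raise ValueError(f"files out of sync at {r1[0]}")
            sample = samples.get((i1[1], i2[1]), "unassigned")
            if sample not in outputs:
                # One gzip-compressed output per mate for each sample.
                outputs[sample] = tuple(
                    gzip.open(os.path.join(out_dir, f"{sample}_R{n}.fastq.gz"), "wt")
                    for n in (1, 2)
                )
            for out, (header, seq, qual) in zip(outputs[sample], (r1, r2)):
                out.write(f"{header}\n{seq}\n+\n{qual}\n")
            counts[sample] = counts.get(sample, 0) + 1
    finally:
        for handle in inputs + [h for pair in outputs.values() for h in pair]:
            handle.close()
    return counts
```

The returned counts are worth a look before going further: a very large "unassigned" count usually means the barcode orientation or the file-to-read assignment is wrong. A dedicated tool is still the better choice for real data, since it can tolerate mismatches in the barcodes.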

ADD REPLY • link 7.8 years ago by GenoMax 152k

Awesome! I was intending to use QIIME for the rest of my analysis. So can I try FastqMultx first and then extract the barcodes with the QIIME Python script?

ADD REPLY • link 7.8 years ago by gr1 ▴ 40

score 2 · Accepted Answer · 2017-09-07

7.8 years ago

GenoMax 152k

You can't use bcl2fastq to demultiplex standalone fastq.gz files. Take a look at this thread (Demultiplexing reads with index present in the labels) for ideas on how to demultiplex your dataset.

Edit: If you have in-line index sequences, a different solution will be needed. If you have standard Illumina data files (post the output of @h.mon's command), demuxbyname.sh from BBMap will work, as described in the thread linked above and in the original solution referenced there.

ADD COMMENT • link 7.8 years ago by GenoMax 152k

I believe I have standard Illumina data files (I replied to @h.mon above with the first part of the head output), so I'll have a look at BBMap. Thanks!

Edit: My head output looks similar to the one in the "Demultiplexing reads with index present in the labels" thread you linked, with both of our barcodes in our indexes. So I'm guessing I'm free to use the BBMap script?

ADD REPLY • link 7.8 years ago by gr1 ▴ 40

Also, it's unclear whether the poster of "Demultiplexing reads with index present in the labels" has dual-barcoded samples. If I can use the BBMap script, I'm guessing the dual barcodes are something I would need to deal with downstream?

ADD REPLY • link 7.8 years ago by gr1 ▴ 40

See my comment above.

ADD REPLY • link 7.8 years ago by GenoMax 152k

score 1 · Accepted Answer · 2017-09-07

7.8 years ago

Istvan Albert 102k

You can't use bcl2fastq for this.

The bad news is that there might not be a tool for this, as it is a task usually handled at the instrument level, so there is less need to do it afterwards.

You would probably need to use a tool designed for cutting adapters to filter the reads:

http://cutadapt.readthedocs.io/en/stable/guide.html#filtering-reads

If this is single-end sequencing, that might be sufficient. If the reads are paired-end, you'd probably need to process both files of each pair together to keep them in sync.
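
To illustrate what keeping the pairs in sync means, here is a small Python sketch. It is not cutadapt's API. It assumes an inline barcode at the very start of read 1 and records already parsed into (name, sequence, quality) tuples in the same order for both mates. The function name filter_pairs is made up for illustration.

```python
def filter_pairs(r1_records, r2_records, barcode):
    """Yield (r1, r2) pairs whose read 1 sequence starts with `barcode`.

    Records are (name, sequence, quality) tuples. The decision is made on
    read 1 only, but both mates are kept or dropped together so the two
    outputs never drift out of step. The barcode is trimmed from read 1.
    """
    n = len(barcode)
    for r1, r2 in zip(r1_records, r2_records):
        if r1[0] != r2[0]:
            raise ValueError(f"mates out of sync: {r1[0]} vs {r2[0]}")
        if r1[1].startswith(barcode):
            trimmed = (r1[0], r1[1][n:], r1[2][n:])
            yield trimmed, r2


# Example with two hypothetical read pairs; only pair "a" carries the barcode.
reads_1 = [("a", "ACGTTTT", "IIIIIII"), ("b", "GGGGTTT", "IIIIIII")]
reads_2 = [("a", "CCCC", "IIII"), ("b", "AAAA", "IIII")]
kept = list(filter_pairs(reads_1, reads_2, "ACG"))
```

Filtering each file on its own would instead drop different reads from each, and downstream tools that expect the nth record of both files to be mates would then pair the wrong reads.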

ADD COMMENT • link 7.8 years ago by Istvan Albert 102k

Ah, well it makes sense then that it's been so hard to remedy the situation. It looks like I'll need to use that cutadapt guide you included and maybe do some stitching. Thank you!

ADD REPLY • link 7.8 years ago by gr1 ▴ 40