Question: Demultiplexing fastq.gz files
10 months ago by
gr110
gr110 wrote:

Hi all,

I was recently given a data set (run on an Illumina instrument in 2014) to merge with a set sequenced this year. The set I just received is dual barcoded and has not been demultiplexed. I know that Illumina's bcl2fastq can demultiplex dual-barcoded sets, but I don't have any of the BaseCall data it typically uses to do so. I only have .fastq.gz files and a mapping file to work with. Does anyone know if I can still use bcl2fastq without the BaseCall data, or if there is something that would work better?

TL;DR: I need to demultiplex dual-barcoded .fastq.gz files.

demultiplexing illumina • 2.4k views
modified 10 months ago by Istvan Albert ♦♦ 77k • written 10 months ago by gr110

Is the barcode included in fastq headers?

zcat file.fastq.gz | head

Or are they provided as separated files?

written 10 months ago by h.mon16k

The barcode is included in the fastq headers:

    @BAMBINO-M01918:72:000000000-A8APG:1:1101:14359:1396 1:N:0:
TNGGGAATCTTCCGCAATGGGCGAAAGCCNNNCNGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
C#>>AABFFFFFGGGGGGGGGGGGGGGHH###B#BB######################################################################################################################################################################################################################
modified 10 months ago • written 10 months ago by gr110

Unless you have two separate files that contain the index sequences you may be out of luck. Those files will likely have I1/I2 in their names.

See the Wikipedia FASTQ entry for where the Illumina index sequence should appear in the fastq header. The index sequence should have been present at the end of this part of the header: 1:N:0:
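
To check this programmatically, here is a minimal Python sketch that pulls the index field out of a header. It assumes the four-field comment layout described in that FASTQ entry (read number, filter flag, control number, index). The function name and the ACGTACGT index in the second example are made up for illustration.

```python
def parse_illumina_header(header_line):
    """Split a FASTQ header into (read_id, index_field).

    Assumes the comment after the first space has the form
    <read>:<is_filtered>:<control_number>:<index>.
    """
    text = header_line.rstrip("\n")
    if not text.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    read_id, _, comment = text[1:].partition(" ")
    fields = comment.split(":")
    # The index is the 4th field; it is an empty string when nothing was recorded.
    index = fields[3] if len(fields) > 3 else ""
    return read_id, index


# Header posted in this thread: the index field is empty.
thread_header = "@BAMBINO-M01918:72:000000000-A8APG:1:1101:14359:1396 1:N:0:"
# Hypothetical header with an index present (ACGTACGT is invented).
example_header = "@BAMBINO-M01918:72:000000000-A8APG:1:1101:14359:1396 1:N:0:ACGTACGT"

print(parse_illumina_header(thread_header))
print(parse_illumina_header(example_header))
```

An empty second value, as for the header posted above, means the barcode has to come from somewhere else, such as separate index files.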

BTW: This is 16S data from a MiSeq run?

modified 10 months ago • written 10 months ago by genomax52k

I have four files total for the same run: a read 1, read 2, read 3 and read 4, though I'm not sure that's what you mean.

Also yes, this was a 16S MiSeq.

written 10 months ago by gr110

So here is what you likely have. File 2 = Index 1 and File 3 = index 2. Look in all files to make sure the reads match the expected length of the read/index sequences. File 1 = Read 1 and File 4 = Read 2.
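
One way to do that length check is a small Python sketch like the one below. It tallies sequence lengths over the first records of each gzipped fastq file, so short, uniform lengths would point to index files and longer ones to read files. The function name and the 1000-record sample size are arbitrary choices, not part of any tool.

```python
import gzip
import sys
from collections import Counter
from itertools import islice


def read_length_counts(path, max_records=1000):
    """Count sequence lengths in the first max_records records of a gzipped FASTQ."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        # Each record is 4 lines; the sequence is the 2nd line of the record.
        for line_number, line in enumerate(islice(handle, max_records * 4)):
            if line_number % 4 == 1:
                counts[len(line.strip())] += 1
    return counts


if __name__ == "__main__":
    # Pass the four fastq.gz files as command-line arguments.
    for fastq_path in sys.argv[1:]:
        print(fastq_path, dict(read_length_counts(fastq_path)))
```

This assumes plain four-line records (no wrapped sequences), which is what Illumina output normally looks like.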

Use the extract_barcodes.py script from the QIIME package to process these files if you intend to use QIIME for the rest of the analysis.

If you just need the data demultiplexed then try FastqMultx.
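
For a sense of what demultiplexing from separate index files involves, here is a rough Python sketch. It is not how FastqMultx or QIIME work internally. It assumes the four files list the reads in the same order, matches barcodes exactly (no mismatches allowed), and takes a hypothetical `barcode_to_sample` dict built from the mapping file. Depending on how the mapping file was written, one of the indexes may need to be reverse-complemented first; that is not handled here.

```python
import gzip
from itertools import islice


def fastq_records(handle):
    """Yield (header, sequence, plus, quality) tuples from an open FASTQ text handle."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def assign_samples(read1, index1, index2, read2, barcode_to_sample):
    """Yield (sample, r1_record, r2_record); sample is None for unknown barcode pairs."""
    for r1, i1, i2, r2 in zip(read1, index1, index2, read2):
        # All four records must describe the same cluster (same read ID).
        ids = {rec[0].split()[0] for rec in (r1, i1, i2, r2)}
        if len(ids) != 1:
            raise ValueError(f"files out of sync near {r1[0]}")
        yield barcode_to_sample.get((i1[1], i2[1])), r1, r2


def demultiplex(paths, barcode_to_sample, out_prefix):
    """paths: (read1, index1, index2, read2) gzipped FASTQ files, in that order."""
    handles = [gzip.open(path, "rt") for path in paths]
    outputs = {}  # sample name -> (R1 writer, R2 writer), opened on first use
    try:
        streams = [fastq_records(handle) for handle in handles]
        for sample, r1, r2 in assign_samples(*streams, barcode_to_sample):
            name = sample or "unassigned"
            if name not in outputs:
                outputs[name] = (
                    gzip.open(f"{out_prefix}{name}_R1.fastq.gz", "wt"),
                    gzip.open(f"{out_prefix}{name}_R2.fastq.gz", "wt"),
                )
            for out, rec in zip(outputs[name], (r1, r2)):
                out.write("\n".join(rec) + "\n")
    finally:
        for handle in handles:
            handle.close()
        for pair in outputs.values():
            for out in pair:
                out.close()
```

A call would look like `demultiplex(("f1.fastq.gz", "f2.fastq.gz", "f3.fastq.gz", "f4.fastq.gz"), {("AAAA", "GGGG"): "sample1"}, "demux_")`, where the file names, barcodes, and sample name are placeholders. The sketch keeps one pair of output files open per sample, so it is only practical for a modest number of samples.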

modified 10 months ago • written 10 months ago by genomax52k

Awesome! I was intending to use QIIME for the rest of my analysis, so can I try FastqMultx and then extract the barcodes with the QIIME Python script?

written 10 months ago by gr110
10 months ago by
genomax52k
United States
genomax52k wrote:

You can't use bcl2fastq to demux standalone fastq.gz files. Take a look at this thread (Demultiplexing reads with index present in the labels) for ideas on how to demultiplex your dataset.

Edit: If you have in-line index sequences then a different solution would be needed. If you have standard Illumina datafiles (post what you get from @h.mon's command) then demuxbyname.sh from BBMap will work as indicated in the thread in my answer above and the original solution referenced there.

modified 10 months ago • written 10 months ago by genomax52k

I believe I have standard Illumina datafiles (I replied to @h.mon above with the first section of the head command), so I'll have a look at the BBMap. Thanks!

Edit: So my head command looks similar to the "Demultiplexing reads with index present in the labels" link you sent me, with both of our barcodes in our indexes. So then, I'm guessing I'm free to use the BBMap script?

modified 10 months ago • written 10 months ago by gr110

Also, it's unclear whether the poster of "Demultiplexing reads with index present in the labels" has dual-barcoded samples. I'm guessing that if I can use the BBMap script, that would be something I would need to deal with downstream?

written 10 months ago by gr110

See my comment above.

written 10 months ago by genomax52k
10 months ago by
Istvan Albert ♦♦ 77k
University Park, USA
Istvan Albert ♦♦ 77k wrote:

You can't use bcl2fastq for this.

The bad news is that there might not be a tool to do this, as it is a task that is usually handled at the instrument level, so there is less of a need for one.

You would probably need to use a tool designed for cutting adapters to filter the reads:

http://cutadapt.readthedocs.io/en/stable/guide.html#filtering-reads

If it is single-end sequencing, then this might be sufficient. If the reads are paired-end, you'd probably need to process/reconcile both files to keep the pairs in sync.
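
As a rough illustration of the "keep them in sync" step, the Python sketch below keeps only the records whose read ID survived filtering in both files. The function names are made up, records are assumed to be (header, sequence, plus, quality) tuples, and read IDs are assumed to be the header text up to the first space. Older headers that end in /1 and /2 would need those suffixes stripped first.

```python
def read_id(header):
    """Return the read identifier: the header up to the first space, without '@'."""
    return header[1:].split(" ", 1)[0]


def keep_synced_pairs(read1_records, read2_records):
    """Return (kept_r1, kept_r2): records present in both inputs, in read 1 order."""
    # Holds all read 2 records in memory, so this only suits files that fit in RAM.
    r2_by_id = {read_id(rec[0]): rec for rec in read2_records}
    kept_r1, kept_r2 = [], []
    for rec in read1_records:
        mate = r2_by_id.get(read_id(rec[0]))
        if mate is not None:
            kept_r1.append(rec)
            kept_r2.append(mate)
    return kept_r1, kept_r2


# Toy example: read "b" was filtered out of read 1, read "c" out of read 2.
r1 = [("@a 1:N:0:", "ACGT", "+", "IIII"), ("@c 1:N:0:", "GGGG", "+", "IIII")]
r2 = [("@a 2:N:0:", "TTTT", "+", "IIII"), ("@b 2:N:0:", "CCCC", "+", "IIII")]
synced_r1, synced_r2 = keep_synced_pairs(r1, r2)
print([read_id(rec[0]) for rec in synced_r1])
print([read_id(rec[0]) for rec in synced_r2])
```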

modified 10 months ago • written 10 months ago by Istvan Albert ♦♦ 77k

Ah, well it makes sense then that it's been so hard to remedy the situation. It looks like I'll need to use that cutadapt guide you included and maybe do some stitching. Thank you!

written 10 months ago by gr110