Question: Demultiplexing fastq.gz files
3 months ago, gr1 wrote:

Hi all,

I was recently given a data set (run on an Illumina instrument in 2014) to merge with a set run this year. The set I just received is dual barcoded and was not demultiplexed. I know that Illumina's bcl2fastq can demultiplex dual-barcoded sets, but I don't have any of the BaseCall data it normally works from. I only have .fastq.gz files and a mapping file. Does anyone know whether I can still use bcl2fastq without the BaseCall data, or whether something else would work better?

TL;DR: I need to demultiplex dual-barcoded .fastq.gz files.

demultiplexing illumina • 422 views
modified 3 months ago by Istvan Albert • written 3 months ago by gr1

Is the barcode included in fastq headers?

    zcat file.fastq.gz | head

Or are the barcodes provided as separate files?

written 3 months ago by h.mon

The barcode is included in the fastq headers:

    @BAMBINO-M01918:72:000000000-A8APG:1:1101:14359:1396 1:N:0:
TNGGGAATCTTCCGCAATGGGCGAAAGCCNNNCNGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
C#>>AABFFFFFGGGGGGGGGGGGGGGHH###B#BB######################################################################################################################################################################################################################
modified 3 months ago • written 3 months ago by gr1

Unless you have two separate files that contain the index sequences, you may be out of luck. Those files will likely have I1/I2 in their names.

See the Wikipedia FASTQ entry for where the Illumina index sequence normally appears in the header. It should come right after the final colon of the 1:N:0: field at the end of the header line, and in your read that position is empty.
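
As an illustration (not from the thread), here is a minimal Python sketch that pulls the index field out of a header line. It assumes the CASAVA 1.8+ header layout, where the part after the space is read:is_filtered:control_number:index; the function name is made up for this example.

```python
def illumina_index_field(header_line):
    """Return the index field of a CASAVA 1.8+ style header, or "" if it is empty or absent."""
    parts = header_line.rstrip("\n").split(" ", 1)
    if len(parts) < 2:
        # No comment section after the read name, so no index field
        return ""
    # Assumed comment layout: <read>:<is_filtered>:<control_number>:<index>
    fields = parts[1].split(":")
    return fields[3] if len(fields) >= 4 else ""


# Header copied from the record posted above
header = "@BAMBINO-M01918:72:000000000-A8APG:1:1101:14359:1396 1:N:0:"
print(repr(illumina_index_field(header)))  # prints ''
```

An empty string here means the index sequences are not in the headers and have to come from separate index files.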

BTW: Is this 16S data from a MiSeq run?

modified 3 months ago • written 3 months ago by genomax

I have four files total for the same run: a read 1, read 2, read 3 and read 4, though I'm not sure that's what you mean.

Also, yes, this was a 16S MiSeq run.

written 3 months ago by gr1

So here is what you likely have: File 1 = Read 1, File 2 = Index 1, File 3 = Index 2 and File 4 = Read 2. Look in all four files to make sure the read lengths match what you expect for the read and index sequences.
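
One way to do that length check is sketched below in Python. This is only an illustration: it assumes plain four-line FASTQ records (no wrapped sequence lines), and the function name and the file names in the usage comment are hypothetical.

```python
import gzip
from collections import Counter
from itertools import islice


def read_length_counts(path, n_records=1000):
    """Count sequence lengths in the first n_records of a gzipped FASTQ file."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        # In 4-line FASTQ, the sequence is the 2nd line of every record
        for seq in islice(handle, 1, n_records * 4, 4):
            counts[len(seq.rstrip("\n"))] += 1
    return counts


# Example usage (hypothetical file names):
# for name in ["read1.fastq.gz", "read2.fastq.gz", "read3.fastq.gz", "read4.fastq.gz"]:
#     print(name, read_length_counts(name).most_common(3))
```

Index files should show one short, uniform length (typically the barcode length in your mapping file), while the read files should show the much longer read length.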

Use the extract_barcodes.py script from the QIIME package to process these files if you intend to use QIIME for the rest of your analysis.

If you just need the data demultiplexed, then try fastq-multx.
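
To make the idea concrete, here is a rough Python sketch of dual-index demultiplexing from four files. It is not how fastq-multx or QIIME work internally. It assumes four-line FASTQ records in the same order in all files, exact barcode matches only (no mismatches, no reverse-complementing), and a hypothetical dict mapping (index1, index2) pairs to sample names that you would build from your mapping file.

```python
import gzip
import os
from itertools import islice


def fastq_records(handle):
    """Yield 4-line FASTQ records as lists of lines."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def demultiplex(r1_path, i1_path, i2_path, r2_path, barcode_to_sample, out_dir):
    """Split R1/R2 by (index1, index2) pair; returns read-pair counts per sample."""
    os.makedirs(out_dir, exist_ok=True)
    outputs = {}  # sample name -> (R1 writer, R2 writer)
    counts = {}
    try:
        with gzip.open(r1_path, "rt") as r1, gzip.open(i1_path, "rt") as i1, \
                gzip.open(i2_path, "rt") as i2, gzip.open(r2_path, "rt") as r2:
            streams = zip(fastq_records(r1), fastq_records(i1),
                          fastq_records(i2), fastq_records(r2))
            for rec1, idx1, idx2, rec2 in streams:
                # Read names must agree, otherwise the files are out of sync
                names = {rec[0].split()[0] for rec in (rec1, idx1, idx2, rec2)}
                if len(names) != 1:
                    raise ValueError(f"files out of sync at {rec1[0].strip()}")
                key = (idx1[1].strip(), idx2[1].strip())
                sample = barcode_to_sample.get(key, "unmatched")
                if sample not in outputs:
                    outputs[sample] = tuple(
                        gzip.open(os.path.join(out_dir, f"{sample}_{tag}.fastq.gz"), "wt")
                        for tag in ("R1", "R2")
                    )
                outputs[sample][0].writelines(rec1)
                outputs[sample][1].writelines(rec2)
                counts[sample] = counts.get(sample, 0) + 1
    finally:
        for out1, out2 in outputs.values():
            out1.close()
            out2.close()
    return counts


# Example usage (hypothetical file names and barcodes):
# counts = demultiplex("read1.fastq.gz", "read2.fastq.gz", "read3.fastq.gz",
#                      "read4.fastq.gz", {("ACGTACGT", "TTGACCAA"): "sample1"}, "demux_out")
```

A large "unmatched" count usually means the barcodes in the mapping file need to be reverse-complemented, or that the two index files are swapped. With many samples this keeps two output files open per sample, so a dedicated tool is the better choice for real runs.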

modified 3 months ago • written 3 months ago by genomax

Awesome! I was intending to use QIIME for the rest of my analysis, so can I try fastq-multx and then extract the barcodes with the QIIME Python script?

written 3 months ago by gr1
3 months ago, genomax wrote:

You can't use bcl2fastq to demultiplex standalone fastq.gz files. Take a look at this thread (Demultiplexing reads with index present in the labels) for ideas on how to demultiplex your dataset.

Edit: If you have in-line index sequences, then a different solution would be needed. If you have standard Illumina data files (post what you get from @h.mon's command), then demuxbyname.sh from BBMap will work, as indicated in the thread linked above and the original solution referenced there.

modified 3 months ago • written 3 months ago by genomax

I believe I have standard Illumina data files (I replied to @h.mon above with the first record from the head command), so I'll have a look at BBMap. Thanks!

Edit: So my head command looks similar to the "Demultiplexing reads with index present in the labels" link you sent me, with both of our barcodes in our indexes. So then, I'm guessing I'm free to use the BBMap script?

modified 3 months ago • written 3 months ago by gr1

Also, it's unclear whether the poster of "Demultiplexing reads with index present in the labels" has dual-barcoded samples. I'm guessing that if I can use the BBMap script, the dual barcoding is something I would need to deal with downstream?

written 3 months ago by gr1

See my comment above.

written 3 months ago by genomax
3 months ago, Istvan Albert wrote:

You can't use bcl2fastq for this.

The bad news is that there might not be a tool built for this, since demultiplexing is usually handled at the instrument level and there is less need to do it afterwards.

You would probably need to use a tool designed for cutting adapters to filter the reads:

http://cutadapt.readthedocs.io/en/stable/guide.html#filtering-reads

If this is single-end sequencing, then that might be sufficient. If the reads are paired-end, you'd probably need to process both files of each pair together to keep them in sync.

modified 3 months ago • written 3 months ago by Istvan Albert

Ah, well it makes sense then that it's been so hard to remedy the situation. It looks like I'll need to use that cutadapt guide you included and maybe do some stitching. Thank you!

written 3 months ago by gr1