demultiplex a dataset when you have barcodes as a separate fastq
3
3
4.4 years ago
IP ▴ 720

Hi Biostars:

I have received raw sequencing data from a collaborator, and the data are not demultiplexed. In the fastq files that I usually have to analyse and demultiplex, each read looks like this:

Barcode + sequence

Then one can use software such as barcode_splitter, or demultiplex.py from the FourCseq package, to demultiplex the samples.

However, now I have three fastq files, example:

@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 1:N:0:
NTCCTTAAACCTCTGGTAGAATTTGGCTGTGAATCCATCTGGTCCTGGACTCTTTTTGGTTGGTAAGCTATTGAT
+
#<DDDHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHHII


@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 3:N:0:
AATAGACGCAATAAAAAATGATAAAGGGGAAATCACCACCAATCCCACAGAAATACAAACTACCATCAGAGAATA
+
DDDDDIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII


And a last file with the barcode associated with the above read pair. Note that the read identifier is the same in all three files; only the read number (1, 2 or 3) differs.

@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 2:N:0:
GAGTGGAT
+
DCDDDIH<


Of course, I have a file with the barcode associated with each sample:

SAMPLE    INDEX     INDEX2
sample_6  GAGTGG    NA


I have tried to find software that demultiplexes fastq files when the data are in this format (left_read.fastq, right_read.fastq and barcodes.fastq), but I have not been able to find anything. I feel that I could solve this with python using pysam, but since my collaborator is not a bioinformatician, I guess that there must be an existing tool for handling this.

So, long story short: is there a tool for demultiplexing datasets that are in the format left_reads.fastq, right_reads.fastq, barcodes.fastq?
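To make the "solve it with python" idea concrete, here is a minimal sketch that uses only the standard library instead of pysam. It assumes uncompressed, strict four-line fastq records with no blank lines, the same read order in all three files, and exact matching of the sample barcode against the start of the index read (so GAGTGG matches GAGTGGAT). The function names, the `sample_barcodes` dictionary and the output file names are made up for illustration.

```python
import os


def read_fastq(handle):
    """Yield (header, sequence, quality) from a strict 4-line-per-record fastq."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def write_fastq(handle, record):
    """Write one (header, sequence, quality) record in 4-line fastq layout."""
    header, seq, qual = record
    handle.write(f"{header}\n{seq}\n+\n{qual}\n")


def demultiplex(left_path, right_path, barcode_path, sample_barcodes, out_dir):
    """Split a read pair into per-sample files using a separate barcode fastq.

    sample_barcodes maps barcode sequence -> sample name (hypothetical layout).
    Reads whose index read starts with no known barcode go to 'unassigned'.
    Returns a dict of sample name -> number of read pairs written.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    counts = {}
    try:
        with open(left_path) as left, open(right_path) as right, \
                open(barcode_path) as bc:
            # zip stops at the shortest file, so truncated inputs are not detected
            for r1, r2, idx in zip(read_fastq(left), read_fastq(right), read_fastq(bc)):
                # The read identifier (text before the first space) must agree
                read_ids = {rec[0].split()[0] for rec in (r1, r2, idx)}
                if len(read_ids) != 1:
                    raise ValueError(f"files out of sync at {r1[0]}")
                # Exact prefix match, no mismatches tolerated
                sample = next(
                    (name for barcode, name in sample_barcodes.items()
                     if idx[1].startswith(barcode)),
                    "unassigned",
                )
                if sample not in handles:
                    handles[sample] = (
                        open(os.path.join(out_dir, f"{sample}_R1.fastq"), "w"),
                        open(os.path.join(out_dir, f"{sample}_R2.fastq"), "w"),
                    )
                write_fastq(handles[sample][0], r1)
                write_fastq(handles[sample][1], r2)
                counts[sample] = counts.get(sample, 0) + 1
    finally:
        for out1, out2 in handles.values():
            out1.close()
            out2.close()
    return counts
```

With the sample sheet above, a call could look like `demultiplex("left_reads.fastq", "right_reads.fastq", "barcodes.fastq", {"GAGTGG": "sample_6"}, "demux_out")`. This is only a sketch: it does not handle gzipped input, dual indexes, or barcode mismatches.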

demultiplex next-gen sequencing • 9.0k views
3

Ask them to have whoever did the sequencing demultiplex the files. The three files you're getting are the output of the demultiplexing software, but whoever ran it explicitly requested that output, since the default would be to demultiplex everything into separate files (i.e., what you and everyone else in the world actually wants). Don't waste time on this, have the person who produced the files do so correctly.

0

If that is the answer, I assume that they have done something wrong. This is not a standard format for providing the data, right?

1

There have been variations of the Qiime (metagenomics) pipeline over the years in which the barcode was expected to be in a separate file (which is what you have). The Qiime package may have a utility program to demultiplex this data. Take a look there.

The provider has not done "something wrong" (especially if this was what was requested), but they can easily fix this (provided this is not an old dataset) and give you properly demultiplexed files.

0

Correct, they specified the --create-fastq-for-index-reads option and apparently didn't use a sample sheet. They just need to drop that option and use a sample sheet. Simply email those two sentences to them.

1
4.4 years ago
Charles Plessy ★ 2.8k

If you do not find a program for demultiplexing three files at a time, perhaps you can append the barcodes at the beginning of the "left" reads, and then run a paired-end demultiplexer such as TagDust 2?

For an example on how to run TagDust 2, you can look at my tutorial on GitHub.

For how to paste the barcodes, maybe you can follow the example below:

$ cat toto.fq
@toto1
AAAA
+
HHHH
@toto2
AAAA
+
HHHH
$ perl -nE '++$i % 2 == 0 ? print : say ""' toto.fq | paste -d '' - toto.fq
@toto1
AAAAAAAA
+
HHHHHHHH
@toto2
AAAAAAAA
+
HHHHHHHH
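For anyone who would rather not rely on the perl and paste one-liner, the same pasting step can be sketched in Python. This is a rough equivalent under the same assumptions: both files are uncompressed, hold the same reads in the same order, and use strict four-line records. The function name and the file names in the usage note are hypothetical.

```python
def prepend_barcodes(barcode_path, left_path, out_path):
    """Glue each barcode (sequence and qualities) to the front of its left read."""
    with open(barcode_path) as bc, open(left_path) as left, \
            open(out_path, "w") as out:
        for line_no, (bc_line, read_line) in enumerate(zip(bc, left)):
            if line_no % 4 in (1, 3):
                # Sequence and quality lines: barcode first, then the read
                out.write(bc_line.rstrip("\n") + read_line)
            else:
                # Header and '+' lines are copied from the read file unchanged
                out.write(read_line)
```

Calling `prepend_barcodes("barcodes.fastq", "left_reads.fastq", "left_with_barcodes.fastq")` would produce a left-read file that starts with the barcode, which can then be passed together with the right reads to a paired-end demultiplexer.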

1
4.4 years ago
lelle ▴ 830

I agree with Devon Ryan that it is probably easiest to get the data in the format you want from your sequencing provider. If that is not possible, you can use Flexbar, which supports separate barcode reads.

0

I only looked cursorily at the Flexbar page. Are you sure it can handle the situation here (where the barcode reads are in a separate file)? It does not seem to be the case from my quick look.

0

Yes, with the -br option. I am not sure if it works when you have two barcode read files.

1
3.6 years ago
GenoMax 110k

The answer "A: Demultiplexing Illumina data" has a solution for this. I am posting it here to create a cross-reference.