Question: demultiplex a dataset when you have barcodes as a separate fastq
Asked 19 months ago by IP (Denmark/University of Copenhagen)

Hi Biostars:

I have received raw sequencing data from a collaborator, and the data is not demultiplexed. The fastq files that I usually have to analyse and demultiplex look like this:

Barcode + sequence

Then one can use software such as barcode_splitter or demultiplex.py from the FourCseq package to demultiplex the samples.

However, now I have three fastq files, example:

One for the left reads:

@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 1:N:0:
NTCCTTAAACCTCTGGTAGAATTTGGCTGTGAATCCATCTGGTCCTGGACTCTTTTTGGTTGGTAAGCTATTGAT
+
#<DDDHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHHII

One for the right reads:

@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 3:N:0:
AATAGACGCAATAAAAAATGATAAAGGGGAAATCACCACCAATCCCACAGAAATACAAACTACCATCAGAGAATA
+
DDDDDIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

And a last file with the barcode associated with the above read pair. Note that the read name in the header is the same for the three entries; only the read number (1, 2 or 3) differs.

@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 2:N:0:
GAGTGGAT
+
DCDDDIH<

Of course, I also have a file with the barcode associated with each sample:

SAMPLE    INDEX     INDEX2
sample_6  GAGTGG    NA

I have tried to look for software to demultiplex a fastq file when the data is in this format (left_read.fastq, right_read.fastq and barcodes.fastq), but I have not been able to find anything. I feel that I could solve this with Python using pysam, but since my collaborator is not a bioinformatician, I guess that there must be a tool for handling this.

So, long story short: is there a tool for demultiplexing datasets that are in the format left_reads.fastq, right_reads.fastq, barcodes.fastq?
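
A minimal sketch of the do-it-yourself route mentioned above, in plain Python rather than pysam. It assumes the three files list the reads in the same order, that a sample's index (GAGTGG) is a prefix of the longer barcode read (GAGTGGAT), and that exact matching with no mismatches is good enough. The function name `demultiplex`, the `sample_index` dictionary and the `_R1`/`_R2` output file names are hypothetical choices for illustration, not part of any existing tool.

```python
from contextlib import ExitStack


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from an open FASTQ file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def read_id(header):
    """Return the read name without the trailing '1:N:0:' style comment."""
    return header.split()[0]


def demultiplex(left_path, right_path, barcode_path, sample_index, out_prefix=""):
    """Split read pairs into per-sample files using a separate barcode FASTQ.

    sample_index maps sample name -> index sequence (hypothetical layout).
    A barcode read is assigned to a sample if it starts with that index.
    Returns a dict of sample name -> number of read pairs written.
    """
    counts = {name: 0 for name in sample_index}
    counts["unassigned"] = 0
    with ExitStack() as stack:
        left = stack.enter_context(open(left_path))
        right = stack.enter_context(open(right_path))
        barcodes = stack.enter_context(open(barcode_path))
        # One pair of output files per sample, plus one for unmatched reads.
        outputs = {
            name: (
                stack.enter_context(open(f"{out_prefix}{name}_R1.fastq", "w")),
                stack.enter_context(open(f"{out_prefix}{name}_R2.fastq", "w")),
            )
            for name in counts
        }
        records = zip(read_fastq(left), read_fastq(right), read_fastq(barcodes))
        for l_rec, r_rec, b_rec in records:
            # All three files must describe the same read at the same position.
            if not (read_id(l_rec[0]) == read_id(r_rec[0]) == read_id(b_rec[0])):
                raise ValueError(f"files out of sync at {l_rec[0]}")
            sample = next(
                (name for name, index in sample_index.items()
                 if b_rec[1].startswith(index)),
                "unassigned",
            )
            out_left, out_right = outputs[sample]
            out_left.write(f"{l_rec[0]}\n{l_rec[1]}\n+\n{l_rec[2]}\n")
            out_right.write(f"{r_rec[0]}\n{r_rec[1]}\n+\n{r_rec[2]}\n")
            counts[sample] += 1
    return counts
```

With the files from this post, a call such as `demultiplex("left_reads.fastq", "right_reads.fastq", "barcodes.fastq", {"sample_6": "GAGTGG"})` would write `sample_6_R1.fastq`, `sample_6_R2.fastq` and a pair of `unassigned` files, and return the number of read pairs in each.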

best, and thanks for reading

Modified 9 months ago by genomax • written 19 months ago by IP

Ask them to have whoever did the sequencing demultiplex the files. The three files you're getting are the output of the demultiplexing software, but whoever ran it explicitly requested that output, since the default would be to demultiplex everything into separate files (i.e., what you and everyone else in the world actually wants). Don't waste time on this; have the person who produced the files do so correctly.

Reply written 19 months ago by Devon Ryan

If that is the answer, I assume that they have done something wrong. This is not a standard format for providing the data, right?

Whatever your answer is, thanks for replying.

Reply written 19 months ago by IP (modified 19 months ago)

Over the years there have been versions of the QIIME (metagenomics) pipeline that expected the barcodes in a separate file (which is what you have). The QIIME package may have a utility program to demultiplex this data. Take a look there.

The provider has not done "something wrong" (especially if this was what was requested), but they can easily fix this (provided this is not an old dataset) and give you properly demultiplexed files.

Reply written 19 months ago by genomax (modified 19 months ago)

Correct, they specified the --create-fastq-for-index-reads option and apparently didn't use a sample sheet. They just need to leave that option out and use a sample sheet. Simply email those two sentences to them.

Reply written 19 months ago by Devon Ryan

Answer written 18 months ago by Charles Plessy (Japan):

If you do not find a program for demultiplexing three files at a time, perhaps you can append the barcodes at the beginning of the "left" reads, and then run a paired-end demultiplexer such as TagDust 2?

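For an example of how to run TagDust 2, you can look at my tutorial on GitHub.

To paste the barcodes, maybe you can follow the example below. For simplicity it pastes a file onto itself; in your case the first file would be the barcode fastq and the second one the "left" reads: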

$ cat toto.fq 
@toto1
AAAA
+
HHHH
@toto2
AAAA
+
HHHH

$ perl -nE '++$i % 2 == 0 ? print : say ""' toto.fq | paste -d '' - toto.fq 
@toto1
AAAAAAAA
+
HHHHHHHH
@toto2
AAAAAAAA
+
HHHHHHHH
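
The same pasting step could also be sketched in Python, which makes it easy to check that the barcode file and the read file stay in sync. This is only an illustration under the assumption that both files list the reads in the same order; the function name `prepend_barcodes` is hypothetical.

```python
def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from an open FASTQ file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def prepend_barcodes(barcode_path, left_path, out_path):
    """Write the left reads with the matching barcode read glued in front.

    Sequence and quality strings are both extended, so every record stays
    a valid FASTQ entry. Returns the number of records written.
    """
    written = 0
    with open(barcode_path) as barcodes, open(left_path) as left, \
            open(out_path, "w") as out:
        pairs = zip(read_fastq(barcodes), read_fastq(left))
        for (b_head, b_seq, b_qual), (l_head, l_seq, l_qual) in pairs:
            # Compare read names only; the '1:N:0:' style comments differ.
            if b_head.split()[0] != l_head.split()[0]:
                raise ValueError(f"files out of sync: {b_head} vs {l_head}")
            out.write(f"{l_head}\n{b_seq}{l_seq}\n+\n{b_qual}{l_qual}\n")
            written += 1
    return written
```

Calling `prepend_barcodes("toto.fq", "toto.fq", "pasted.fq")` should reproduce the output of the one-liner above; the resulting file can then go to a paired-end demultiplexer together with the untouched "right" reads.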

Answer written 18 months ago by lelle (Berlin):

I agree with Devon Ryan that it is probably easiest to get the data in the format you want from your sequencing provider. If that is not possible, you can use Flexbar, which supports separate barcode reads.


I only looked cursorily at the Flexbar page. Are you sure it can handle the situation here (where the barcode reads are in a separate file)? From my quick look, that does not seem to be the case.

Reply written 18 months ago by genomax

Yes, with the -br option. I am not sure if it works when you have two barcode read files.

Reply written 18 months ago by lelle

Answer written 9 months ago by genomax (United States):

The answer in the thread "Demultiplexing Illumina data" has a solution for this. I am posting it here to create a cross-reference.
