How to split a fastq file into each corresponding sample.fastq?
2
0
Entering edit mode
5.2 years ago
tiago211287 ★ 1.4k

I have a illumina hiseq2500 run with Forward and Reverse files. Each file contains many samples with two indexes located at the read label, as in:

@GHAY-HISEQ2:5:2308:17910:42054#CCTAGAAT-AGGTAAGG/2;1

I'm looking for a tool that could help me split each fastq sample inside this BIG fastq based on the 2 indexes. Anyone have ever faced this problem?

Thanks.

demultiplexing fastq illumina • 7.5k views
0
Entering edit mode

Could you paste example read record for each index?

0
Entering edit mode
@GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
NGCATGAACGGCTAAACGAGGGTCCAACTGTCTCTTATCT
+GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
B[[aaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

1
Entering edit mode

Input fastq:

$cat test.fq @GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1 NGCATGAACGGCTAAACGAGGGTCCAACTGTCTCTTATCT +GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1 B[[aaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee @GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1 NAGCTGGAATTACCGCGGCTGCTGGCACCAGACTTGCCCT +GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1 B[[[aeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee  barcodes.txt: $ cat barcodes.txt
TTGCTGGA-ACCAACTG
GTTCATTA-AGGTAAGG
GAGAGTTG-AGGTAAGG
CATCAGAC-AGGTAAGG


code:

$for i in$(cat barcodes.txt);do seqkit grep -r -p $i test.fq >$i.fq ;done


Seqkit is available here

There will be 4 output files: one for each bar code in barcodes.txt (with extension each barcode.fq-- eg. GAGAGTTG-AGGTAAGG.fq ). Reads with matching barcodes will be present in each barcode.fq.

Since there is only one match with barcodes.txt, only TTGCTGGA-ACCAACTG.fq will have reads inside it, rest fastq will be empty.

\$ cat TTGCTGGA-ACCAACTG.fq
@GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
NGCATGAACGGCTAAACGAGGGTCCAACTGTCTCTTATCT
+
B[[aaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee


However, seqkit output is truncating text after +, for each read and IMHO, this is not an issue as the text is duplicated in header line (starting with @).

4
Entering edit mode
5.2 years ago
GenoMax 123k

This is a pretty non-standard way of supplying index sequences (not sure how the /2;1 came into existence.

You can try demuxbyname.sh from BBMap suite.

0
Entering edit mode

The tool seems perfect for the job, but for some reason my barcodes do not match the reads and produce 0 size files. Can you tell me what am I doing wrong?

demuxbyname.sh in=s_5_#_sequence.fastq.gz out=sequence.test/out_%_#.fq prefixmode=f names=barcodes2.txt

TTGCTGGA-ACCAACTG
GTTCATTA-AGGTAAGG
GAGAGTTG-AGGTAAGG
CATCAGAC-AGGTAAGG

@GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
NGCATGAACGGCTAAACGAGGGTCCAACTGTCTCTTATCT
+GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
B[[aaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
@GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1
NAGCTGGAATTACCGCGGCTGCTGGCACCAGACTTGCCCT
+GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1
B[[[aeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

0
Entering edit mode

The problem seems to be related to this final "/1;1" or "/2;1", after removing it, the application works nicely.

0
Entering edit mode

Just a minor comment, after using a smallsampled.fastq, there was only 1 read match(including quality), but the summary reported 2 reads. Do you know if is this normal? Is the program considering the quality as 1 read?

Time:               0.215 seconds.
Bases Processed:    1000        0.00m bases/sec
Bases Out:    80

0
Entering edit mode

Tagging: Brian Bushnell to see if he has any comments

1
Entering edit mode
demuxbyname.sh in=file.fq prefixmode=f length=21 out=barcode_%.fq


That will demultiplex by the last 21 characters of the header without need to specify a barcode file. Alternatively:

demuxbyname.sh in=file.fq names=names.txt substringmode out=barcode_%.fq


That will allow the names to be substrings of the headers, so the trailing "/2;1" will be ignored.

0
Entering edit mode

Thank you very much Brian.

0
Entering edit mode

Just a minor comment, after using a smallsampled.fastq, there was only 1 read match(including quality), but the summary reported 2 reads. Do you know if is this normal? Is the program considering the quality as 1 read?

The number of reads out should be correct, and the "+" line is ignored, only read headers (starting with @) are considered. Can you show the contents of the output matching reads, and the first 8 lines of the subsampled input file (I know you've posted parts of the file, I just want to make sure exacly what the contents were in this case), the command, and ideally the full screen output as well, just to see if the reads were detected as paired/interleaved or something like that.

0
Entering edit mode

#In theory 25 sequences \/

wc -l test.fq
100 test.fq

@GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
NGCATGAACGGCTAAACGAGGGTCCAACTGTCTCTTATCT
+GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
B[[aaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
@GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1
NAGCTGGAATTACCGCGGCTGCTGGCACCAGACTTGCCCT
+GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1
B[[[aeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

demuxbyname.sh in=test.fq names=barcodes.txt substringmode out=test/barcode_%.fq

Input is being processed as unpaired
Time:               1.013 seconds.
Bases Processed:    1000        0.00m bases/sec
Bases Out:    1920


I think the reads out should be 50. But there was an unmatched:

@GHAY-HISEQ2:5:2308:1155:1935#NNNNNNNN-NNNNNNNN/1;0
NAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+GHAY-HISEQ2:5:2308:1155:1935#NNNNNNNN-NNNNNNNN/1;0
B[[[aBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

1
Entering edit mode

After trimming the trailing /1;1 from the input first 8 lines, I got this:

demuxbyname.sh in=x.fq out=y%.fq suffixmode length=21
java -ea -Xmx1200m -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ jgi.DemuxByName in=x.fq out=y%.fq suffixmode length=21
Executing jgi.DemuxByName [in=x.fq, out=y%.fq, suffixmode, length=21]

Changed from ASCII-33 to ASCII-64 on input quality 66 for base N while prescanning.
Input is being processed as unpaired
Time:               0.060 seconds.
Bases Processed:    80  0.00m bases/sec
Bases Out:    80


...which is the output I expect. But in substring mode it looks like the output is correct, however everything is counted twice so the counts are wrong. Thanks for bringing that to my attention; I'll fix it soon.

0
Entering edit mode

0
Entering edit mode
5.2 years ago
Dan D 7.3k

The FASTX toolkit is way out of date, but I think FASTX Barcode Splitter may be what you need. As long as it works with your specific FASTQ format you're all set.

0
Entering edit mode

I was reading the manual of this tool but it seemed to me that it only look for barcodes at the read sequence, not in the labels. Or am I wrong?