I have an Illumina HiSeq 2500 run with forward and reverse files. Each file contains many samples, identified by two indexes in the read header, as in:
@GHAY-HISEQ2:5:2308:17910:42054#CCTAGAAT-AGGTAAGG/2;1
I'm looking for a tool that can split this BIG fastq into one fastq per sample, based on the 2 indexes. Has anyone faced this problem before?
Thanks.
Could you paste an example read record for each index?
Input fastq:
barcodes.txt:
code:
Seqkit is available here
There will be 4 output files, one for each barcode in barcodes.txt, each named after its barcode with the extension .fq (e.g. GAGAGTTG-AGGTAAGG.fq). Each file will contain the reads whose header matches that barcode.
Since only one barcode in barcodes.txt matches the input, only TTGCTGGA-ACCAACTG.fq will contain reads; the other fastq files will be empty.
Note that the seqkit output truncates the text after + on the separator line of each read. IMHO this is not an issue, as that text duplicates the header line (the one starting with @).
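If you would rather not depend on seqkit, the same split can be sketched in plain Python. This is only a minimal illustration, not the seqkit command above: it assumes uncompressed, 4-line fastq records and headers where the two indexes sit between # and / as in the question. The names extract_barcode, read_records and split_by_barcode are made up for this example.

```python
import os
import re
from itertools import islice

# Dual index between '#' and '/', as in
# @GHAY-HISEQ2:5:2308:17910:42054#CCTAGAAT-AGGTAAGG/2;1
BARCODE_RE = re.compile(r"#([ACGTN]+-[ACGTN]+)")


def extract_barcode(header):
    """Return 'INDEX1-INDEX2' from a read header, or None if absent."""
    match = BARCODE_RE.search(header)
    return match.group(1) if match else None


def read_records(handle):
    """Yield fastq records as lists of 4 lines (assumes unwrapped fastq)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def split_by_barcode(fastq_path, barcodes, out_dir="."):
    """Write one <barcode>.fq per listed barcode; return reads per barcode."""
    handles = {bc: open(os.path.join(out_dir, bc + ".fq"), "w") for bc in barcodes}
    counts = dict.fromkeys(barcodes, 0)
    try:
        with open(fastq_path) as fastq:
            for record in read_records(fastq):
                barcode = extract_barcode(record[0])
                # Reads whose barcode is not listed are skipped.
                if barcode in handles:
                    handles[barcode].writelines(record)
                    counts[barcode] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

As with the seqkit approach, a file is created for every listed barcode, so barcodes without matching reads end up as empty files. Unlike seqkit, this sketch copies the + line unchanged. For paired files you would run it once on the forward and once on the reverse fastq, with separate output directories.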