I have received raw sequencing data from a collaborator, and the data is not demultiplexed. The fastq files that I usually have to analyse and demultiplex look like this:
Barcode + sequence
However, now I have three fastq files. For example:
One for the left reads:
@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 1:N:0:
NTCCTTAAACCTCTGGTAGAATTTGGCTGTGAATCCATCTGGTCCTGGACTCTTTTTGGTTGGTAAGCTATTGAT
+
#<DDDHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHHII
One for the right reads:
@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 3:N:0:
AATAGACGCAATAAAAAATGATAAAGGGGAAATCACCACCAATCCCACAGAAATACAAACTACCATCAGAGAATA
+
DDDDDIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
And a last file with the barcode associated with the read pair above. Note that the read identifier in the header is the same in all three files.
@JLK5VL1:840:HLKVHBCXX:1:1101:1489:2056 2:N:0:
GAGTGGAT
+
DCDDDIH<
Of course, I have a file with the barcode associated with each sample:
SAMPLE    INDEX   INDEX2
sample_6  GAGTGG  NA
I have tried to look for software that demultiplexes fastq data in this format (left_read.fastq, right_read.fastq and barcodes.fastq), but I have not been able to find anything. I feel that I could solve this with python using pysam, but since my collaborator is not a bioinformatician, I guess that there must be an existing tool for handling this.
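For reference, this is roughly the kind of script I have in mind. It is only a minimal sketch using the Python standard library (not pysam), and it rests on several assumptions that I have not verified for my data: every record is exactly four lines, the three files list the reads in the same order, the barcode from the sample sheet is matched exactly against the start of the index read (GAGTGG against GAGTGGAT in my example), and INDEX2 is ignored. The function names and the output file naming are just placeholders I made up.

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) from 4-line FASTQ records.

    `lines` can be an open file handle or any iterable of lines.
    An incomplete record at the end of the input is silently dropped.
    """
    it = iter(lines)
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def load_sample_sheet(lines):
    """Return {INDEX: SAMPLE} from whitespace-separated lines with a header row."""
    barcodes = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "SAMPLE":
            continue  # skip blank lines and the header row
        barcodes[fields[1]] = fields[0]
    return barcodes


def assign_sample(index_seq, barcodes):
    """Assumption: the sample barcode is an exact prefix of the index read."""
    for barcode, sample in barcodes.items():
        if index_seq.startswith(barcode):
            return sample
    return "undetermined"


def demultiplex(left_lines, right_lines, index_lines, barcodes):
    """Yield (sample, left_record, right_record), reading the files in lockstep.

    zip() stops at the shortest input, so files of unequal length are not detected.
    """
    triples = zip(read_fastq(left_lines), read_fastq(right_lines), read_fastq(index_lines))
    for left, right, index in triples:
        # The read identifier (text before the first space) must agree in all three files.
        ids = {rec[0].split()[0] for rec in (left, right, index)}
        if len(ids) != 1:
            raise ValueError("files are out of sync at " + left[0])
        yield assign_sample(index[1], barcodes), left, right


def write_demultiplexed(left_path, right_path, index_path, sheet_path, out_prefix):
    """Write <out_prefix><sample>_R1.fastq and _R2.fastq for each sample (made-up naming)."""
    handles = {}
    with open(sheet_path) as sheet:
        barcodes = load_sample_sheet(sheet)
    with open(left_path) as lf, open(right_path) as rf, open(index_path) as xf:
        for sample, left, right in demultiplex(lf, rf, xf, barcodes):
            if sample not in handles:
                handles[sample] = (
                    open(out_prefix + sample + "_R1.fastq", "w"),
                    open(out_prefix + sample + "_R2.fastq", "w"),
                )
            for handle, (header, seq, qual) in zip(handles[sample], (left, right)):
                handle.write(header + "\n" + seq + "\n+\n" + qual + "\n")
    for r1, r2 in handles.values():
        r1.close()
        r2.close()
```

I would call it as something like write_demultiplexed("left_read.fastq", "right_read.fastq", "barcodes.fastq", "samples.txt", "demux_"). It does not tolerate mismatches in the barcode and does not handle gzipped input, which is part of why I would prefer an existing, tested tool.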
So, long story short: is there a tool for demultiplexing datasets that come in the format left_reads.fastq, right_reads.fastq, barcodes.fastq?
best, and thanks for reading