Demultiplex based on sequence, not headers
7 months ago
damt0320 • 0

Hi. I'm trying to demultiplex a dataset, but I'm having a couple of problems with my files. I have realized that the pair of barcodes associated with each ID isn't in the header but in the sequence. I have tried software like demuxbyname and deML, but these programs work with the barcodes in the header. I have 115 IDs, so I'm supposed to get 230 FASTQ files (forward and reverse), but when I manually extract the barcodes from the headers I get more than 200 barcode pairs. When I run demuxbyname with these barcodes I get nearly 500 files, of which about 230 have a considerable size, so I think those are the files I need. I think that the barcodes associated with the IDs are in the sequence and not in the header, so I'm looking for software able to demultiplex based on the sequence. I have paired-end reads. I have seen the FASTX barcode splitter, but I'm not sure whether it can demultiplex paired-end reads, because the example barcode text file has only one barcode per ID. Any help is appreciated. Thanks.

demultiplex illumina demuxbyname • 328 views
7 months ago
GenoMax 99k

I think that the barcodes associated with the IDs are in the sequence and not in the header, so I'm looking for software able to demultiplex based on the sequence.

Are you sure? If so, where are they located? At the beginning of the read? In that case they could be thought of as unique molecular indexes (UMIs), and you may be able to use umi-tools.

I have 115 IDs, so I'm supposed to get 230 FASTQ files (forward and reverse), but when I manually extract the barcodes from the headers I get more than 200 barcode pairs.

That is not unexpected. Illumina sequencing produces a combinatorial smattering of index pairs besides the expected ones. You should stick with the ones you know are there and ignore the rest of the data/files.
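One way to see which index pairs are really present is to tally the pair at the end of each header and look at the abundant ones. Below is a minimal Python sketch of that idea, assuming headers end in `:INDEX1+INDEX2` as in standard Illumina output. The function name `count_index_pairs`, the third read and the short sequences in the example are made up for illustration.

```python
from collections import Counter


def count_index_pairs(fastq_lines):
    """Count the index pair recorded at the end of each FASTQ header line."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        # Headers are every 4th line; do not test for '@' alone,
        # because quality strings can also start with '@'.
        if i % 4 != 0:
            continue
        header = line.strip()
        if not header:
            continue
        # Illumina headers end with "...:N:0:INDEX1+INDEX2"
        counts[header.rsplit(":", 1)[-1]] += 1
    return counts


# Toy input: two headers copied from the thread plus one invented read.
example = [
    "@M03485:29:000000000-J4R9F:1:1101:16949:2497 1:N:0:TAATTCGT+ATAGAGGC",
    "ACGT", "+", "FFFF",
    "@M03485:29:000000000-J4R9F:1:1101:24366:5906 1:N:0:TAATTCGT+ATAGAGGC",
    "ACGT", "+", "FFFF",
    "@M03485:29:000000000-J4R9F:1:1101:11111:1111 1:N:0:GGGGGGGG+ATAGAGGC",
    "ACGT", "+", "FFFF",
]
counts = count_index_pairs(example)
for pair, n in counts.most_common():
    print(pair, n)
```

For a real file, the same function should accept an open handle such as `gzip.open("your_file.gz", "rt")`. Pairs with very low counts are likely the unexpected combinations and can be ignored.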

Can you post a small example of the data you have? The output of zcat your_file.gz | head -8 (for both R1/R2 files) would be enough.


Hi GenoMax, thanks for your answer. Here is the example you asked for. For R1:

@M03485:29:000000000-J4R9F:1:1101:16949:2497 1:N:0:TAATTCGT+ATAGAGGC
CAATAGTCGCAGGAAGTAAAAGNCGTAANAAGGTCACCGTAGGTGAACCTGCGGTTGGATCATTACAAAAACAAAATGTTTGGGAAAAAAAAAAAGTCTTGCTTGTTCAAGATGTTTCCCAACACATTTTACACACCAACACTGTGTTAACGTTATTGTATTTTGGCGGTTTTCCCATTCAAGCGGACGTTTTGGTCACTCAGACCATAACAGCCGTGGGACCGCCAGCCTTATGCAAACTCAACTGTTTT
+
@M03485:29:000000000-J4R9F:1:1101:24366:5906 1:N:0:TAATTCGT+ATAGAGGC
ACGACACACACCAAGAGATCCGNTGTTGNAAGTTGTCACCATTAACAGTGTATCTCAGCCAAGATTCAGGTGTTTGTAACCACCGGGCCGCGCTGACCACTGGGCGAACCAGCAGCAGCAGCGACCCGAGAAACGGCACAGTGCACAGGGGTTCCACAGCGCAAGCGCTGGGTATCGGTAATGATCCAACCGCAGGTTCACCTACGGTTACCTTGTTACGACTTTTACTTCCTGCGACTAAGTAGATCGGA
+
AAA?3A@FAAAACEFGGGGGGG#BBABE#AAAFGHHHBBGFFG5FGGHGEGFHHGHFFHHFBFFHHHHG3GBGFF?GGFGHHGGDEEEGEGGEGGFHHFHGHHE/>CGGGHHHFHEHHHHFFCCGGGGGGGGHBCDCCGHHGFHHHHFC.-:-G/CFFFE@BGG@EG?FB-C/BFFD;.:AFFFFFFFFFFBBBBB9/BFFFFFFDA;;AFFEFFBBFFFFFF?FBB/FBFF9/BBBBBBFFBFFFBFAF;


For R2:

@M03485:29:000000000-J4R9F:1:1101:16949:2497 2:N:0:TAATTCGT+ATAGAGGC
GTTACACACACCAAGANATCCGNTGNTGANAGTTGTTTTCAATTAAGAAAAAAGACTTCACAGGAGATCATTTGTTACACAATTCAGATAAAAAAACAGTAGAGTTTGCATAAGGCTGGCGGTCCCACGGCTGATATGGTCTGAGTGACCAAAACGTCCGCATGAATGGGAAAACCGCCAAAATACAATAACGTTAACACCGTGTTGGTGTGTAAAATGTGTTGGGAAACATCTTGAACAAGCAAGACTTT
+
AA1AAFFDAA@AGFGG#AAAAF#BB#BBA#BBAFFHCEHGG2FGB1GHHHHHGGEB12DA1FBAAFGHH1FFG220DFEBG/GHFBH2F1B@FGE?EAFBEEBF1>1GFBGFHF1C1BE<EGC/C0GC@CCAC01?1FF?11011=<<1GBGEH0><<.@C-<A000:/G//;C/;:----/9FFF0000999;CBE0F/.-.9;A?--;9AA/;BFB//99FE?-ABE///9BBBBB9/---;BF--//9
@M03485:29:000000000-J4R9F:1:1101:24366:5906 2:N:0:TAATTCGT+ATAGAGGC
ACTTAGTCGCAGGAAGNAAAAGNCGNAACNAGGTACCCGTAGGTGAACCTGCGGTTGGATCATTACCGATACCTAGCGCTTGCGCTGTGGAACACCTGTGCACTGTGCCGTTTCTCGGGTCGCTGCTGCTGCTGGTTCGCCCAGTGGTCAGCGCGGCCCGGTGGATACAAACAGCTGAATCTTGGCTGAGATACACTGTTAATTGTTTCAACTTTCAACAACGGATCTCTTGGTGCGTGTCGTAGATCGCG
+


So your index sequences are already in the FASTQ headers (TAATTCGT+ATAGAGGC). There is nothing else to identify from your main read sequences.

You just need to use the known index pairs with demuxbyname.sh. You can use the hdist=1 option to allow one mismatch in the index sequences and recover additional data. For example, AATTCGT+ATAGAGGC and AGTTCGT+ATAGAGGC (one mismatch, at the second base) would be considered equivalent.
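To illustrate what "one mismatch allowed" means, here is a small Python sketch of Hamming-distance matching against a list of known index pairs. It only illustrates the idea and is not how demuxbyname.sh is implemented. Comparing the whole "INDEX1+INDEX2" string at once is an assumption, and the names `hamming` and `assign_index` and the second known pair are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_index(observed, known_pairs, hdist=1):
    """Return the single known pair within hdist mismatches, else None."""
    hits = [
        k for k in known_pairs
        if len(k) == len(observed) and hamming(observed, k) <= hdist
    ]
    # Reject ambiguous cases where more than one known pair is close enough.
    return hits[0] if len(hits) == 1 else None


# First pair is the example from the answer; the second is invented.
known = ["AATTCGT+ATAGAGGC", "CCGGAAT+TTGCACGA"]

print(assign_index("AGTTCGT+ATAGAGGC", known))  # one mismatch from the first pair
print(assign_index("GGGGGGG+ATAGAGGC", known))  # too far from every known pair
```

Returning None for ambiguous matches is a design choice in this sketch: if two sample index pairs differ by only a couple of bases, a read with one error could be close to both.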

Edit: Looks like I had already answered a similar question from you a few days back: "C: demuxbyname.sh output help". Please don't post similar content in multiple questions. It duplicates effort for you as well as others.