Hello! I could use some help demultiplexing our ddRAD data. We have HiSeq 4000 paired-end data back from the sequencing facility, but they did not demultiplex it for us, so we have "Undetermined...fastq.gz" files with an "N" preceding each barcode in the headers, as well as at the start of the sequences themselves. I will paste an example below:
@K00337:359:HGJV5BBXY:7:1101:1499:1314 2:N:0:NATCCATG+NATTCATG
NCGCATCATGAACCATTACCGTTCAAAATTCCAGAGAGACTATAATACCTGTGATATGTAGGATTACTGAGATAAATTAATGATCCAATAGCCTGTATGTTTAAACTAGATCTTTGTTAGTATTACATAGAGCTATGGGTTGTAATTTTTC
+
#A<FF<FFJJJAFJJJ<FFJJJJJJFAJJJJJJJJJJAFJAJJJAJFFJJJ<A7JJJJJJJJ<FJ-AFFFJAFJFJJFJAJJ-AAFAFJFJJJF7F-7F<JJF-7-7FFFA<-<AFFAF7F<FAJJJ----<--7A<--A-<JA-F--<--
@K00337:359:HGJV5BBXY:7:1101:1681:1314 2:N:0:NGCCATCT+NACCGAGC
NTGCTGTCATGCTCTGATATCAGGCGGCTGTGGTCACACATCTCCTCTCGCTGTGGCCGAACCAGAAGCAGATATGAATGCAGGCTGCCTAAATTCTTCCTACTGCACTCCTTTCGGAGATTGCTGATCGTATTGTACTGCCCCCAGAACC
+
#A<<F--7FJJFFJFJ-F-7FJJJ-<FFFAA-<J<JJJJJ-F-FJF<JF7JA<<-<J---7FA-<-A7AF<-AJJ---<-<-77AF--7FF--7-----777A-7-7-77<-7-7--7-7-AF7----77--A-------A)-)))))---
@K00337:359:HGJV5BBXY:7:1101:1824:1314 2:N:0:NCCTATCA+NCTACGCC
NACATGTGGCAAGAAAGGAGGAAAAAAAGAGAGGAGGAGGAGCCAGGCTATTTTTAGCAATCAGATCTTATGGAAACTAATACTGAGAAGTCACTCGTTACGATGGCGGGGAGTCTGCAATTAGCTCGCCCCACGCTCGTCCAGGCTTCTG
We are hoping to demultiplex first using only the i7 index (our i5 index is being used to identify PCR duplicates). Then we will demultiplex once more using our inline barcodes.
We are currently losing 100% of our reads to ambiguous barcode drops (which we assume relates to this N insertion), even though we are specifying that we will allow for at least one mismatch.
Does anyone have any ideas on what program we could use to demultiplex this?
And has anyone ever encountered this issue? We are trying to figure out what went wrong so we can prevent this with future libraries.
Is there an N in the first position for all of the reads?
Can you try deML (A: demultiplexing tool for dual-indexed paired-end illumina libraries), which is described in this comment?
Thank you for replying so quickly! Yes, there is an N in that first position for all of our reads. Would you happen to know why this is the case?
We did start working with deML this morning after finding a thread similar to the one you've linked (Demultiplexing based on dual indices in headers while allowing 1 mismatch to each index), but we are getting a persistent error that we are still trying to troubleshoot.
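For anyone who lands here with the same problem, below is a minimal Python sketch of the idea discussed above: read the i7 index from the header, ignore the first position (the one that is always N), and allow one mismatch over the remaining bases. It is only an illustration, not a replacement for deML or any other tool. It assumes the first index in the header is the i7 and the second is the i5, and the sample names and barcode sequences in `EXAMPLE_BARCODES` are made up for the example; they are not our real sample sheet.

```python
# Hypothetical sample sheet (sample name -> expected i7 sequence).
# These barcodes are invented for illustration only.
EXAMPLE_BARCODES = {
    "sample_A": "CATCCATG",
    "sample_B": "TGCCATCT",
}


def parse_indexes(header):
    """Return (i7, i5) from an Illumina-style FASTQ header line.

    Assumes the index field is the last colon-separated item,
    written as "<i7>+<i5>", e.g. "NATCCATG+NATTCATG".
    """
    index_field = header.rstrip().rsplit(":", 1)[1]
    i7, _, i5 = index_field.partition("+")
    return i7, i5


def count_mismatches(observed, expected, skip=1):
    """Count differing positions, ignoring the first `skip` bases."""
    return sum(a != b for a, b in zip(observed[skip:], expected[skip:]))


def assign_sample(header, barcodes, max_mismatches=1, skip=1):
    """Return the single matching sample name, or None.

    None is returned when no barcode matches or when more than one
    does, since skipping a base makes collisions more likely.
    """
    i7, _ = parse_indexes(header)
    hits = [
        sample
        for sample, barcode in barcodes.items()
        if len(barcode) == len(i7)
        and count_mismatches(i7, barcode, skip) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    headers = [
        "@K00337:359:HGJV5BBXY:7:1101:1499:1314 2:N:0:NATCCATG+NATTCATG",
        "@K00337:359:HGJV5BBXY:7:1101:1681:1314 2:N:0:NGCCATCT+NACCGAGC",
        "@K00337:359:HGJV5BBXY:7:1101:1824:1314 2:N:0:NCCTATCA+NCTACGCC",
    ]
    for header in headers:
        print(parse_indexes(header)[0], assign_sample(header, EXAMPLE_BARCODES))
```

Note that skipping the first base shortens the effective barcode by one position, so two barcodes that differ only at that base can no longer be told apart. The sketch returns None in that case rather than guessing.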