Question

Demultiplexing based on dual indices in headers while allowing 1 mismatch to each index

0

3.8 years ago

Rezenman • 0

Hey all, I am looking for a tool to demultiplex my NovaSeq samples by the dual indices in the read headers. I designed my indices with a minimum Hamming distance of 3, so I want to allow one mismatch per index while demultiplexing in order to salvage as many reads as possible. So far I have used demuxbyname from BBMap, but it does not allow any mismatches. Any help will be appreciated :)

Fastq header example: @A00929:83:HL75TDRXX:1:2101:13431:1047 2:N:0:AGGCAGAA+NCTCTCCG

next-gen sequencing • 3.9k views

ADD COMMENT • link updated 7 months ago by Gabriel R. ★ 2.9k • written 3.8 years ago by Rezenman • 0

0

This is easily done by bcl2fastq when the data is originally demultiplexed. You may want to ask your sequence provider about this next time.

ADD REPLY • link 3.8 years ago by GenoMax 141k

0

Hey, thanks for the reply. In my pipeline I usually work on the FASTQ files before demultiplexing (trimming, quality control, etc.) and demultiplex only in the last step. Hence (and I forgot to mention this) I am looking for a tool that can work with FASTQ files. Thanks

ADD REPLY • link 3.8 years ago by Rezenman • 0

Gabriel R. · Accepted Answer · 2020-06-18

3

3.8 years ago

Gabriel R. ★ 2.9k

We have published a program called deML:

https://grenaud.github.io/deML/

See our paper:

Renaud G, Stenzel U, Maricic T, Wiebe V, Kelso J. deML: robust demultiplexing of Illumina sequences using a likelihood-based approach. Bioinformatics. 2015;31(5):770-772. doi:10.1093/bioinformatics/btu719

It handles partial matches and uses maximum likelihood to assign reads to the different samples. It also reports which indices were left unassigned or not found. It is robust to sequences of poor quality and to a poorly designed index list, and it handles dual indices too.

ADD COMMENT • link 3.8 years ago by Gabriel R. ★ 2.9k

0

Thanks I'll look into that

ADD REPLY • link 3.8 years ago by Rezenman • 0

0

let me know if you need help!

ADD REPLY • link 3.8 years ago by Gabriel R. ★ 2.9k

0

Hey Gabriel, can it handle index sequences that are found in the headers (such as the format I posted above)?

ADD REPLY • link 3.8 years ago by Rezenman • 0

0

Normally, you should supply the indices as separate files. However, you can easily extract the indices from the headers into separate files using UNIX commands. If you need a hand, paste a few sequences with the headers and I can have a look.

ADD REPLY • link 3.8 years ago by Gabriel R. ★ 2.9k

0

Rezenman : Use the code in my answer here to get a listing of all indexes present and their counts: C: Demultiplexing reads with index present in the labels
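The linked code is not reproduced in this thread. As a rough sketch of what such a listing could look like, the snippet below tallies the index pairs found in the headers. The helper names `count_indices` and `demo_fastq` are made up for illustration, and the sketch assumes the index pair is the last colon-separated field of the header comment, as in the example header in the question.

```shell
#!/usr/bin/env bash
# count_indices (hypothetical helper): tally the index pairs seen in FASTQ headers.
# Assumes the pair is the last colon-separated field of the header comment,
# e.g. "1:N:0:GTAGAGGA+NATCCTCT".
count_indices() {
  awk 'NR % 4 == 1 { n = split($2, f, ":"); count[f[n]]++ }
       END { for (pair in count) print count[pair] "\t" pair }' |
    sort -k1,1nr -k2,2
}

# Small demo on headers in the format used in this thread
# (sequence and quality lines shortened).
demo_fastq() {
  printf '%s\n' \
    '@A00929:83:HL75TDRXX:1:2101:13919:1047 1:N:0:GTAGAGGA+NATCCTCT' 'CATA' '+' 'FFFF' \
    '@A00929:83:HL75TDRXX:1:2101:14009:1047 1:N:0:AGGCAGAA+NCTCTCCG' 'GTGC' '+' 'FFFF' \
    '@A00929:83:HL75TDRXX:1:2101:14027:1047 1:N:0:GTAGAGGA+NATCCTCT' 'AAAA' '+' 'FFFF'
}
demo_fastq | count_indices
```

On a real file this would be run as `zcat yourfile.fq.gz | count_indices`. Pairs with low counts that sit one mismatch away from an expected pair are the reads that mismatch-tolerant demultiplexing would recover.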

ADD REPLY • link 3.8 years ago by GenoMax 141k

0

Hey Gabriel, I am adding some sequences for you to take a look at. Just to make sure that I understand correctly: in which format should I provide the index files? Thanks a lot!

@A00929:83:HL75TDRXX:1:2101:13919:1047 1:N:0:GTAGAGGA+NATCCTCT
CATATTGATAGTTCGCACAGGTAG
+
FFFFFFFFFFFFFFFFFFFFFFFF
@A00929:83:HL75TDRXX:1:2101:14009:1047 1:N:0:AGGCAGAA+NCTCTCCG
GTGCGTATCTATCAAAAATGTATA
+
FFFFFFFFFFFFFFFFFFFFFFFF
@A00929:83:HL75TDRXX:1:2101:14027:1047 1:N:0:GTAGAGGA+NATCCTCT
AAAACCCTGGCCAACATTGAAGGT
+
FFFFFFFFFFFFFFFFFFFFFFFF

ADD REPLY • link updated 3.8 years ago by GenoMax 141k • written 3.8 years ago by Rezenman • 0

0

I copied your file as "biostars.fq.gz"

First, extract the first indices:

zcat biostars.fq.gz | awk '{ if ((NR % 4) == 1) { print $0; print substr($2, length($2) - 16, 8); print "+"; print "FFFFFFFF" } }' | gzip > index1.fq.gz

Then extract the second indices:

zcat biostars.fq.gz | awk '{ if ((NR % 4) == 1) { print $0; print substr($2, length($2) - 7, 8); print "+"; print "FFFFFFFF" } }' | gzip > index2.fq.gz

I used 'F' as the quality score for every index base, which is Phred 37 in the usual Phred+33 encoding, the standard high-quality value on NovaSeq.
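The fixed offsets above assume two 8-base indices. As an alternative sketch, not the commands above, the index pair can be split on ":" and "+" so that the index length does not need to be hard-coded. The helper names `extract_index` and `demo_record` are made up for illustration.

```shell
#!/usr/bin/env bash
# extract_index N (hypothetical helper): read FASTQ on stdin and write an
# index FASTQ for index N (1 or 2) to stdout.
# Assumes the header comment ends in ":INDEX1+INDEX2".
extract_index() {
  awk -v which="$1" 'NR % 4 == 1 {
      n = split($2, f, ":")              # last field is "INDEX1+INDEX2"
      split(f[n], idx, "+")
      seq = idx[which]
      qual = seq
      gsub(/./, "F", qual)               # one "F" (Phred 37) per index base
      print $0; print seq; print "+"; print qual
  }'
}

# Demo on one record in the format used in this thread.
demo_record() {
  printf '%s\n' \
    '@A00929:83:HL75TDRXX:1:2101:13919:1047 1:N:0:GTAGAGGA+NATCCTCT' \
    'CATATTGATAGTTCGCACAGGTAG' '+' 'FFFFFFFFFFFFFFFFFFFFFFFF'
}
demo_record | extract_index 1
demo_record | extract_index 2
```

On a real file this would be used as `zcat biostars.fq.gz | extract_index 1 | gzip > index1.fq.gz`, and likewise with `2` for the second index.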

Then run deML (the paths below assume the files are in /tmp):

deML -o /tmp/output -i [PATH TO YOUR ]index.txt -f /tmp/biostars.fq.gz -if1 /tmp/index1.fq.gz -if2 /tmp/index2.fq.gz

Let me know if that works for you

ADD REPLY • link updated 3.8 years ago by GenoMax 141k • written 3.8 years ago by Gabriel R. ★ 2.9k

1

Hey, I had a similar problem as the OP and found your comment. I ran your deML on one sample to test it out and it seems to work well. However, I have 8 fastq files (4 forward and 4 reverse) that I need to run this on. Is this program set up such that it can run in parallel or would I need to use a for loop? Thanks for any info you can provide!

ADD REPLY • link 7 months ago by tjroger86 ▴ 10

1

You can use a for loop or start processes in parallel if you have the hardware to support this.
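Both options can be sketched as follows, under a few assumptions: the file names (`sample1.fq.gz`, `sample1.index1.fq.gz`, and so on), the shared `index.txt` and the helper name `demux_one` are all hypothetical, and only the deML options already shown in this thread are used. Paired reverse reads would need an additional option that is not shown in this thread, so check deML's own help output for that.

```shell
#!/usr/bin/env bash
# demux_one (hypothetical helper): run deML on one FASTQ file, using only the
# options shown earlier in this thread. All file names are assumptions:
#   sampleN.fq.gz, sampleN.index1.fq.gz, sampleN.index2.fq.gz, shared index.txt.
# With DRY_RUN set, the command is only printed, not executed.
demux_one() {
  local fq="$1"
  local base="${fq%.fq.gz}"
  ${DRY_RUN:+echo} deML -o "${base}.demux" -i index.txt \
    -f "$fq" -if1 "${base}.index1.fq.gz" -if2 "${base}.index2.fq.gz"
}

DRY_RUN=1   # remove this line to actually run deML

# Sequential: one file after another.
for fq in sample1.fq.gz sample2.fq.gz; do
  demux_one "$fq"
done

# Parallel: start every job in the background, then wait for all of them.
for fq in sample1.fq.gz sample2.fq.gz; do
  demux_one "$fq" &
done
wait
```

The parallel variant starts one process per file at once, so it only makes sense if the machine has enough cores and memory for all of them.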

ADD REPLY • link 7 months ago by GenoMax 141k

0

GenoMax has a good comment; feel free to email me if you need further help.

ADD REPLY • link 7 months ago by Gabriel R. ★ 2.9k

0

Hey all, thanks for your help, it worked well!

ADD REPLY • link 3.8 years ago by Rezenman • 0

2

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

ADD REPLY • link 3.8 years ago by GenoMax 141k