Question

Demultiplexing reads with index present in the labels

3

6.6 years ago

tiago211287 ★ 1.4k

Hello,

Earlier I had a problem (already solved, thanks to the help of Brian Bushnell and GenoMax) in which my index reads were not supplied in a separate file but in the FASTQ labels, as in this example:

@GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
NGCATGAACGGCTAAACGAGGGTCCAACTGTCTCTTATCT
+GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
B[[aaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
@GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1
NAGCTGGAATTACCGCGGCTGCTGGCACCAGACTTGCCCT
+GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1
B[[[aeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

With the aid of the demuxbyname.sh script from BBTools, I was able to demultiplex all reads whose indexes contain no mismatches.

This approach recovered nearly 90% of the reads, but I am wondering how I could extract, from the remaining 10%, the reads whose indexes contain up to 1 mismatch.

Does anyone know a method for doing this?

demultiplex illumina index rnaseq • 8.8k views

score 7 · Answer 1 · 2017-09-04

6.6 years ago

GenoMax 141k

You are not going to get all 10% of the remaining ones. If you look at the indexes in the undetermined pool, you will find a combinatorial mishmash of sequences representing the entire spectrum of possible tags. Use the following script to see which tags are present in the undetermined file, along with the number of reads for each.

You would run it with something like: zcat undetermined.fastq.gz | awk -f test.awk

test.awk should contain:

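# Headers in this thread look like @...:1934#TTGCTGGA-ACCAACTG/1;1, so split on "#".
# (For Casava 1.8+ headers, use FS = ":" and $10 instead.)
BEGIN { FS = "#"; }

# Every 4th line, starting at line 1, is a FASTQ header; count its tag.
((NR % 4) == 1) { barcodes[$2]++; }

END {
    for (bc in barcodes) {
        print bc ": " barcodes[bc];
    }
}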

You could then do another run of demuxbyname.sh to get additional reads out, if you think there are reads worth salvaging. Depending on what you are doing, you may be better off using only the reads with perfectly matching tags.
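To judge whether a tag from the undetermined pool is worth salvaging, one option is to check whether it lies within one mismatch of exactly one expected barcode. The Python sketch below illustrates that idea only; it is not how demuxbyname.sh works internally. The function names (hamming, tag_from_header, assign) and the EXPECTED list are assumptions made for this example, and EXPECTED simply reuses the two tags from the reads shown in the question. Mismatches are counted across the combined dual index, which is a design choice you may want to tighten per index.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatching positions; unequal lengths are treated as far apart."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def tag_from_header(header: str) -> str:
    """Pull 'TTGCTGGA-ACCAACTG' out of '@...:1934#TTGCTGGA-ACCAACTG/1;1'."""
    return header.split("#", 1)[1].split("/", 1)[0]


def assign(tag: str, expected: Iterable[str], max_mismatches: int = 1) -> Optional[str]:
    """Return the expected tag within max_mismatches, or None if zero or several match."""
    hits = [e for e in expected if hamming(tag, e) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet: the two tags seen in the example reads.
EXPECTED = ["TTGCTGGA-ACCAACTG", "CCTGGATA-TGCTCGAC"]

if __name__ == "__main__":
    # The first tag is a made-up single-mismatch variant, used only for illustration.
    for tag in ["TTGCTGGA-ACCAACTC", "AAAAAAAA-AAAAAAAA"]:
        print(tag, "->", assign(tag, EXPECTED))
```

Tags that map to an expected barcode this way could then be passed as extra names to a second demuxbyname.sh run and merged with the perfect-match output. Requiring a unique hit avoids assigning a read when two barcodes are equally close.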

0

Thanks for your comment. I was aware that I would not get all 10%, but I was hoping to recover ~3% of it. Indeed, it seems like a good idea to look at the number and kind of tags present in the undetermined file.

Contrary to what I thought, most of the reads are crap:

AAAAAGAA-AAAAAAAA/1;1: 14872
AAAAAAAG-AAAAAAAA/1;1: 15295
TTTTTTTT-TCTTTCCC/1;1: 15342
GAAAAAAA-AAAAAAAA/1;1: 19479
CCTAGAAT-TCCTTGGG/1;1: 19529
AAAAAAAA-AAAAAGAA/1;1: 19889
AAAAAAAA-AAAAAAAA/1;0: 20250
AAAAAAAA-AAAAAATA/1;1: 21386
TAAAAAAA-AAAAAAAA/1;1: 22318
AAAAAAAA-AAAGAAAA/1;1: 32298
AAAAAAAA-AAAAAAAA/1;1: 172855
NNNNNNNN-NNNNNNNN/1;0: 787700

So I think you are right. I will stay with the perfect matches. Thank you again!

Reply 6.6 years ago by tiago211287 ★ 1.4k