Question

demuxbyname.sh output help

4.8 years ago

damt0320 • 0

Hi, I have a question about the demultiplexing output of demuxbyname. I recently demultiplexed a dataset with demuxbyname from the BBMap package and noticed that there are a lot of files in the output; some of them are large and others are small. My dataset has 120 IDs with 2 barcodes per ID (so I am supposed to get 240 files, 2 files per ID, the forward and the reverse, if I am not mistaken). But when I extract the indexes from my dataset, I get more barcodes than expected: about 400 barcodes, on 200 lines (2 barcodes per line). So I have a few questions:

1) When I extract the indexes from my dataset, why do I get more barcodes than the 240 I am supposed to get? (As I said, I have 120 IDs, so there should be 2 barcodes per ID.)
2) Once I have demultiplexed my dataset, how can I know which files are the right ones and which files I should delete? I was thinking of mapping each pair of fastq files, but I don't know whether that would help me identify the right ones. I just want to get the 240 fastq files of my dataset.
3) If mapping can help, do you recommend any specific program to do it from the command line?

I apologize if these questions are kind of dumb, but I am still learning bioinformatics and don't have much knowledge in this area. Any help will be appreciated. Thanks!

demultiplex mapping demuxbyname • 4.2k views

ADD COMMENT • link 4.8 years ago by damt0320 • 0

Can you provide the command line used for this run? Generally, Illumina data will contain index sequences that differ from the expected set of indexes by one or more nucleotides (or that contain N). Those could generate more files than expected.

ADD REPLY • link 4.8 years ago by GenoMax 152k

Hi GenoMax. The command used to create index.txt was

zgrep '@M' 2-1_S2_L001_R1_001.fastq.gz | cut -d ":" -f10 |sort |uniq >index.txt

and the code used to run demuxbyname was the following:

demuxbyname.sh in=2-1_S2_L001_R1_001.fastq.gz in2=2-1_S2_L001_R2_001.fastq.gz out=%_R1.fq.gz out2=%_R2.fq.gz suffix names=index.txt

ADD REPLY • link 4.8 years ago by damt0320 • 0

So that explains the greater number of files you are observing. I would just leave the index combinations you expect (and thus know are real) and remove the rest (or simply ignore the other files that have those indexes).
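
As a rough sketch of the suggestion above: counting how many reads carry each index usually separates the real combinations (many reads) from sequencing-error variants (few reads), and the observed list can then be reduced to the expected combinations. The helper names `count_indexes` and `keep_expected` and the file `expected.txt` are hypothetical. The sketch assumes 4-line FASTQ records and Illumina-style headers with the index in the 10th ':'-separated field, as in the `cut` command above.

```shell
# Count reads per index pair, most frequent first.
# Assumes 4-line FASTQ records and the index in header field 10.
# $1: gzipped R1 FASTQ file
count_indexes() {
    gzip -dc "$1" \
        | awk 'NR % 4 == 1' \
        | cut -d ':' -f 10 \
        | sort \
        | uniq -c \
        | sort -k1,1nr
}

# Keep only observed indexes that also appear in the expected list.
# $1: observed index list, one per line (e.g. index.txt)
# $2: expected index list, one per line (hypothetical expected.txt)
keep_expected() {
    grep -F -x -f "$2" "$1"
}

# Example usage (file names taken from the thread, expected.txt assumed):
#   count_indexes 2-1_S2_L001_R1_001.fastq.gz | head -n 20
#   keep_expected index.txt expected.txt > names.txt
```

The resulting names.txt could then be passed to demuxbyname.sh with names=names.txt so that only the expected combinations produce output files.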

ADD REPLY • link 4.8 years ago by GenoMax 152k

But then I get this:

Set INTERLEAVED to false
Input is being processed as paired
Time:               3.809 seconds.
Reads Processed:    6470116     1698.83k reads/sec
Bases Processed:    1623999116  426.41m bases/sec
Reads Out:          0
Bases Out:          0

This is the problem I mentioned: I have an Excel spreadsheet with 120 IDs and 2 barcodes per ID. When I run demuxbyname with the barcodes for those 120 IDs, I get the output above (0 reads out), but when I extract the barcodes with the command I posted (from the sequencing output), I get a lot of files. It is as if, at the moment of sequencing on the MiSeq, the barcodes I have in the Excel file were transformed into other barcodes. I don't really understand what is happening with this dataset. Perhaps it is something intrinsic to the Illumina sequencing process, but I don't get it.
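
One possible explanation (an assumption, not something confirmed in this thread) is that the barcodes in the spreadsheet are written in a different orientation, or joined differently, than they appear in the FASTQ headers. For example, the second index may be reverse-complemented, or the pair may lack the `+` separator. A minimal sketch to test this compares each expected pair against the observed index list, both as written and with the second barcode reverse-complemented. The helper names `revcomp` and `check_pairs` and the tab-separated file `pairs.tsv` (first barcode, tab, second barcode) are hypothetical.

```shell
# Reverse-complement a DNA sequence given as $1.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# For each expected pair, report whether "i7+i5" or "i7+revcomp(i5)"
# is present in the observed index list.
# $1: expected pairs, tab-separated (hypothetical pairs.tsv)
# $2: observed index list, one "i7+i5" per line (e.g. index.txt)
check_pairs() {
    # tr removes carriage returns that spreadsheet exports often contain
    tr -d '\r' < "$1" | while IFS="$(printf '\t')" read -r i7 i5; do
        [ -n "$i7" ] || continue
        rc=$(revcomp "$i5")
        if grep -q -F -x "${i7}+${i5}" "$2"; then
            printf '%s+%s\tas_written\n' "$i7" "$i5"
        elif grep -q -F -x "${i7}+${rc}" "$2"; then
            printf '%s+%s\ti5_revcomp\n' "$i7" "$rc"
        else
            printf '%s+%s\tnot_found\n' "$i7" "$i5"
        fi
    done
}

# Example usage (pairs.tsv is an assumed export of the spreadsheet):
#   check_pairs pairs.tsv index.txt
```

If most pairs come back as `i5_revcomp`, the first column of that output is already in the form seen in the headers and could serve as the names file. If most come back as `not_found`, the mismatch lies elsewhere and the header format should be inspected directly.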

ADD REPLY • link 4.8 years ago by damt0320 • 0

score 2 · Answer 1 · 2020-09-03

$ more names
TCCGCGAA+GGCTCTGA
TAATGCGC+CCTATCCT
TCCGCGAA+CCTATCCT
ATTCAGAA+CCTATCCT
CGGCTATG+CCTATCCT

$ demuxbyname.sh -Xmx10g in1=Undetermined_S0_L001_R1_001.fastq.gz in2=Undetermined_S0_L001_R2_001.fastq.gz out1=out_%_R1.fq.gz out2=out_%_R2.fq.gz names=names prefixmode=f

Should produce

$ ls -1 out_*.gz
out_ATTCAGAA+CCTATCCT_R1.fq.gz
out_ATTCAGAA+CCTATCCT_R2.fq.gz
out_CGGCTATG+CCTATCCT_R1.fq.gz
out_CGGCTATG+CCTATCCT_R2.fq.gz
out_TAATGCGC+CCTATCCT_R1.fq.gz
out_TAATGCGC+CCTATCCT_R2.fq.gz
out_TCCGCGAA+CCTATCCT_R1.fq.gz
out_TCCGCGAA+CCTATCCT_R2.fq.gz
out_TCCGCGAA+GGCTCTGA_R1.fq.gz
out_TCCGCGAA+GGCTCTGA_R2.fq.gz

You can add hdist=1 if you want to allow one mismatch between a read's index and a name in the list (use a larger value to allow more).