Question: Trouble de-multiplexing single index paired-end illumina library
0
gravatar for caverill
4 weeks ago by
caverill40
caverill40 wrote:

I recently received a sequencing run from a sequencing center. This is an Illumina paired end Miseq run, Single-index. I usually get per-sample files post demultiplexing, however this time I got the whole lane, and with it just four raw files, listed below:

map_file.txt
Undetermined_S0_L001_I1_001.fastq
Undetermined_S0_L001_R1_001.fastq
Undetermined_S0_L001_R2_001.fastq

I would usually just send this through split_libraries_fastq.py in qiime, however I am not sure how to do this with the indexing. I would like to demultiplex these reads with minimal quality filtering. The ideal output would be two directories with forward and reverse reads for each sample as .fastq files.

I have copied the output of head file.path below for each file, if this is helpful. I realize the first two reads in the sequence files are probably junk.

head map_file.txt

#SampleID   BarcodeSequence LinkerPrimerSequence    sample_type Description geneticSampleID
OSBS.087.39.M.32.18.20140227    TCCCTTGTCTCC    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A1   OSBS_087-M-32-18-20140227-gen
OSBS.048.41.M.37.33.20140227    ACGAGACTGATT    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A2   OSBS_048-M-37-33-20140227-gen
OSBS.048.23.M.15.31.20140227    GCTGTACGGATT    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A3   OSBS_048-M-15-31-20140227-gen
OSBS.047.21.M.20.3.20140227 ATCACCAGGTGT    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A4   OSBS_047-M-20-3-20140227-gen
OSBS.119.23.M.18.38.20140227    TGGTCAACGATA    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A5   OSBS_119-M-18-38-20140227-gen
OSBS.047.41.M.22.36.20140227    ATCGCACAGTAA    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A6   OSBS_047-M-22-36-20140227-gen
OSBS.087.41.M.40.21.20140227    GTCGTGTAGCCT    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A7   OSBS_087-M-40-21-20140227-gen
OSBS.048.21.M.5.11.20140227 AGCGGAGGTTAG    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A8   OSBS_048-M-5-11-20140227-gen
OSBS.119.39.M.27.5.20140227 ATCCTTTGGTTC    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A9   OSBS_119-M-27-5-20140227-gen

head Undetermined_S0_L001_I1_001.fastq

@M02149:120:000000000-ACC8C:1:1101:15995:1333 1:N:0:0
NTTTTCTTTTTT
+
#11111133111
@M02149:120:000000000-ACC8C:1:1101:14849:1368 1:N:0:0
NCTTTTTTTTTT
+
#11111111000
@M02149:120:000000000-ACC8C:1:1101:15950:1377 1:N:0:0
NCTTTTTTTTCT

@M02149:120:000000000-ACC8C:1:1101:15995:1333 1:N:0:0

head Undetermined_S0_L001_R1_001.fastq

TTTTTTTCCCCTTCTTCTTTCTCCTTCTTTTTTTCTTTCCCTCTCTTTTTTTTTTCCCTCCCTTCTTTTCTTTTTCCTTTTTTTCTTTTTCTTTCCTCCTCCTCTTCCCTCTCTTTCTTTTTTTTTCTCTCCTTTTTTCCCCTTTTCTCTTTTTTTTTTTTTCTTCTTTTTTCTTTTTTTTTTTTTTTTCTTTCTTCTTCTTTTCTCTTTCTTCTTTCTTCTTTTTTTTTCTCTTCCTTCTTTCCTCTCTT
+
>A1>ADD11BFBBFFGB3BFABA1BF1111110A0111211A00012111/////01110//01011112111110112211//01111101112111000000111101000121221111///-01111011111-/00//.0000000000--------/////////-//////--9--------//9//////////////////////////////////----/////////////////////
@M02149:120:000000000-ACC8C:1:1101:14849:1368 1:N:0:0
TTTTTTTCCCCTTCTTCTTTCTCCTTCTTTTTTTTTTTTTTTTCTCTCTTTTTTTTCTTTTCCTCCTTTTTTTCTTTTTTTTCTTTTTTTCTTCCCCTCCCTTTTTTCTCTTTTTTTTTTCTTTTTTTTTTCCTCTTCTTTTTTTTTCCTTTTTCTTTTTTTTTTCTTTCTTTTCTCTTTCTCTTTTTCTTTTTTTTTCTTTTTTTTTCTTTTCTTTTCTCTTCTTTTTTTTCTCTTCTTTCTTTTCTTCT
+
>A1>ADD11DFBFGGGG3GGFBD1BG1111110//////////0122112111///011112111001211//011111///011111--01111///.../0000-/0000000-----/00000-----////////////----/////////////-----/////////////////////////////----//////---9/////////////////////---///////////////////
@M02149:120:000000000-ACC8C:1:1101:15950:1377 1:N:0:0
TTTTTTTCCCCTTCTTCTTTCTCATTCTTTTTTTTTCTTCTCCCCCTCCCCTTTTTCCTTCCCTTTCTTTTCTTCTTCTTTTCCTCCCCTTCTCCCTCTCCCCCTCTTCCCTCTCCCTCCTCCTCCTCCCCTCTCTTTTTTTTCTTTCTTTTCTTCTTTTTTCTCTCTCTTTTTTCTTTTTTCTTTCTCCTTCTTTTTTCTTTTTTCTTTCTTCTTTTCTTTTCTCCTCCTTCTTTTTTTTCTTTTTCTTT

head Undetermined_S0_L001_R2_001.fastq

@M02149:120:000000000-ACC8C:1:1101:15995:1333 2:N:0:0
NTTTTTTTCCCTTTTTTTTTTTTTTTTTTTTTTCTCTCTTCTCTTTTTTTCTCTTTTTTTTTTTTCTTTTTCTTTCTCTTTCCTTCTTTTTCCTTTCCCCCTCCCTTCCCCCCCTTTCTCCTTTCTTTTTTTTTTTCTTTTTTTTTTTTTTTTTCCTTCTCTTCTTTTTCTCTTTTTTTCTCTTTCCTCTTCTTCTCTTTCTCTTTCTTTTTTTTTTCTTTTTTTTCCTCTTCTTTTTTTTTTCTTTTTTT
+
#1>>>>>>11B1BFEF000A/AA/////////>012211212122111//01222111////-<-000000/0000000000000000000/0000000/.-.../0////.---.000///////////------//////------------///////////////////////--/////////////////////////////////-----//////---////////////-----//////--
@M02149:120:000000000-ACC8C:1:1101:14849:1368 2:N:0:0
NTTTTTTTCCCTTTTTTTTTTTTTTTTTTTTTTTTCTCTTCTTTTTCTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTCTCTTCTTTCTTCCCCCTTTTCCTTTCTCCTTCTCTCTCCCTTTCTTTTCTCTCTTTCTTTTTTTTCTTTTTTTTTTCCTTCTTTCTTTTTTTTTTTTTTTTCCCTCTCCTTTTTTTTCTTCTTTTTTTTTTTTTCTTTTTTTTTTTCTTCTTTTTTTCTTTTCTTTCTT
+
#1>>>>>>>B@DEFGF0A0A//A////>///////01222121111011111/011111///-------------------//////////////-----//////////////////////-/////////////////////---//////-----//////////////-----------////--///////---/////////--------//////------/////////--////////////
@M02149:120:000000000-ACC8C:1:1101:15950:1377 2:N:0:0
NTTTTTTTCCCTTTTTTTTTTTTTTTTTTTTTTTCTCTTTCCTTTCTTTTTTTTTTCTTTTTTTTTTTTTCTCCTTCCTTCCCCTCTTCCCTTCCCTCCCCCTCCTCCTCCCCTCCTCCCTCTCTTCCTTTTCTTTTTTCTCTCTTTTTTTTTTCTTCTTCCTTCTTTCCTTTTCCCTCCTCTCTTTTCCCTTCTTTTTCTTTTTTTCTTTTTTTTCTTTCTCTTTCTTTCCTTTCCTTTTTTTTTTCTTT
ADD COMMENTlink written 4 weeks ago by caverill40
1

You should ask your sequencing provider to do the right thing and reprocess the data so you get it in the format you are used to. This is bad customer service to dump non-demultiplexed data in customers lap and expect them to demultiplex it themselves.

If you have no option but to do this yourself then look at deML to get this done. sabre is another option.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by genomax58k

I'm with you on the "this is not cool" front. Unfortunately I have no recourse. If I ran the world raw data would be pased along as well as an extremely simple shell script or similar to demultiplex to per-sample files, and a readme with how to change the demultiplex settings. Thanks for the links!

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by caverill40

For some odd reason, it's actually easier to demultiplex with QIIME than try to get QIIME to work with demultiplexed data. The QIIME users I know specifically ask for non-demultiplexed data. At least this was the case with QIIME 1.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by igor7.0k

I remember those days. However, in my experience this was before single or double index paired end sequencing.

ADD REPLYlink written 4 weeks ago by caverill40

I was using QIIME demultiplexing for paired-end single index, so 3 total reads like in the original post.

ADD REPLYlink written 4 weeks ago by igor7.0k

I would love to run this through qiime- can you copy/paste the qiime command you were using? split_libraries_fastq.py? I thought this required the reads to already be paired.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by caverill40

This was my old protocol for QIIME 1.8. Not sure if it is still valid, but maybe it will give you some ideas.

# check fastq header
zcat FASTQ/lane1_NoIndex_L001_R1_001.fastq.gz | grep ":N:" | head
zcat FASTQ/lane1_NoIndex_L001_R2_001.fastq.gz | grep ":N:" | head

# fix barcode read to match R1 headers
zcat FASTQ/lane1_NoIndex_L001_R2_001.fastq.gz | sed 's/2:N:[0-9]\+:/1:N:0:/g' | gzip > FASTQ/lane1_NoIndex_L001_R2_001.fixed.fastq.gz

# merge R1 and R3
join_paired_ends.py \
--min_overlap 10 \
--perc_max_diff 20 \
--forward_reads_fp FASTQ/lane1_NoIndex_L001_R1_001.fastq.gz \
--reverse_reads_fp FASTQ/lane1_NoIndex_L001_R3_001.fastq.gz \
--index_reads_fp FASTQ/lane1_NoIndex_L001_R2_001.fixed.fastq.gz \
--output_dir 1_join_paired_ends

# validate mapping file
validate_mapping_file.py -m map.txt -o validate_mapping_file

# demultiplex
split_libraries_fastq.py \
--rev_comp_mapping_barcodes \
--phred_quality_threshold 19 \
--sequence_max_n 5 \
--sequence_read_fps 1_join_paired_ends/fastqjoin.join.fastq \
--barcode_read_fps 1_join_paired_ends/fastqjoin.join_barcodes.fastq \
--mapping_fps map.txt \
--output_dir 2_split_libraries_fastq_q20_n5_all

In my example, R2 corresponds to I1 and R3 to R2. That depends on how exactly you run bcl2fastq.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by igor7.0k

Workng with deML I ran the command:

deML -i index.txt -f  Undetermined_S0_L001_R1_001.fastq.gz -r  Undetermined_S0_L001_R2_001.fastq.gz -if1 Undetermined_S0_L001_I1_001.fastq.gz -o demultiplexed/

Conflicts for index1:
WARNING: deML has detected that you have 1024 open file descriptors out of a max. of 1024
         therefore you have reached the maximum. If the information is correct, certain files might be empty
         Either:
           1) Use BAM as input/output
           2) Check "ulimit -n" and put a higher number e.g. "ulimit -n 1024"
              If you already at the limit, increase the system limits

Any pointers? Should I create a new question?

ADD REPLYlink written 4 weeks ago by caverill40

Create a new question.

ADD REPLYlink written 4 weeks ago by genomax58k

I had to increase the limit by entering ulimit -n 4096. This was the max i could set on my cluster on the login node. I actually needed to go higher, and this was possible once I setup an interactive session or did this via a batch job.

ADD REPLYlink written 4 weeks ago by caverill40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1502 users visited in the last hour