Question: Converting to fastq and demultiplexing Illumina Hiseq4000 data using bcl2fastq
3.8 years ago by
rubic190
United States
rubic190 wrote:

Hi,

I'm trying to generate demultiplexed fastq files from my HiSeq4000 run.

I ran 3 paired-end samples in two lanes, indexed by hexamer sequences on both reads. I spiked PhiX into each lane to increase sequence diversity.

Specifically, my fragments look like this:

[6bp-index]-[transcript]-[6bp-index]

The transcript parts of the fragments are near the 3' end so my reads are expected to look like this:

read1 - ran for 110 cycles:
[6bp-index]-[104bp transcript]
read2 ran for 55 cycles:
[6bp-index]-[46bp barcodes]-[3-bp polyA]

The RunInfo.xml file in the run folder says each read is 150 bp, the left index is 14 bp, and the right one is 8 bp:

Read Number="1" NumCycles="150" IsIndexedReads="N"
Read Number="2" NumCycles="14" IsIndexedReads="Y"
Read Number="3" NumCycles="8" IsIndexedReads="Y"
Read Number="4" NumCycles="150" IsIndexedReads="N"
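
For reference, the read layout quoted above can be pulled out of RunInfo.xml programmatically. This is only a minimal sketch: the RunInfo/Run/Reads wrapper in RUNINFO_SNIPPET is a simplified stand-in built around the four Read lines above, not a copy of the real file, and the function name read_layout is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for RunInfo.xml (assumption: a real file has more
# elements and attributes, but the Read elements look like these).
RUNINFO_SNIPPET = """
<RunInfo>
  <Run>
    <Reads>
      <Read Number="1" NumCycles="150" IsIndexedReads="N" />
      <Read Number="2" NumCycles="14" IsIndexedReads="Y" />
      <Read Number="3" NumCycles="8" IsIndexedReads="Y" />
      <Read Number="4" NumCycles="150" IsIndexedReads="N" />
    </Reads>
  </Run>
</RunInfo>
"""


def read_layout(xml_text):
    """Return a list of (read_number, num_cycles, is_index) tuples."""
    root = ET.fromstring(xml_text.strip())
    layout = []
    for read in root.iter("Read"):
        layout.append(
            (
                int(read.get("Number")),
                int(read.get("NumCycles")),
                read.get("IsIndexedReads") == "Y",
            )
        )
    return layout


layout = read_layout(RUNINFO_SNIPPET)
for number, cycles, is_index in layout:
    kind = "index read" if is_index else "insert read"
    print(f"Read {number}: {cycles} cycles ({kind})")
```

The point of listing the layout this way is that every --use-bases-mask has to be written against these four reads and their cycle counts, not against the intended construct.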

I tried several combinations of SampleSheet and --use-bases-mask argument for the bcl2fastq parameters, such as:

Under [Reads] in the SampleSheet, only read lengths of 150 seem to work:

150
150

And under the [Data] header in the SampleSheet file I define:

Lane,Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,index2,Sample_Project,Description
1,Lib1,,,,AR006,GCCAAT,GCCAAT,,
2,Lib1,,,,AR006,GCCAAT,GCCAAT,,
1,Lib2,,,,AR008,ACTTGA,ACTTGA,,
2,Lib2,,,,AR008,ACTTGA,ACTTGA,,
1,Lib3,,,,AR012,CTTGTA,CTTGTA,,
2,Lib3,,,,AR012,CTTGTA,CTTGTA,,

And --use-bases-mask Y104N*,I6N8,I6N2,Y46N*

But the only fastq files I'm getting are of the undetermined reads. So my question is whether this is real, meaning I didn't get any of my expected reads and basically only sequenced PhiX, or whether I am incorrectly specifying the SampleSheet and --use-bases-mask parameters.

Thanks a lot

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by rubic190

Just want to confirm that your indexes are "inline" (they appear to be) as designed?

The method you are using is for Illumina indexes, which are read as a separate read (they are never part of the actual read). In your example above, this run was set up as a 150 bp paired-end run with a 14 bp index 1 and an 8 bp index 2. So the pair of Illumina indexes can only be used to separate your samples (assuming each sample was labeled with two barcodes). After that point you will need to deal with your inline barcodes separately.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by genomax83k

Can you post a snippet of the reads from one of your undetermined read files? The reads should have what the sequencer read as indexes in the fastq header. They would be concatenated as index1index2 in one stretch (14+8 or 13+7 bases).

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by genomax83k

Snippets of all four of them, just to be safe.

Lib1 read1:

@K00125:39:HCFWHBBXX:7:1101:11363:998 1:N:0:NTAAAA+NCCACA
NAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
#--7<FFFFFJJJJJJJJJJJJJJJFFFFJJJJFFJJJFFJJFJFJJJFJJJJJFJJJJJJJJJJJJJJFJFJFJJFJJJJJFJFJJJJJJJJJJFFJJFFFF
@K00125:39:HCFWHBBXX:7:1101:22343:1068 1:N:0:NTCTTT+NGATCT
NAAAAGATTGAGTGTGAGGTTATAACGCCGAAGCGGTAAAACTTTTAATTTTTGCCGCTGAGGGGTTGACCCAGCGAAGCGCGGTAGGTTTTCTGCTTAGGTGT
+
#7<<F-<7<-<-A-<-F--AAFAJA-----AA----<-AAA-<<AA<-7<AFFA----7-7---7A<7<---7--7--7-7-7777<<<AA<-77-77<7<-7-
@K00125:39:HCFWHBBXX:7:1101:8440:1138 1:N:0:NCGGGA+NGATCT
NCTTATCAGAAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTC
+
#-<<<<-<-<FJJJJJ<FFF7FJFFJF<7--A-AFJFJJJJ-<-<-<<F-<7-<<FJJJ-<--7<<--<-<AAJ--7<--7-7--7-7-<777<-<<----7-
@K00125:39:HCFWHBBXX:7:1101:18588:1138 1:N:0:NTGAGA+NGATCT
NGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGAT
+
#--7<<<<--<--AAFA--<--AAAA-<<-AAA--AA7-A-<<--AA<-<<-<-7-<<------77<7-<--77-7-<----77777-7-<-<-<<---77<7<
@K00125:39:HCFWHBBXX:7:1101:29833:1138 1:N:0:NCGGGA+NGATCT
NTTGGGGATTGAGAAAGAGTAGAAATGCCACAAGCCTCAATAGCAGGTTTAAGAGCCTCGATACGCTCAAAGTCAAAATAATCAGCGTGACATTCAGAAGGGT
+
#<<----7AA-F-FJJ7F7AJ-AFJ<---A-AA---<-AA7<--<--7<A<<-<---7--77<---<-7<<-7-<<FJ<FJ<-<7--7-A-<---<-<A77<7

Lib1 read2:

@K00125:39:HCFWHBBXX:7:1101:11363:998 2:N:0:NTAAAA+NCCACA
NAAAAAATAAAACAACCAAAAAAAAAACAAAAAAAAAAAAACAAA
+
#<AA<F<--AFJ<F-A-<FA<<JJJFJ--<FJJ-FF-AFJJ-FFJ
@K00125:39:HCFWHBBXX:7:1101:22343:1068 2:N:0:NTCTTT+NGATCT
NCAGGCAAAAAATTTAGGGTCGGCATCAAAAGCAATATCAGCACC
+
#-A---AAAAFJ<FAJ--------7--<<<<--<<7<<-<--<--
@K00125:39:HCFWHBBXX:7:1101:8440:1138 2:N:0:NCGGGA+NGATCT
NTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCG
+
#<---A<<-A<<-A--FF-----77-7--<-FFAA<-<<7FF----
@K00125:39:HCFWHBBXX:7:1101:18588:1138 2:N:0:NTGAGA+NGATCT
NCCAAACATAAATCACCTCACTTAAGTGGCTGGAGACAAATACTCT
+
#--AAA-AAAAA<-A--<-<-7<<<7<77-7---7--<<-77-7-7
@K00125:39:HCFWHBBXX:7:1101:29833:1138 2:N:0:NCGGGA+NGATCT
NATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATT
+
#A<<<-<AA<-<-77-<---<77-7--77-7777-<<<AA7F<<<<

Lib2 read1:

@K00125:39:HCFWHBBXX:8:1101:9881:998 1:N:0:NAATTC+NAAAAA
NAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
#--<7-<FAA<-FJJFFJJJJJJFFFFFJFJJJFFJJJJJJJJJJJJJJJJJJJJFJJJJJFJJJJJJ<FJJJA-7<FFJAJFFJAJJJJJJJJJJJJJJJJJ
@K00125:39:HCFWHBBXX:8:1101:14834:998 1:N:0:NAAAAA+NTCCCC
NAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
#<--FFAFFAFFJJJJJFAFAFJJFFJJJJJJJJJJFJJJJJJJJFJ<FFJJJJJJJJJJJJJFJJJJJJJJJFF-<FFJ7F<FJJJJJJJJAJJJJJJJJJJ
@K00125:39:HCFWHBBXX:8:1101:7466:1033 1:N:0:NAAATC+NACCAA
NAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
#-<-<77FFFFFJJFFJJJJJJJJJJFFJAAFJJJJFFJJJJJJFJJFFJJJJJJJJ-JJJJJJJJAFJJJJJJJJFJJJJJJJFJJJJJJJF<JJJJJJJJJ
@K00125:39:HCFWHBBXX:8:1101:1570:1086 1:N:0:NAAAAA+NAAAAC
NAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
#-<7<<-<7FAAJJJFJJJJJJJJJJJJFFJJFJJJJJFFFFFJF<FFFFJJJJJFJFJJJJJJJJJJFAFJJJ-<A<FJJJJJJJAFJFJJJJJJFFJJJJJ
@K00125:39:HCFWHBBXX:8:1101:25865:1121 1:N:0:NATATC+NCAACA
NAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAA
+
#--7--7<F<<<FFJ-AAFFFJFFJJJJFJJJ-AAAAFFFFJJJJFFFF<-<AAFJFFFFF7AFFFFFJJJJAFFFJJFFFJFFF-7AFJJF-FFFJJ7FF-7

Lib2 read2:

@K00125:39:HCFWHBBXX:8:1101:9881:998 2:N:0:NAATTC+NAAAAA
NAATATCAACCAAAACAAACAATATAAAAAAAATAAAACAAAA
+
#-A----<--7-AA7---<--<----<<FJJJJ-7-<--<<<7
@K00125:39:HCFWHBBXX:8:1101:14834:998 2:N:0:NAAAAA+NTCCCC
NTTACCTTTCCCGGCCCCCCCTCCATCTACCACACAAAGGGGAAC
+
#--A--------AA--------------<--<-<--<7---77--
@K00125:39:HCFWHBBXX:8:1101:7466:1033 2:N:0:NAAATC+NACCAA
NAAAAAAAAAAAAAAAAAACAAACAAAAAAAAACACAAAATAAAA
+
#AA--FAF7JJ-A7AFFJJ-<<F-<JJF<7<A--<<-<-<-<<-A
@K00125:39:HCFWHBBXX:8:1101:1570:1086 2:N:0:NAAAAA+NAAAAC
NCCATGCAAAAAAAACAACAACCAAACAAAACAAACACACAAAAA
+
#A-<---A-AA--FA-----<--<F--<FF<-<-<<<-<-<-<AJ
@K00125:39:HCFWHBBXX:8:1101:25865:1121 2:N:0:NATATC+NCAACA
NAGTTTTTGTTCCTATTTTTTCTCGCATTCCTTTCCTTCCCTTGTT
+
#---A-<---------A-----------------------------
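
To see at a glance which index sequences dominate an undetermined file, the index field of the headers above can be tallied. This is a rough sketch, assuming the header layout shown in these snippets (index field after the last colon, the two index reads joined by "+"); the function name index_from_header is made up for illustration, and only the five lane 7 headers above are used as input.

```python
from collections import Counter


def index_from_header(header):
    """Return the index field (text after the last ':') of a FASTQ header."""
    comment = header.split(" ", 1)[1]  # e.g. "1:N:0:NTAAAA+NCCACA"
    return comment.rsplit(":", 1)[1]


# Headers copied from the lane 7 read 1 snippet above.
headers = [
    "@K00125:39:HCFWHBBXX:7:1101:11363:998 1:N:0:NTAAAA+NCCACA",
    "@K00125:39:HCFWHBBXX:7:1101:22343:1068 1:N:0:NTCTTT+NGATCT",
    "@K00125:39:HCFWHBBXX:7:1101:8440:1138 1:N:0:NCGGGA+NGATCT",
    "@K00125:39:HCFWHBBXX:7:1101:18588:1138 1:N:0:NTGAGA+NGATCT",
    "@K00125:39:HCFWHBBXX:7:1101:29833:1138 1:N:0:NCGGGA+NGATCT",
]

index1_counts = Counter()
index2_counts = Counter()
for header in headers:
    index1, index2 = index_from_header(header).split("+")
    index1_counts[index1] += 1
    index2_counts[index2] += 1

print("index 1:", index1_counts.most_common())
print("index 2:", index2_counts.most_common())
```

On a full file the same loop would run over every fourth line starting at the first; none of the index pairs in these five headers matches a SampleSheet entry (GCCAAT, ACTTGA, CTTGTA).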
ADD REPLYlink written 3.8 years ago by rubic190

That change produces this error: std::exception::what: UseBasesMask formatting error. Mask size does not match number of cycles in RunInfo.xml. RunInfo.xml cycles: 150 Base mask:
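
A mask can be sanity-checked against the RunInfo.xml cycle counts before running bcl2fastq. The sketch below is not bcl2fastq's own validation logic; it assumes a simple rule (a segment without a trailing wildcard must account for exactly the read's cycles, and a segment with "*" must not exceed them). The function names and the second example mask are hypothetical.

```python
import re

SEGMENT = re.compile(r"(?:[YyIiNn](?:\d+|\*)?)+")
TOKEN = re.compile(r"([YyIiNn])(\d+|\*)?")


def fixed_cycles(segment):
    """Return (explicitly counted cycles, has_wildcard) for one read's mask."""
    if not SEGMENT.fullmatch(segment):
        raise ValueError(f"cannot parse mask segment: {segment!r}")
    fixed, wildcard = 0, False
    for _letter, count in TOKEN.findall(segment):
        if count == "*":
            wildcard = True
        elif count == "":
            fixed += 1  # a bare letter stands for a single cycle
        else:
            fixed += int(count)
    return fixed, wildcard


def mask_fits(mask, cycles_per_read):
    """Check each comma-separated segment against the read's cycle count."""
    segments = mask.split(",")
    if len(segments) != len(cycles_per_read):
        return False
    for segment, cycles in zip(segments, cycles_per_read):
        fixed, wildcard = fixed_cycles(segment)
        if wildcard and fixed > cycles:
            return False
        if not wildcard and fixed != cycles:
            return False
    return True


RUN_CYCLES = [150, 14, 8, 150]  # from the RunInfo.xml lines in this thread

print(mask_fits("I6Y104n*,n*,n*,I6Y46n*", RUN_CYCLES))
print(mask_fits("Y110,I6N8,I6N2,Y55", RUN_CYCLES))  # hypothetical mask without wildcards
```

Under this rule the second mask fails because Y110 and Y55 do not cover the 150 cycles of reads 1 and 4.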

ADD REPLYlink written 3.8 years ago by rubic190

Please use ADD REPLY/ADD COMMENT to provide additional information on existing posts.

ADD REPLYlink written 3.8 years ago by genomax83k

Edit the RunInfo.xml to the following and try @Harold's solution again (please save a copy of the original file under a new name first):

Read Number="1" NumCycles="150" IsIndexedReads="N"
Read Number="2" NumCycles="14" IsIndexedReads="N"
Read Number="3" NumCycles="8" IsIndexedReads="N"
Read Number="4" NumCycles="150" IsIndexedReads="N"

Your sequences will be in the R1 and R4 files (the read headers inside the R4 file will say 4:N: etc.).
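
The backup-then-edit step can also be scripted. This is a minimal sketch, assuming the Read elements carry an IsIndexedReads attribute as in the lines above; the function name, the ".orig" backup suffix and the stripped-down demo file are all made up for illustration, and ElementTree may rewrite whitespace or the XML declaration when saving.

```python
import shutil
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def mark_reads_as_non_index(runinfo_path, backup_suffix=".orig"):
    """Back up RunInfo.xml, then set IsIndexedReads="N" on every Read.

    Returns the number of Read elements that were changed.
    """
    path = Path(runinfo_path)
    backup = path.with_name(path.name + backup_suffix)
    if not backup.exists():  # never overwrite an existing backup
        shutil.copy2(path, backup)

    tree = ET.parse(str(path))
    changed = 0
    for read in tree.getroot().iter("Read"):
        if read.get("IsIndexedReads") == "Y":
            read.set("IsIndexedReads", "N")
            changed += 1
    tree.write(str(path), encoding="utf-8", xml_declaration=True)
    return changed


# Demo on a throwaway file (simplified stand-in for a real RunInfo.xml).
DEMO_XML = """<RunInfo><Run><Reads>
<Read Number="1" NumCycles="150" IsIndexedReads="N" />
<Read Number="2" NumCycles="14" IsIndexedReads="Y" />
<Read Number="3" NumCycles="8" IsIndexedReads="Y" />
<Read Number="4" NumCycles="150" IsIndexedReads="N" />
</Reads></Run></RunInfo>"""

with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "RunInfo.xml"
    demo.write_text(DEMO_XML)
    n_changed = mark_reads_as_non_index(demo)
    backup_text = demo.with_name("RunInfo.xml.orig").read_text()
    edited_text = demo.read_text()

backup_has_index = 'IsIndexedReads="Y"' in backup_text
edited_has_index = 'IsIndexedReads="Y"' in edited_text
print(n_changed, backup_has_index, edited_has_index)
```

Keeping the untouched copy next to the edited file makes it easy to restore the original description of the run afterwards.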

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by genomax83k
3.8 years ago by
United States
harold.smith.tarheel4.5k wrote:

You provide two incompatible descriptions of the run:

read1 - ran for 110 cycles:
[6bp-index]-[104bp transcript]
read2 ran for 55 cycles:
[6bp-index]-[46bp barcodes]-[3-bp polyA]

but then

Read Number="1" NumCycles="150" IsIndexedReads="N"
Read Number="2" NumCycles="14" IsIndexedReads="Y"
Read Number="3" NumCycles="8" IsIndexedReads="Y"
Read Number="4" NumCycles="150" IsIndexedReads="N"

The RunInfo.xml correctly describes the data structure of the sequencing run. As GenoMax2 said, Illumina indexes are contained in the adapters and sequenced separately (Reads 2 & 3 for dual indexing). You used inline barcodes at the beginning of the insert reads (Reads 1 & 4). So the syntax you want for demultiplexing is:

--use-bases-mask I6Y104n*,n*,n*,I6Y46n*
ADD COMMENTlink written 3.8 years ago by harold.smith.tarheel4.5k

@rubic: It appears that your samples were put on a completely unrelated run (if you did not make use of the Illumina indexes). You would want to check the output carefully to ensure that @Harold's solution worked as intended.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by genomax83k

Sorry about that, I made a mistake in my description.

The reads are:

read1 - ran for 110 cycles:

[104bp transcript]
[6bp-index]

read2 ran for 55 cycles:

[6bp-index]
[49bp transcript]
ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by rubic190

RunInfo.xml describes how the actual sequencing ran (irrespective of how you wanted your sample to be run).

I don't understand what [6bp-index]-[46bp barcodes] means. Are both of those inline in Read 2? I think you are going to be better off dealing with the inline barcodes outside bcl2fastq.

Did you use Illumina barcodes or not (just so we get that out of the way)?

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by genomax83k

Sorry about that. The barcodes are internal to my construct, not Illumina's. I corrected it.

So for simplicity: Read 1 was sequenced for 110 cycles where the last 6 bp are the hexamer index. Read 2 was sequenced for 55 cycles, where the first 6 bp are the same hexamer index.

Trying --use-bases-mask Y110N*,I6N*,I6N*,Y55N* or Y104N*,I6N*,I6N*,Y49N* only yields undetermined reads.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by rubic190

Have you tried Y104I6n*,n*,n*,I6Y49n*? I have no idea if this will work. If it does not (I assume you have edited the RunInfo.xml file as indicated in my post above?), then I would say just collect the 150 bp PE reads and post-process them outside bcl2fastq.
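
As an illustration of what post-processing outside bcl2fastq could look like, here is a rough sketch that assigns one read pair to a sample from the inline hexamers. It assumes the corrected layout described in this thread (read 1: 104 bp transcript followed by the 6 bp index; read 2: 6 bp index followed by 49 bp transcript), that the hexamer appears in the same orientation in both reads, and exact matching only. The function name and the synthetic test reads are made up.

```python
# Inline hexamers from the SampleSheet rows earlier in the thread.
BARCODES = {"GCCAAT": "Lib1", "ACTTGA": "Lib2", "CTTGTA": "Lib3"}


def assign_pair(read1_seq, read2_seq, r1_insert=104, r2_insert=49, bc_len=6):
    """Return (sample, trimmed_read1, trimmed_read2), or None if unassigned.

    Assumes the barcode sits right after the read 1 insert and at the very
    start of read 2, and requires both barcodes to name the same sample.
    """
    bc1 = read1_seq[r1_insert:r1_insert + bc_len]
    bc2 = read2_seq[:bc_len]
    sample = BARCODES.get(bc1)
    if sample is None or BARCODES.get(bc2) != sample:
        return None
    return sample, read1_seq[:r1_insert], read2_seq[bc_len:bc_len + r2_insert]


# Synthetic 150 bp reads, for illustration only.
r1 = "A" * 104 + "GCCAAT" + "T" * 40
r2 = "GCCAAT" + "C" * 49 + "G" * 95

result = assign_pair(r1, r2)
unassigned = assign_pair("A" * 150, r2)
print(result[0] if result else "undetermined")
```

In practice one would probably allow a mismatch in the hexamer and check whether the read 2 barcode needs to be reverse-complemented; both are left out here to keep the sketch short.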

ADD REPLYlink written 3.8 years ago by genomax83k

'I' = use this sequence as the index. Since your index is NOT part of Illumina index reads 2 & 3, you need to ignore those (i.e., use n* instead of I6n*).

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by harold.smith.tarheel4.5k