bcl2fastq --use-bases-mask error for dual index reads
3.5 years ago
Apex92 ▴ 280

Hi,

I have data from a MiniSeq run (single-end sequencing). All of my samples are dual-indexed, and every index is 8 nt long.

Below is the RunInfo.xml file:


<RunInfo xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Version="4">
<Run Id="201029_MN00153_0075_A000H37WNG" Number="75">
<Flowcell>000H37WNG</Flowcell>
<Instrument>MN00153</Instrument>
<Date>201029</Date>
<Reads>
  <Read Number="1" NumCycles="151" IsIndexedRead="N"/>
  <Read Number="2" NumCycles="8" IsIndexedRead="Y"/>
  <Read Number="3" NumCycles="8" IsIndexedRead="Y"/>
</Reads>
<FlowcellLayout LaneCount="1" SurfaceCount="2" SwathCount="3" TileCount="10" SectionPerLane="1" LanePerSection="1">
  <TileSet TileNamingConvention="FiveDigit">
    <Tiles>
      <Tile>1_11102</Tile>
      <Tile>1_21102</Tile>
       .
       .
       .
     </Tiles>
  </TileSet>
</FlowcellLayout>
<ImageDimensions Width="2592" Height="1944"/>
<ImageChannels>
  <Name>Red</Name>
  <Name>Green</Name>
</ImageChannels>

My SampleSheet.csv has the following format:

[Header]
Local Run Manager Analysis Id,7007
Experiment Name,2020-10-29
Date,2020-10-29
Module,GenerateFASTQ - 2.0.1
Workflow,GenerateFASTQ
Library Prep Kit,Nextera DNA CD Indexes - 96 Indexes Plated
Chemistry,Amplicon

[Reads]
151

[Settings]
adapter,CTGTCTCTTATACACATCT

[Data]
Sample_ID,Sample_Name,Description,Index_Plate_Well,index,I7_Index_ID,index2,I5_Index_ID,Sample_Project
1,1,,A01,ATTACTCG,H701,TATAGCCT,H505,
2,2,,A02,ATTACTCG,H702,ATAGAGGC,H506,
3,3,,A03,ATTACTCG,H703,CCTATCCT,H517,
4,4,,A04,ATTACTCG,H705,GGCTCTGA,H505,
5,5,,A05,ATTACTCG,H707,AGGCGAAG,H506,
6,6,,B01,ATTACTCG,H702,TAATCTTA,H517,
7,7,,B02,ATTACTCG,H703,CAGGACGT,H505,
8,8,,B03,ATTACTCG,H701,GTACTGAC,H506,
9,9,,B04,TCCGGAGA,H707,TATAGCCT,H517,
10,10,,B05,TCCGGAGA,H723,ATAGAGGC,H505,
11,11,,C01,TCCGGAGA,H703,CCTATCCT,H506,
.
.
.

I used the following command line:

bcl2fastq --runfolder-dir /proj/data --output-dir /proj/data --sample-sheet /proj/data/SampleSheet.csv --use-bases-mask Y*,I8n*,I8n*,Y* --barcode-mismatches 0

But I get an error for --use-bases-mask Y*,I8n*,I8n*,Y*. I am not sure whether this is a suitable --use-bases-mask value for single-end reads with dual indexes.

The error:

ERROR: bcl2fastq::common::Exception: 2020-Nov-05 17:44:00: Success (0): /sw/apps/bioinfo/bcl2fastq/2.20.0/src/bcl2fastq/src/cxx/lib/layout/UseBasesMask.cpp(61): Throw in function bcl2fastq::layout::UseBasesMask::UseBasesMask(std::string, std::vector<bcl2fastq::layout::ReadMetadata>::const_iterator, std::vector<bcl2fastq::layout::ReadMetadata>::const_iterator)
Dynamic exception type: boost::exception_detail::clone_impl<bcl2fastq::layout::UseBasesMaskFormatError>
std::exception::what: UseBasesMask formatting error. A mask must be specified for each read. Number of reads: 3 Base masks: 'y*,i8n*,i8n*,y*'

Can anyone please help me with this issue?

sequencing software error bcl2fastq • 6.7k views
3.5 years ago
GenoMax 141k

You should just use --use-bases-mask Y*,I8,I8. There are no extra bases in index reads so specifying an exact I8,I8 should do the trick and there is no second read so you don't need the second Y*.
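To illustrate the "one mask per read" rule behind that error message, here is a minimal Python sketch that derives an explicit mask from the `<Reads>` block of RunInfo.xml and checks whether a given mask has the right number of entries. The helper names (`bases_mask_from_runinfo`, `mask_matches_reads`) and the trimmed XML snippet are made up for this example; the sketch only mirrors the shape of the mask and is not necessarily what bcl2fastq computes internally.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <Reads> block from the RunInfo.xml posted in the question.
RUNINFO_SNIPPET = """<RunInfo Version="4">
  <Run Id="201029_MN00153_0075_A000H37WNG" Number="75">
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N"/>
      <Read Number="2" NumCycles="8" IsIndexedRead="Y"/>
      <Read Number="3" NumCycles="8" IsIndexedRead="Y"/>
    </Reads>
  </Run>
</RunInfo>"""


def bases_mask_from_runinfo(xml_text):
    """Build one mask entry per <Read>: Y<n> for sequence reads, I<n> for index reads."""
    root = ET.fromstring(xml_text)
    reads = sorted(root.iter("Read"), key=lambda r: int(r.get("Number")))
    parts = []
    for read in reads:
        prefix = "I" if read.get("IsIndexedRead") == "Y" else "Y"
        parts.append(f"{prefix}{read.get('NumCycles')}")
    return ",".join(parts)


def mask_matches_reads(mask, xml_text):
    """True if the mask has exactly one comma-separated entry per read in RunInfo.xml."""
    n_reads = len(list(ET.fromstring(xml_text).iter("Read")))
    return len(mask.split(",")) == n_reads


print(bases_mask_from_runinfo(RUNINFO_SNIPPET))
print(mask_matches_reads("Y*,I8n*,I8n*,Y*", RUNINFO_SNIPPET))
```

The mask from the question has four entries for a run with three reads, which is why the check fails for it.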


@genomax Thank you for your comment, it helped to get rid of the --use-bases-mask error. But I am now facing another problem: I get just one FASTQ file, Undetermined. Could you suggest how to get all of the FASTQ files properly? Are there any checkpoints? Could this problem be caused by the adapter sequence in my SampleSheet.csv file?


Double check your SampleSheet.csv above. Did you make it using Illumina's Experiment Manager? Your [Data] section headers do not appear to be right.

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description

It was generated automatically by the sequencing instrument. Which columns should be removed? I would also like to point out that each group of samples (for example, 8) shares the same I7 index, while their I5 indices all differ. [This is clear if you double-check my SampleSheet.csv file above.]


If the sequencer made the sample sheet, then it must be right for on-board demultiplexing, but it looks like you are running bcl2fastq off the sequencer.

If you see in your Undetermined file that the invariant index (which you think is i7) is actually in the second position, you may need to flip those indexes in your SampleSheet.csv. Try that out.


I just double-checked. The indices reported in the Undetermined FASTQ file headers do not exist in the SampleSheet.csv. Can this be the problem?


Definitely. Make sure you have the correct entries in SampleSheet.csv. Double check with whoever made the libraries if needed.

If nothing works then I will point you to some code I have here that will tell you a list of indexes present and their numbers: C: Demultiplexing reads with index present in the labels


Hi @genomax, if you saw my latest answer, you have probably noticed that I succeeded in assigning reads to each sample FASTQ file. My code now gives results, but I get very few reads per FASTQ file: about 22 KB for the actual samples and about 1.7 GB for the Undetermined FASTQ file. Could you help me figure out which step deserves the most attention? I get the same result with or without --use-bases-mask. I also provided the RC of the second index in the SampleSheet.csv.


That means something is still wrong. Use the code I mentioned in my comment above to figure out which sequences are ending up in the Undetermined file and work out what needs to be done.
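For readers without access to the linked code, the idea can be sketched in a few lines of Python. This is not the code from the linked post, only an illustration that assumes standard 4-line FASTQ records whose header ends in `:<index1>+<index2>`; the function name `count_index_pairs` and the tiny in-memory example are made up for this sketch.

```python
from collections import Counter


def count_index_pairs(fastq_lines):
    """Count the index string after the last ':' of each FASTQ header line.

    Assumes plain 4-line records (header, sequence, '+', qualities).
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # every fourth line, starting at 0, is a header
            counts[line.rstrip().rsplit(":", 1)[-1]] += 1
    return counts


# Tiny in-memory example with shortened, made-up sequence and quality lines.
example = [
    "@MN00153:75:000H37WNG:1:11102:8051:1085 1:N:0:CTGGNGGG+AGATCTCG",
    "CATTG", "+", "AFFFF",
    "@MN00153:75:000H37WNG:1:11102:16389:1085 1:N:0:GGGGNGGG+AGATCTCG",
    "GAAGT", "+", "FFFFF",
    "@MN00153:75:000H37WNG:1:11102:6711:1085 1:N:0:GGGGNGGG+AGATCTCG",
    "TTACG", "+", "FFFFF",
]

for pair, n in count_index_pairs(example).most_common():
    print(f"{pair}:  {n}")
```

For a real gzipped file, the same function can be fed an open handle, for example `gzip.open(path, "rt")` from the standard library, where `path` points at the Undetermined FASTQ file.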


I ran your code on the Undetermined file; here is the top of the output, sorted by count. AGATCTCG does not appear among either my i7 or my i5 indices in the SampleSheet.csv file. Any recommendations?

GGGGGGGG+AGATCTCG:  2931876
GTGGGGGG+AGATCTCG:  978264
TGGGGGGG+AGATCTCG:  933397
TTGGGGGG+AGATCTCG:  792850
GCGGGGGG+AGATCTCG:  557570
CGGGGGGG+AGATCTCG:  505601
GGGGTGGG+AGATCTCG:  477725
TCGGGGGG+AGATCTCG:  414075
GGTGGGGG+AGATCTCG:  413108
CTGGGGGG+AGATCTCG:  404830
GGGGCGGG+AGATCTCG:  325213
GGCGGGGG+AGATCTCG:  285297

I am afraid something is not right. I have a feeling that index 1 read has failed (in order listed above). MiniSeq is a 2-color machine so all those G's indicate no signal.
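As a rough way to quantify this, the sketch below flags index pairs whose first index is mostly G. The helper names (`g_fraction`, `looks_like_no_signal`) and the 0.75 threshold are arbitrary assumptions for illustration, not an Illumina-defined criterion.

```python
def g_fraction(seq):
    """Fraction of bases in seq that are G (0.0 for an empty string)."""
    return seq.upper().count("G") / len(seq) if seq else 0.0


def looks_like_no_signal(index_pair, threshold=0.75):
    """True if index 1 (the part before '+') is at least `threshold` G.

    The threshold is an assumption picked for this example.
    """
    index1 = index_pair.split("+")[0]
    return g_fraction(index1) >= threshold


# Two pairs from the Undetermined counts above, plus one pair from the SampleSheet.csv.
for pair in ["GGGGGGGG+AGATCTCG", "TTGGGGGG+AGATCTCG", "ATTACTCG+TATAGCCT"]:
    print(pair, looks_like_no_signal(pair))
```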

As for AGATCTCG, are you sure there was no error in your original SampleSheet with the invariant index? That is clearly what the sequencer sees, but again, if there was an index read failure, who knows ....

I am going to suggest that you contact Illumina tech support and have them take a look at this run remotely. If there was a machine/reagent issue then they will replace the reagents as long as you have a maintenance contract on this sequencer.

While waiting on a return call from Illumina, take a random set of reads from the Undetermined file and blast them at NCBI to see if they are all phiX by chance. If some belong to the genome you are working with, then it would support my observation that there was some kind of index read failure. If they are all phiX, then your libraries may have failed.


Thank you so much for your detailed inputs. I will consider all and will also update you about the final decision.


The read excerpts below indicate that the sequence associated with those indices is PhiX. I don't think those are supposed to have indices at all, so I think the missing index is okay.


Dear @genomax, after asking our wet-lab colleague, I just noticed that the adapters assigned by the sequencing instrument belong to a different library preparation kit (Nextera), whereas for our experiment we used the TruSeq CD library kit. Can this cause the problem? Should I remove the Nextera information from the SampleSheet?


If you think the wrong library got sequenced then replace the wrong library information in samplesheet and see if the samples demultiplex. If they do then you have your answer. It is possible that the wrong adapters were used by mistake to make the library (human error).


Hi again, after a long search we are still unsure how to proceed. We decided to run bcl2fastq naively (not for demultiplexing) in order to get 3 separate FASTQ files reporting the actual inserts, index 1, and index 2. Do you have any experience with this kind of run?


Did you talk with Illumina tech support? I think that is your best option. If this run had failed on index reads then you may be chasing ghosts trying to demultiplex this data.

You can run bcl2fastq with the option --create-fastq-for-index-reads to create separate files for the index reads. Don't provide any index info if you want to send all reads to "Undetermined" files.


Thank you so much for your kind help, much appreciated. I agree with you, and I am also getting somewhat frustrated. I have not heard anything from them yet, but we want to push this to its limit. We suspect that bcl2fastq is not picking up the correct index information from the SampleSheet.csv. So with --create-fastq-for-index-reads, will I only get FASTQ files for the indexes? What about the actual inserts? Is there any option to get 3 separate FASTQ files (for the insert, index 1, and index 2, respectively)?


You will get three separate files. Read1, Index 1 and Index 2.

If the run failed during sequencing, then the data you have in hand would be completely unreliable. The examples of indexes you posted above seem to indicate this outcome.


Using --create-fastq-for-index-reads is the final step for us. I think with these data we can definitely reach a conclusion on whether there were index read failures or not. Also, one thing about the second index that I posted above: it is not in our index list (neither index 1 nor index 2).


Dear @genomax, I have an update as well as a question for you. I ran bcl2fastq in a naive way with --create-fastq-for-index-reads. After the successful naive run, I randomly chose five index 1 sequences, five index 2 sequences, and their reverse complements, and looked at their abundances in the three files. It seems that I2 provides information about Index 2; however, I1 does not give information about Index 1. On the other hand, we do detect I1 in R1 itself.

Question: I do not understand why this happens, since the read starts right before the insert and can run up to index 1. In theory, index 2 should not be detected.

One more thing I need to mention about the AGATCTCG index 2 above: it is exactly the reverse complement of the sequence next to the index primer binding position in the TruSeq Universal P5 adapter.


If you are looking for things to try, try adding a project name, then your fastqs should end up in a folder with that name. (I also recommend not running with barcode-mismatches = 0. You will throw away a bunch of fine reads.)


My undetermined Fastq file header is:

@MN00153:75:000H37WNG:1:11102:8051:1085 1:N:0:CTGGNGGG+AGATCTCG
CATTGTAGCATTGTGCCNATTCATCCATTAACNTCTCANTAACAGATACAAACTCATCACGAACGTCAGNAGCAGCCTTATGNNCGTCAACATACATNTCACCATTATCGAACTCAACGNCCTGNATNCGAAAAGNCAGAATCTCTTCCAA
+
AFFFFFFFFFFFFFFFF#FFFFFFFFFFFFFF#FFFFF#FFFFFFFFFFFFFFFFFFFFFFFFFAFFFF#FFFFFFFFFFFF##FFFFFFFFFFFFF#FFFFFFFFFFFFFAFFFFFFA#FFAF#FF#/FFFFF/#FFFFFFFFFFFF=FF
@MN00153:75:000H37WNG:1:11102:16389:1085 1:N:0:GGGGNGGG+AGATCTCG
GAAGTGTCCGCATAAAGNGCACCGCATGGAAANGAAGANGGCCATTAGCTGTACCATACTCAGGCACACNAAAATACTGATANNAGTCGGCGTGTGANTCATTAGCCTTGCGACCCTCGNCAGCNAGNACCATACNACCAATATCACGAAA
+
FFFFFFFFFFFFFFFFF#FF/FF=FAFFFFFF#AFFFF#FFFFFFFFFFFFFFFAFFFFFFFFFFFFFF#FFFFFFFFFFFF##FFFFFFFFFFFFF#FFF/FFFAF/=FFFAFAF/FF#FFFF#F/#FFFFFFF#FFFFFFFFFA=AFFF
@MN00153:75:000H37WNG:1:11102:6711:1085 1:N:0:TTGGNGGG+AGATCTCG
TTACGACGCGACGCCGTNCAACCAGATATTGANGCAGANCGCAAAAAGAGAGATGAGATTGAGGCTGGGNAAAGTTACTGTANNCGACGTTTTGGCGNCGCAACCTGTGACGACAAATCNGCTCNAANTTATGCGNGCTTCGATAAAAATG

If you blast those sequences, you'll see they are PhiX.

3.5 years ago
Apex92 ▴ 280

Thanks to @genomax and @swbarnes2 for the hints.

I solved the problem, and I am sharing the solution here in the hope that it helps others in the future. For sequencing on the MiniSeq, NextSeq, and HiSeq 3000/4000, the instrument uses a different dual-indexing workflow, which requires the reverse complement (RC) of the second index (i5) sequence for the bcl2fastq conversion. This RC must be provided in the SampleSheet.csv file. Additionally, even for a dual-index bcl2fastq conversion, we do not necessarily need to specify the --use-bases-mask option, because the bcl2fastq command line automatically reads that information from RunInfo.xml (this detail is given in the bcl2fastq2 Conversion Software v2.20 Software Guide (15051736)).
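For anyone who wants to script the reverse-complement step, here is a minimal Python sketch. It assumes the [Data] section has already been cut out of SampleSheet.csv as plain CSV text with an `index2` column, as in the sheet posted in the question; the function names `reverse_complement` and `rc_index2` are made up for this example. Whether the RC is actually needed depends on the instrument and on what the sheet already contains, so check the resulting indexes against the Undetermined headers before relying on it.

```python
import csv
import io

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def rc_index2(data_section_csv):
    """Rewrite the [Data] CSV text with every index2 value reverse-complemented."""
    reader = csv.DictReader(io.StringIO(data_section_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["index2"] = reverse_complement(row["index2"])
        writer.writerow(row)
    return out.getvalue()


# First two rows of the [Data] section from the question.
data_section = (
    "Sample_ID,Sample_Name,Description,Index_Plate_Well,index,I7_Index_ID,"
    "index2,I5_Index_ID,Sample_Project\n"
    "1,1,,A01,ATTACTCG,H701,TATAGCCT,H505,\n"
    "2,2,,A02,ATTACTCG,H702,ATAGAGGC,H506,\n"
)

print(rc_index2(data_section))
```

Only the `index2` column is changed; the `index` (i7) column and all other fields are written back as they were.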

Related post link .
