Question: bcl2fastq2: how to correctly use --use-bases-mask for different Illumina sequencing methods?
17 months ago, badredda wrote:

Hello,

I need your help with the --use-bases-mask parameter of bcl2fastq2 when demultiplexing data generated by Illumina sequencers. As you know, sequencing is mostly done as Paired-End (PE) or Single-End (SE), and the libraries carry either a single index or a dual (double) index. As per Illumina's definition:

Single and Dual Indexing

The number of index sequences added to samples differs for single-indexed and dual-indexed sequencing.

Single-indexed libraries — Adds up to 48 unique six-base Index 1 (i7) sequences to generate up to 48 uniquely tagged libraries.

Dual-indexed libraries — Adds up to 24 unique eight-base Index 1 (i7) sequences and up to 16 unique eight-base Index 2 (i5) sequences, generating up to 384 uniquely tagged libraries. The IDT for Illumina TruSeq UD Indexes are provided as index pairs and can generate up to 96 uniquely tagged libraries. These indexes add up to 96 unique eight-base Index 1 sequences and up to 96 unique eight-base Index 2 indexes.

During indexed sequencing, the index is sequenced in a separate read, called the Index Read, where a new sequencing primer is annealed. When libraries are dual-indexed, the sequencing run includes two additional reads, called the Index 1 Read and Index 2 Read.

Knowing this, I have two questions:

  1. Is it acceptable to mix single-index and dual-index libraries on the same flowcell (e.g. HiSeq 4000) when the sequencer was configured for a dual-index run?
  2. How can we demultiplex such data, given that the file generated by the sequencer (RunInfo.xml) describes a dual-index run? In other words, demultiplexing the dual-index lanes works fine with the RunInfo.xml as provided, but what should I pass to --use-bases-mask for the single-index lanes?

Also, I know that for --use-bases-mask, we can use the following parameters for different types of sequencing:

  • Single-End sequencing: Y*,I6N*
  • Paired-End sequencing:

    • Dual-Indexing: Y*,I*,I*,Y*
    • No Index: Y*,Y* (Thanks to Devon Ryan)
    • Single Indexing: Y*,I6N,Y* (Thanks to Devon Ryan)
    • In-read barcode in the first read for some of the samples, but the run was PE dual-index: I5Y*,N*,N*,Y* (Thanks to igor)
    • 10x Genomics Single Cell 3' v1 kit: Y98,Y14,I8,Y10 (Thanks to igor)
    • 10x Genomics Single Cell 3' v1 kit + more standard libraries on the same run: Y98N*,Y14N*,I8N*,Y10N* (Thanks to igor)

    Also, could you please state what other types of parameters could be used in different cases ? (for future readers)

Thanks for your time and help. Don't forget to upvote this post please so users can find this post.

modified 12 months ago • written 17 months ago by badredda

17 months ago, Devon Ryan (Freiburg, Germany) wrote:
  1. Yes, though bcl2fastq2 won't be able to handle it in a single step. We commonly do this and we then process each flow cell in compatible chunks, using --tiles. As an example, if the first two lanes of a flow cell have compatible indices (both in number and length) then you need --tiles s_1,s_2. You then also need multiple output directories per flow cell.
  2. See above. In short, you use one --use-bases-mask at a time.

Note that unless you have a mixture of either barcode lengths between lanes or barcode strategies (dual vs. single) you don't actually need --use-bases-mask at all.

For PE and no index you could use --use-bases-mask Y*,Y*, unless the run included an index read. For a single index it'd then be Y*,I6N,Y*.
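
The chunked approach from point 1 can be sketched as a small Python helper that builds one bcl2fastq command per group of compatible lanes. This is only an illustration: the lane grouping, sample sheet names, output directories and run folder path are hypothetical, and the mask for lanes 5 and 6 is the one suggested further down in this thread.

```python
def build_bcl2fastq_commands(run_dir, chunks):
    """Return one bcl2fastq argument list per chunk of compatible lanes."""
    commands = []
    for chunk in chunks:
        # --tiles takes a comma-separated list such as s_1,s_2
        tiles = ",".join(f"s_{lane}" for lane in chunk["lanes"])
        cmd = [
            "bcl2fastq",
            "--runfolder-dir", run_dir,
            "--output-dir", chunk["output_dir"],  # one output directory per chunk
            "--sample-sheet", chunk["sample_sheet"],
            "--tiles", tiles,
        ]
        # Only pass a mask when the default derived from RunInfo.xml is not wanted.
        if chunk.get("mask"):
            cmd += ["--use-bases-mask", chunk["mask"]]
        commands.append(cmd)
    return commands


# Hypothetical layout: lanes 1-2 dual-indexed, lanes 5-6 single-indexed.
chunks = [
    {"lanes": [1, 2], "output_dir": "out_dual", "sample_sheet": "dual.csv"},
    {"lanes": [5, 6], "output_dir": "out_single", "sample_sheet": "single.csv",
     "mask": "Y*,I6nn,nnnnnnnn,Y*"},
]
for cmd in build_bcl2fastq_commands("/path/to/runfolder", chunks):
    print(" ".join(cmd))
```

The helper only prints the commands; running them (for example with subprocess.run) is left out on purpose so the sketch has no side effects.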

written 17 months ago by Devon Ryan

Dear Ryan,

Thanks for your reply. The single index is 6 bases long while the dual indexes are 8 bases each, and all indexes are different from one another. Let's take this RunInfo.xml as an example (uploaded to my Google Drive):

https://drive.google.com/open?id=1EJHnNuTyW8BfDLdE4yoBxp78rw8bYsHF

How can I proceed, knowing that, for example, lanes 5 and 6 contain the single-index data?

Thanks

written 17 months ago by badredda

--use-bases-mask Y*,I6nn,nnnnnnnn,Y* in that case.
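
To see what such a mask does cycle by cycle, here is a rough Python sketch that expands a mask against the read lengths of a run. The parser is my own simplified reading of the mask syntax (Y, I or N followed by an optional count or *), not bcl2fastq's actual implementation, and the read lengths 151, 8, 8, 151 are only an assumed layout. Treating a mask that does not cover every cycle as an error is also an assumption of this sketch.

```python
import re

TOKEN = re.compile(r"([YyNnIi])(\d+|\*)?")


def expand_mask(mask, read_lengths):
    """Expand e.g. 'Y*,I6nn' into one letter per cycle for each read."""
    parts = mask.split(",")
    if len(parts) != len(read_lengths):
        raise ValueError("mask has %d parts but the run has %d reads"
                         % (len(parts), len(read_lengths)))
    expanded = []
    for part, cycles in zip(parts, read_lengths):
        tokens = TOKEN.findall(part)
        if "".join(letter + count for letter, count in tokens) != part:
            raise ValueError("cannot parse mask part %r" % part)
        # Cycles claimed by explicit counts or bare letters (a bare letter = 1 cycle).
        fixed = sum(int(count) if count else 1
                    for _, count in tokens if count != "*")
        stars = sum(1 for _, count in tokens if count == "*")
        remaining = cycles - fixed
        if stars > 1 or remaining < 0 or (stars == 0 and remaining != 0):
            raise ValueError("mask part %r does not fit %d cycles" % (part, cycles))
        out = []
        for letter, count in tokens:
            if count == "*":
                n = remaining  # '*' absorbs whatever cycles are left
            elif count:
                n = int(count)
            else:
                n = 1
            out.append(letter.upper() * n)
        expanded.append("".join(out))
    return expanded


reads = expand_mask("Y*,I6nn,nnnnnnnn,Y*", [151, 8, 8, 151])
print(reads[1])  # index read 1: six index cycles, two ignored
print(reads[2])  # index read 2: all eight cycles ignored
```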

written 17 months ago by Devon Ryan

badredda you could use a separate --use-bases-mask for lanes 5 and 6 and then a different one for other lanes.

written 17 months ago by genomax

I'm running into a problem like this one. Could you help me?

my RunInfo.xml:

<?xml version="1.0"?>
<RunInfo xmlns:xsd="..." xmlns:xsi="..." Version="4">
  <Run Id="190219_NB500954_0035_AHGMJVAFXY" Number="35">
    <Flowcell>HGMJVAFXY</Flowcell>
    <Instrument>NB500954</Instrument>
    <Date>190219</Date>
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N" />
      <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="4" NumCycles="151" IsIndexedRead="N" />
    </Reads>
    <FlowcellLayout LaneCount="4" SurfaceCount="2" SwathCount="1" TileCount="12" SectionPerLane="3" LanePerSection="2">
      <TileSet TileNamingConvention="FiveDigit">
        <Tiles>
          <Tile>1_11101</Tile>
          <Tile>1_21101</Tile>
          <Tile>1_11102</Tile>
          ...
          <Tile>4_11612</Tile>
          <Tile>4_21612</Tile>
        </Tiles>
      </TileSet>
    </FlowcellLayout>
    <ImageDimensions Width="2592" Height="1944" />
    <ImageChannels>
      <Name>Red</Name>
      <Name>Green</Name>
    </ImageChannels>
  </Run>
</RunInfo>

We normally run a 151x8x8x151 amplicon panel, but this time we added a single-indexed panel with a 12-base index. I tried --use-bases-mask Y*,I12,,Y* but received the error below:

2019-02-25 21:45:12 [7faca61f4780] ERROR: bcl2fastq::common::Exception: 2019-Feb-25 21:45:12: Success (0): /tmp/bcl2fastq/bcl2fastq/src/cxx/lib/layout/Layout.cpp(378): Throw in function void bcl2fastq::layout::setIndexReadMetadata(const std::vector<long unsigned int>&, bcl2fastq::layout::ReadMetadata&, size_t)
Dynamic exception type: boost::exception_detail::clone_impl<bcl2fastq::common::InputDataError>
std::exception::what: Barcodes in sample sheet are longer than the index length found in RunInfo.xml.

I have tried to change the RunInfo.xml index values to 12 as:

<Read Number="2" NumCycles="12" IsIndexedRead="Y" />
<Read Number="3" NumCycles="12" IsIndexedRead="Y" />

But my FASTQs came out empty. Any help?

modified 13 months ago • written 13 months ago by geocarvalho

If you only ran 8 bases for the first index, that's all you've got. You can't invent data you don't have by futzing with the command line.
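
A quick way to catch this before running bcl2fastq is to compare the barcode lengths in the sample sheet against the index cycles recorded in RunInfo.xml. The sketch below parses a trimmed copy of the Reads section posted above; the sample names and barcode sequences are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the RunInfo.xml posted above; tiles and image info omitted.
RUNINFO_XML = """<RunInfo Version="4">
  <Run Id="190219_NB500954_0035_AHGMJVAFXY" Number="35">
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N" />
      <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="4" NumCycles="151" IsIndexedRead="N" />
    </Reads>
  </Run>
</RunInfo>"""


def index_read_cycles(runinfo_xml):
    """Return the cycle count of each index read, in run order."""
    root = ET.fromstring(runinfo_xml)
    return [int(read.get("NumCycles"))
            for read in root.iter("Read")
            if read.get("IsIndexedRead") == "Y"]


def barcodes_longer_than_run(samples, cycles):
    """List (sample, index number, barcode length, cycles) for barcodes that do not fit."""
    problems = []
    for name, indexes in samples.items():
        for pos, (seq, n) in enumerate(zip(indexes, cycles), start=1):
            if len(seq) > n:
                problems.append((name, pos, len(seq), n))
    return problems


cycles = index_read_cycles(RUNINFO_XML)
# Hypothetical sample sheet entries: (index 1, optional index 2).
samples = {
    "amplicon_01": ("ACGTACGT", "TGCATGCA"),
    "panel_01": ("ACGTACGTACGT",),
}
print(cycles)
print(barcodes_longer_than_run(samples, cycles))
```

Any sample reported here has a barcode longer than what was sequenced, which is the condition the error message complains about.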

written 13 months ago by swbarnes2

Was the single-indexed panel the only one on the flow cell, or was it mixed with normal-length indices? Was it actually 12 bases, or did you dual index it with 6-base indices? If the former, then only the first 8 bases of the barcode were actually read, and it's going to end up in the undetermined indices no matter what you do. You can write a bit of Python to retrieve it then.
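
A minimal sketch of that kind of script follows, assuming the usual Illumina FASTQ header layout in which the last colon-separated field holds the observed index (i7+i5 for dual-index runs). The function name, the demo records and the wanted barcode are hypothetical, and only exact matches are kept.

```python
import io


def reads_with_i7(fastq_handle, wanted_i7):
    """Yield 4-line FASTQ records whose header i7 index is in wanted_i7."""
    # Group the line iterator into chunks of four lines (one FASTQ record each).
    records = zip(*[iter(fastq_handle)] * 4)
    for header, seq, plus, qual in records:
        index_field = header.rstrip().rsplit(":", 1)[-1]  # e.g. ACGTACGT+TTTTTTTT
        i7 = index_field.split("+")[0]
        if i7 in wanted_i7:
            yield header, seq, plus, qual


# Two made-up records standing in for an Undetermined FASTQ file.
demo = io.StringIO(
    "@NB500954:35:HGMJVAFXY:1:11101:1000:1000 1:N:0:ACGTACGT+TTTTTTTT\n"
    "ACGT\n+\nIIII\n"
    "@NB500954:35:HGMJVAFXY:1:11101:1001:1001 1:N:0:GGGGGGGG+TTTTTTTT\n"
    "TTTT\n+\nIIII\n"
)
wanted = {"ACGTACGT"}  # first 8 bases of a hypothetical 12-base barcode
kept = list(reads_with_i7(demo, wanted))
print(len(kept))
```

For a real file the handle would typically come from gzip.open(path, "rt"), and the same filter would have to be applied to the matching R2 file to keep the pairs in sync.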

written 13 months ago by Devon Ryan

Thank you guys! It was mixed with normal-length indices (8 bases), and it was 12 bases on one side. Should the Python script open the Undetermined FASTQ and search for the reads with the possible index in the header?

modified 13 months ago • written 13 months ago by geocarvalho

As @swbarnes2 pointed out above, your RunInfo.xml file shows that this run was set up as 151x8x8x151.

<Reads>
  <Read Number="1" NumCycles="151" IsIndexedRead="N" />
  <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
  <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
  <Read Number="4" NumCycles="151" IsIndexedRead="N" />
</Reads>

i.e. with 8 cycles on index 1 and 8 cycles on Index 2. There is NO way to recover data for 12 cycles for Index 1 since those additional 4 cycles were never sequenced.

If the 8 bp of Index 1 that were sequenced are discriminatory enough, you may be able to recover the data; otherwise this run will have to be repeated for the samples with 12 bp indexes.
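
One way to judge "discriminatory enough" is to truncate every Index 1 barcode to the 8 sequenced bases and look at the smallest pairwise Hamming distance. The barcodes below are invented for illustration; what distance is acceptable depends on the --barcode-mismatches setting used, so the sketch only reports the number.

```python
from itertools import combinations


def min_truncated_distance(barcodes, length):
    """Smallest pairwise Hamming distance after cutting barcodes to `length` bases."""
    prefixes = [barcode[:length] for barcode in barcodes]
    return min(
        sum(a != b for a, b in zip(p, q))
        for p, q in combinations(prefixes, 2)
    )


# Hypothetical Index 1 barcodes: two 12-base ones and one regular 8-base one.
barcodes = ["ACGTACGTACGT", "ACGTTGCAGGCC", "TTGACCAT"]
print(min_truncated_distance(barcodes, 8))
```

A result of 0 means two samples become indistinguishable once truncated, in which case no sample sheet change can separate them.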

modified 13 months ago • written 13 months ago by genomax

Thanks, genomax. The 8 bp from index 1 were specific enough to recover them, so I just adjusted the sample sheet used. Best regards.

modified 13 months ago • written 13 months ago by geocarvalho

17 months ago, igor (United States) wrote:

could you please state what other types of parameters could be used in different cases ?

Any parameters are possible. The parameter specifies how you want to interpret the actual sequencing output. You have to make sure that the number of reads and their lengths match what was run.

You can use different --use-bases-mask for different lanes or just provide a sample sheet for the lanes that you are interested in.

There are many odd library options. For a hypothetical example, you may have an in-read barcode in the first read for some of the samples, but the run was PE dual-index. Then you might have: I5Y*,N*,N*,Y* (treat first 5 bases of R1 as index and the rest as actual read, ignore I1 and I2, then treat R2 as normal read).

For a real-life example, the 10x Genomics Single Cell 3' v1 kit required this: Y98,Y14,I8,Y10. This used the second index read as the bcl2fastq index, but kept the other reads for additional processing with more specialized software (Cell Ranger). If you had other more standard libraries on the same run, you would need to add Ns to ignore additional bases: Y98N*,Y14N*,I8N*,Y10N*.
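
The padding pattern in that last mask is mechanical enough to generate. As a small sketch (the function name and input format are my own, not part of any tool), each read is described by its role and the number of cycles to keep, and every part gets a trailing N* to ignore whatever extra cycles the run contains:

```python
def padded_mask(segments):
    """Build a mask that keeps n cycles per read and ignores the rest with N*."""
    return ",".join(f"{kind}{n}N*" for kind, n in segments)


# The 10x Genomics Single Cell 3' v1 layout quoted above.
print(padded_mask([("Y", 98), ("Y", 14), ("I", 8), ("Y", 10)]))
```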

modified 17 months ago • written 17 months ago by igor

17 months ago, swbarnes2 (United States) wrote:

Is it acceptable to mix single index and dual index on the same flowcell (e.g. Hiseq 4000) knowing that we configured the sequencer as a dual index run ?

Yes. I do this all the time, without messing with base masking or subsetting by lane/tile.

Did you try it the easy way first?

written 17 months ago by swbarnes2

Does that work now? It used to break bcl2fastq2.

written 17 months ago by Devon Ryan

I frequently have a mix of samples on one flow cell, some with two indices, some with one. I used to break up into two sample sheets, but I don't now, and it works fine. I can't remember testing having indices of differing lengths, but I think that will work too.

written 17 months ago by swbarnes2

Interesting, I wonder when Illumina enabled this, it would seriously simplify my demultiplexing workflow :)

written 17 months ago by Devon Ryan