Question

bcl2fastq conversion with specifying exact match of indices

8 months ago

Apex92 ▴ 280

Hello everyone,

I recently ran a NextSeq 2000 using 6-nucleotide Illumina TruSeq unique indices. My goal is to demultiplex using bcl2fastq and extract only the reads that match my indices. However, there's a complication: the run included samples from another person whose 8-nucleotide R1 indices overlap with my indices.

I'm looking for advice on:

How can I run bcl2fastq so that it extracts only the reads that exactly match my own 6-nucleotide indices?
Is it possible to demultiplex while also writing separate fastq files for the index reads? I could then match index entries to reads in the sample fastq files and remove reads based on the index fastq file.

Any insights on these methods would be appreciated. Thank you.

genome RNA-Seq sequencing • 1.8k views

ADD COMMENT • link updated 8 months ago by GenoMax 141k • written 8 months ago by Apex92 ▴ 280

How similar are the 8mers to your 6mers? Like perfect overlap and just 2 bases longer or "similar"?

ADD REPLY • link 8 months ago by ATpoint 82k

It is a perfect overlap, like "ATCGAA" vs "ATCGAAGG". The first six nucleotides are identical; the 8-mers just have two extra nucleotides at the end.

ADD REPLY • link 8 months ago by Apex92 ▴ 280

Did you do the actual run with 6 index cycles? Are these single-index samples?

If the answer to both is yes, then you will not be able to tell the samples apart during demultiplexing. If the other person's samples are from a different species, you may need to use bbsplit.sh to bin the reads. That would be about the best you can do.

If you ran the run with 8 index cycles then you should be able to separate the samples.

ADD REPLY • link 8 months ago by GenoMax 141k

This is the run information:

        <Flowcell>AACNGNHM5</Flowcell>
        <Instrument>VH00349</Instrument>
        <Date>2023-06-19T13:22:58Z</Date>
        <Reads>
            <Read Number="1" NumCycles="61" IsIndexedRead="N" IsReverseComplement="N"/>
            <Read Number="2" NumCycles="8" IsIndexedRead="Y" IsReverseComplement="N"/>
            <Read Number="3" NumCycles="8" IsIndexedRead="Y" IsReverseComplement="Y"/>
            <Read Number="4" NumCycles="61" IsIndexedRead="N" IsReverseComplement="N"/>
        </Reads>
        <FlowcellLayout LaneCount="1" SurfaceCount="2" SwathCount="6" TileCount="11">

In my samples, the R1 index overlaps with the other person's indices. I can demultiplex using only the R1 index, but the other person's samples have to be demultiplexed with both the R1 and R2 indices. So he has no problem retrieving his reads, but I do, because my 6 nt indices overlap with his as described above.

ADD REPLY • link 8 months ago by Apex92 ▴ 280
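
Whether the run used 6 or 8 index cycles can be read straight from the `<Reads>` block above. The following is a minimal Python sketch, not part of the original thread, that lists the cycle count of every indexed read. The `<Run>` wrapper element is added here only so the pasted fragment parses on its own, and `index_read_cycles` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# <Reads> block as posted in the thread. The <Run> wrapper is an assumption,
# added so that the fragment is well-formed XML by itself.
RUN_INFO = """<Run>
  <Reads>
    <Read Number="1" NumCycles="61" IsIndexedRead="N" IsReverseComplement="N"/>
    <Read Number="2" NumCycles="8" IsIndexedRead="Y" IsReverseComplement="N"/>
    <Read Number="3" NumCycles="8" IsIndexedRead="Y" IsReverseComplement="Y"/>
    <Read Number="4" NumCycles="61" IsIndexedRead="N" IsReverseComplement="N"/>
  </Reads>
</Run>"""


def index_read_cycles(xml_text):
    """Return {read number: cycle count} for every read flagged as an index read."""
    root = ET.fromstring(xml_text)
    return {
        int(read.get("Number")): int(read.get("NumCycles"))
        for read in root.iter("Read")
        if read.get("IsIndexedRead") == "Y"
    }


print(index_read_cycles(RUN_INFO))  # {2: 8, 3: 8}
```

Both index reads were sequenced for 8 cycles, so every cluster carries 8 bases for index 1 and 8 for index 2, including the clusters from the 6 nt single-index libraries.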

I already ran demultiplexing with my indices, but I get many reads that do not belong to my experiment. This is because my indices completely overlap with the 8 nt indices, e.g. "ATCGAA" vs "ATCGAAGG". How can I overcome this?

ADD REPLY • link 8 months ago by Apex92 ▴ 280

Here is my samplesheet structure:

Sample_ID,Sample_Name,Description,index,I7_Index_ID
H2,H2,,ATCGAA,RPI_6
P4,P4,,CAGATC,RPI_7
S1,S1,,ACTTGA,RPI_8
sH2,sH2,,GATCAG,RPI_9
sP4,sP4,,TAGCTT,RPI_10
sS1,sS1,,GGCTAC,RPI_11

ADD REPLY • link 8 months ago by Apex92 ▴ 280

Since you only have 6 bp indexes, your samples should show up with two extra bases after the index, AT I think (I don't immediately recall the two bases), so look for these. You could then append AT to your indexes to differentiate your samples from the others. Your samples will also show a phantom index 2 that is not a real i5 index, so that can also be added to the samplesheet.

This is going to take some finagling to sort out. As you have realized, it is not a good idea to have overlapping indexes in a run and a mix of 1D and 2D indexed samples.

ADD REPLY • link 8 months ago by GenoMax 141k
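
The suggestion of padding the 6 nt indexes out to the 8 sequenced bases can be sketched as below. This is an illustrative Python snippet, not something posted in the thread: `pad_indexes` is a hypothetical helper, and the default suffix `AT` is only the guess made in the reply above, so it should be checked against the actual index reads first. The phantom index 2 is not handled here.

```python
import csv
import io

# Samplesheet rows as posted earlier in the thread.
SAMPLESHEET_ROWS = """Sample_ID,Sample_Name,Description,index,I7_Index_ID
H2,H2,,ATCGAA,RPI_6
P4,P4,,CAGATC,RPI_7
S1,S1,,ACTTGA,RPI_8
sH2,sH2,,GATCAG,RPI_9
sP4,sP4,,TAGCTT,RPI_10
sS1,sS1,,GGCTAC,RPI_11
"""


def pad_indexes(csv_text, suffix="AT", target_len=8):
    """Append `suffix` to every index that is exactly len(suffix) short of target_len."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Assumption: the trailing bases are identical for every 6 nt sample.
        if len(row["index"]) + len(suffix) == target_len:
            row["index"] += suffix
        writer.writerow(row)
    return out.getvalue()


print(pad_indexes(SAMPLESHEET_ROWS))
```

With the default suffix, the H2 row becomes `H2,H2,,ATCGAAAT,RPI_6`. Rows whose index is already 8 nt long are written back unchanged.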

Sorry, I did not get it. My indexes are 6 nt, and the samples that are not mine have 8 nt indexes with AT as the extra nucleotides at the end. Is there any straightforward way to retrieve my reads? I was thinking of running the conversion so that it reports the 8 nt index for each read, and then keeping only the reads whose first 6 index nucleotides match my indexes and whose last two nucleotides are not AT. What do you think?

ADD REPLY • link 8 months ago by Apex92 ▴ 280

Now I am confused. Did the other person who had samples on this flowcell also have 6 bp indexes? I thought they had 8 bp dual indexes and you have 6 bp single indexes. Is that not the case?

When you sequence more index cycles than the actual index length, those extra bases show up (I think they are generally AT). If both of you had indexes of identical length, then your only option is to separate the reads based on alignments, assuming the genomes are different enough.

ADD REPLY • link 8 months ago by GenoMax 141k

Yes, that is true: they had 8 bp dual indexes and I have 6 bp single indexes. So, given what I said in my previous comment, is there a suitable way to do this?

ADD REPLY • link 8 months ago by Apex92 ▴ 280

Can you run the code from this post, "Demultiplexing reads with index present in the labels", and show us which combinations of indexes are present in your data? Do this preferably with non-demultiplexed data. You can create non-demultiplexed data by using a blank samplesheet (without any sample lines). That will put all reads in "Undetermined" files.

I am going to posit that reads that have correct index 1 but the non-real index 2 are going to be yours.

ADD REPLY • link 8 months ago by GenoMax 141k

Thank you for your input. I ran demultiplexing with a blank samplesheet and the --create-fastq-for-index-reads option. Now I have Undetermined R1 and R2 reads together with fastq files for the 8 nt index reads. I think I can now look in the R1 index fastq file for entries whose first 6 nucleotides match my indexes and whose last two are not AT, and those are my reads. What do you think?

ADD REPLY • link 8 months ago by Apex92 ▴ 280

Please show the output of the awk command I had asked you to run.

ADD REPLY • link 8 months ago by GenoMax 141k

I ran it on the Undetermined_R1 fastq file, but I got this result:

0: 267920787

ADD REPLY • link 8 months ago by Apex92 ▴ 280

That is odd. Are the index sequences not in the headers? Can you show one example header?

ADD REPLY • link 8 months ago by GenoMax 141k

This is the head of Undetermined_R1:

@VH00349:133:AACNGNHM5:1:1101:18231:1000 1:N:0:0
CTTCCTCGGCCTCCTCCTCAGCGGCNNNNNNNNNGGNGGCAGCAGCCTCTCGGGGGGTAGG
+
CCCCCCCCCCCCCC;CCC;CCC;;C#########;C#CCC;-C--CCCC-CCC--CC-;C-
@VH00349:133:AACNGNHM5:1:1101:18269:1000 1:N:0:0
TGATTTCGTCCAATTCAGCTGGCGCNNNNNNNNNGGNGGCAGGCCCGTCTGCGACGGTCTT
+
CCCCCCCCCCC;CCCCCCCCC;-CC#########CC#CCCCCC--CCCCCCCCCCCCCCCC

ADD REPLY • link 8 months ago by Apex92 ▴ 280

But by running the demultiplexing step without providing any index, I did get fastq files for the R1 and R2 indexes. Here is the head of the R1 index fastq file:

@VH00349:133:AACNGNHM5:1:1101:18231:1000 1:N:0:0
ACTTGAAT
+
CC-C-CCC
@VH00349:133:AACNGNHM5:1:1101:18269:1000 1:N:0:0
GGCTACAT
+
CCCCCCCC

Do you think I can look in the R1 index fastq file for entries whose first 6 nucleotides match my indexes and whose last two are not AT, and treat those as my reads?

ADD REPLY • link 8 months ago by Apex92 ▴ 280
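
The filtering idea in this comment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a tested pipeline: `assign_reads` is a hypothetical helper, matching is exact (no mismatches allowed), and the trailing two bases to accept are a parameter. The two index records shown above pair the S1 and sS1 six-mers with `AT`, so the sketch defaults to keeping that suffix; adjust it once the full tally of index sequences is known.

```python
import io

# 6 nt index -> sample name, taken from the samplesheet posted in the thread.
MY_INDEXES = {
    "ATCGAA": "H2", "CAGATC": "P4", "ACTTGA": "S1",
    "GATCAG": "sH2", "TAGCTT": "sP4", "GGCTAC": "sS1",
}

# The two index-read records quoted in the comment above.
EXAMPLE_TEXT = """@VH00349:133:AACNGNHM5:1:1101:18231:1000 1:N:0:0
ACTTGAAT
+
CC-C-CCC
@VH00349:133:AACNGNHM5:1:1101:18269:1000 1:N:0:0
GGCTACAT
+
CCCCCCCC
"""


def assign_reads(index_fastq, indexes=MY_INDEXES, suffix="AT"):
    """Map read ID -> sample for index reads of the form <known 6-mer><suffix>."""
    assigned = {}
    while True:
        header = index_fastq.readline()
        if not header:
            break
        seq = index_fastq.readline().strip()
        index_fastq.readline()  # '+' separator line
        index_fastq.readline()  # quality line
        sample = indexes.get(seq[:6])
        if sample is not None and seq[6:] == suffix:
            # Keep the read name only: drop the leading '@' and the comment field.
            assigned[header.split()[0][1:]] = sample
    return assigned


print(assign_reads(io.StringIO(EXAMPLE_TEXT)))
```

The returned read IDs could then be used to pull the matching records out of the Undetermined R1 and R2 files, for example with a tool that filters a fastq file by read name.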

Sorry, my apologies. I should have said to run the demultiplexing with one dummy sample that looks something like this:

Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
Dummy,Dummy,,,,NNNNNNNN,,NNNNNNNN,,

We are using 8 N's because we need to get all 8 bases that were sequenced for both indexes irrespective of the sample. This should properly populate the fastq headers with index sequences. Then run the awk script on these files.

ADD REPLY • link 8 months ago by GenoMax 141k
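
Once the headers carry the index sequences, tallying the combinations is a short counting job. The linked awk script is not reproduced in this thread, so the following is a rough Python equivalent written for illustration, with `tally_index_pairs` as a hypothetical name. It assumes the bcl2fastq header convention in which the index field follows the last colon, e.g. `1:N:0:ACTTGAAT+GGGGGGGG`. The records in `TOY_TEXT` are invented toy data, not reads from this run.

```python
import io
from collections import Counter

# Invented toy records for illustration only; the index 2 values are made up.
TOY_TEXT = """@toy_read_1 1:N:0:ACTTGAAT+GGGGGGGG
ACGT
+
CCCC
@toy_read_2 1:N:0:ACTTGAAT+GGGGGGGG
ACGT
+
CCCC
@toy_read_3 1:N:0:GGCTACAT+GGGGGGGG
ACGT
+
CCCC
"""


def tally_index_pairs(fastq):
    """Count how often each index field occurs in the headers of a fastq stream."""
    counts = Counter()
    for line_number, line in enumerate(fastq):
        if line_number % 4 == 0:  # first line of every 4-line record is the header
            # The index field is whatever follows the last ':' in the header.
            counts[line.rstrip().rsplit(":", 1)[-1]] += 1
    return counts


for pair, n in tally_index_pairs(io.StringIO(TOY_TEXT)).most_common():
    print(pair, n)
```

On headers that end in `1:N:0:0`, like the ones from the blank-samplesheet run, every record falls under the single key `0`, which is consistent with the one-line result reported earlier.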