bcl2fastq with xGen Dual Index UMI Adapters to produce 3 read and 2 index fastqs
2.4 years ago
segrossk ▴ 20

I have RNA-Seq data from an Illumina HiSeq run using IDT xGen Dual Index UMI Adapters, where the UMI is added to the i7 index. Typically, I would use the Picard-based pipeline recommended by IDT. However, I was hoping to use bcl2fastq (v2.17.1.14) to produce fastqs directly.

In the xGen dual index UMI library construct, the key thing to note is that the 9bp UMI comes directly after the 8bp i7 index. We use the dual 8bp i5 and i7 indexes to help identify index hopping in the reads. The UMI helps distinguish biological duplicates from PCR duplicates.

The RunInfo.xml has 4 reads: read 1 is 76 cycles (index 'N'), read 2 is 17 cycles (index 'Y'), read 3 is 8 cycles (index 'Y'), and read 4 is 76 cycles (index 'N'). This reflects 76bp paired-end sequencing in which the i7 read holds the 8bp index followed by the 9bp random UMI, and the i5 read holds the 8bp index. The SampleSheet.csv has the corresponding 8bp i7 and 8bp i5 index sequences.
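As a quick sanity check on the read layout described above, the `<Read>` entries of a RunInfo.xml can be listed with a few lines of Python. This is only a sketch: `read_layout` and `EXAMPLE_RUNINFO` are hypothetical names, and the XML shown is a trimmed-down stand-in for a real RunInfo.xml, which carries more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for a real RunInfo.xml (assumption: only the
# <Reads> block matters here; real files contain more elements).
EXAMPLE_RUNINFO = """<RunInfo><Run><Reads>
<Read Number="1" NumCycles="76" IsIndexedRead="N" />
<Read Number="2" NumCycles="17" IsIndexedRead="Y" />
<Read Number="3" NumCycles="8" IsIndexedRead="Y" />
<Read Number="4" NumCycles="76" IsIndexedRead="N" />
</Reads></Run></RunInfo>"""


def read_layout(runinfo_xml):
    """Return [(read_number, cycles, is_index), ...] sorted by read number."""
    root = ET.fromstring(runinfo_xml)
    layout = []
    for read in root.iter("Read"):
        layout.append((
            int(read.get("Number")),
            int(read.get("NumCycles")),
            read.get("IsIndexedRead") == "Y",
        ))
    return sorted(layout)


print(read_layout(EXAMPLE_RUNINFO))
# prints [(1, 76, False), (2, 17, True), (3, 8, True), (4, 76, False)]
```

For a file on disk, the same function could be fed `open(path).read()`.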

After reading a number of Biostars posts and other resources, I found bcl2fastq options that should generate 5 fastqs (1 for each index, 1 for the UMI, and 1 for each paired-end read): --use-bases-mask Y76,I8Y9,I8,Y76 --create-fastq-for-index-reads. The bcl2fastq options I used are posted below.

bcl2fastq --input-dir /pathtoruns/hiseq/runs/IlluminaRunxGEN/Data/Intensities/BaseCells \
--runfolder-dir /pathtoruns/hiseq/runs/IlluminaRunxGEN \
--output-dir /outputdirectorypath \
--sample-sheet /pathtoruns/hiseq/runs/IlluminaRunxGEN/SampleSheet.csv \
--barcode-mismatches 1 \
--ignore-missing-bcl \
--no-lane-splitting
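The base mask for a layout like this can be derived mechanically from the read structure: non-index reads become `Y<cycles>`, index reads become `I<cycles>`, and an index read that carries a trailing UMI becomes `I<index_len>Y<umi_len>`. The helper below is a sketch of that bookkeeping only; `build_bases_mask` and its arguments are hypothetical names, not part of bcl2fastq.

```python
def build_bases_mask(reads, umi_after_index=None):
    """Build a --use-bases-mask string.

    reads: list of (cycles, is_index) in sequencing order.
    umi_after_index: {read_position (1-based): umi_length} for index reads
        whose trailing cycles are a UMI rather than index bases.
    """
    umi_after_index = umi_after_index or {}
    parts = []
    for pos, (cycles, is_index) in enumerate(reads, start=1):
        if not is_index:
            parts.append(f"Y{cycles}")
            continue
        umi_len = umi_after_index.get(pos, 0)
        index_len = cycles - umi_len
        if index_len <= 0:
            raise ValueError(f"read {pos}: UMI length {umi_len} leaves no index bases")
        # Index bases first, then the UMI cycles exposed as a 'Y' segment.
        parts.append(f"I{index_len}" + (f"Y{umi_len}" if umi_len else ""))
    return ",".join(parts)


# Layout from the post: 76 / 17 (8bp i7 + 9bp UMI) / 8 / 76
mask = build_bases_mask([(76, False), (17, True), (8, True), (76, False)], {2: 9})
print(mask)  # prints Y76,I8Y9,I8,Y76
```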


However, the fastq for the UMI only contains NNNNNNNNN for the sequence line and ######### for the quality line for each of the demultiplexed reads. All the other 4 fastq files look great. I have tried many different --use-bases-mask options, with and without altering the RunInfo.xml and SampleSheet.csv, to try to get the UMI sequence, but I could use help figuring out what I am doing wrong.
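A symptom like this is easy to confirm without scrolling through the file. The sketch below counts what fraction of the first records in a FASTQ have a sequence made up entirely of N's. It assumes plain 4-line FASTQ records (which is what bcl2fastq writes); `all_n_fraction` and `open_fastq` are hypothetical helper names.

```python
import gzip
import io


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def all_n_fraction(handle, max_records=10000):
    """Fraction of the first max_records reads whose sequence is only N's."""
    total = 0
    all_n = 0
    while total < max_records:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        total += 1
        if seq and set(seq.upper()) == {"N"}:
            all_n += 1
    return all_n / total if total else 0.0


# In-memory demo: one fully masked read, one normal read.
demo = io.StringIO("@r1\nNNNNNNNNN\n+\n#########\n@r2\nACGTACGTA\n+\nIIIIIIIII\n")
print(all_n_fraction(demo))  # prints 0.5
```

A value close to 1.0 for the UMI file, and close to 0 for the other files, matches the behaviour described here.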


Write your whole bcl2fastq command line. There might be other options you can add to fix this.


Could you post a few sequences and a mini diagram of what you expect like [Adapter] [UMI] [SEQ] .. ?


I don't think there's an easier way to do this, since Illumina UMIs are on read 1, not the index.


AFAIK bcl2fastq only handles UMIs that are in the main R1/R2 reads. This past thread has some options you can consider: "Can Illumina bcl2fastq use only one index for demultiplexing dual index sequencing data?"


genomax: Thank you for pointing out your post. I had come across it, but it did not resolve my issue of bcl2fastq writing 9 N's instead of the sequence in the UMI file. The options that I use for bcl2fastq do everything correctly except for the R2 fastq containing the UMI read. The R1 and R4 76bp read fastq files are correct, and the I1 and I2 8bp index fastq files are correct.

Using either of the following --use-bases-mask options with the bcl2fastq command and the original RunInfo.xml and SampleSheet.csv creates the same 5 fastqs: Y*,IIIIIIIIYYYYYYYYY,IIIIIIII,Y* or Y*,I8Y*,I8,Y*. It is as though bcl2fastq only permits the index sequence to be written out, even if there are read cycles beyond the bases used for the index.

With that, I created an altered SampleSheet.csv in which the i7 index had the 8bp sample index followed by 9 N's, since Illumina Basespace notes that N's match any base. Using the 'UMI Altered i7 Index' SampleSheet with --use-bases-mask Y76,I8Y9,I8,Y76 failed to complete because the lengths did not match. Using the 'UMI Altered i7 Index' SampleSheet with --use-bases-mask Y76,Y*,I8,Y76 created 3 read files and 1 index file, but the R2 i7 index UMI fastq only had 17 N's. This may be because the RunInfo.xml still had R2 as an index.

Then, I created an altered SampleSheet that only had the i5 index, and I altered the RunInfo.xml so that read 2 (17 cycles) was marked as a non-index read ('N'), thinking that bcl2fastq would treat it as read cycles. I set --use-bases-mask Y76,Y17,I8,Y76. This also created 3 read and 1 index file, but the R2 i7 index UMI fastq only had 17 N's.

I am open to any additional suggestions for getting the sequence of the UMI using bcl2fastq or another method. I would rather not use the Picard approach suggested by IDT, which requires generating a BAM file and then converting that to fastq. Again, thank you for your time and help.


Hello, I'm curious how your sample sheet was set up for bcl2fastq. This is also the first time I've come across the UMI situation for demultiplexing. I left my RunInfo.xml alone for a paired-end NextSeq run: read 1 is 71 cycles, read 2 is 17 cycles (i7 index + UMI), read 3 is 8 cycles for the i5 index, and read 4 is 71 cycles.

I had left the 9-base NNNNNNNNN in the sample sheet, and I was wondering whether I should not have done that. Thank you.


Are you sure you are using Illumina's UMIs? See this post from earlier in this thread: C: bcl2fastq with xGen Dual Index UMI Adapters to produce 3 read and 2 index fastqs


I believe so. I've never dealt with UMIs before, but the scientist/bioinformatics person gave me this resource to get the indexes: https://www.idtdna.com/pages/products/next-generation-sequencing/adapters/xgen-dual-index-umi-adapters-tech-access


That is what the original poster of this thread was using. So you can use their description from the original question, along with the --mask-short-adapter-reads 0 option, to get separate files for the reads and the UMI.


Thank you so much! This helped me greatly, after I found the correct UMI indexes, that is!


Typos in original post:
/BaseCells --> /BaseCalls
--ignore-missing-bcl --> --ignore-missing-bcls

2.4 years ago
GenoMax 112k

Since Illumina Basespace notes that N's match any base

BaseSpace is not the same as local bcl2fastq2 demux. Local bcl2fastq does not expand N to match any base.
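Since N's in the sample sheet are not treated as wildcards by a local run, one option discussed in this thread is to strip the UMI placeholder N's so that only the real index bases remain. The sketch below does that for the `index` and `index2` columns. It assumes the data header row contains a `Sample_ID` column, and `strip_umi_ns` is a hypothetical helper name; work on a copy of the sample sheet.

```python
import csv
import io


def strip_umi_ns(sheet_text, columns=("index", "index2")):
    """Remove trailing N's from the given columns of the sample rows.

    Lines before the header row (the one containing 'Sample_ID') are
    written back unchanged.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    col_idx = None
    for row in csv.reader(io.StringIO(sheet_text)):
        if col_idx is None:
            if "Sample_ID" in row:
                # Header row found: remember which columns to clean.
                col_idx = [i for i, name in enumerate(row) if name in columns]
        else:
            for i in col_idx:
                if i < len(row):
                    row[i] = row[i].rstrip("Nn")
        writer.writerow(row)
    return out.getvalue()


sheet = (
    "[Data]\n"
    "Sample_ID,Sample_Name,index,index2\n"
    "Sample1,Sample1,TACCGAGGNNNNNNNNN,AGTTCAGG\n"
)
print(strip_umi_ns(sheet), end="")
# [Data]
# Sample_ID,Sample_Name,index,index2
# Sample1,Sample1,TACCGAGG,AGTTCAGG
```

If the index sequence was also pasted into an ID column, that column name can be added to `columns`. An index made up only of N's would become empty, so that case needs separate handling.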

However, the fastq for the UMI only contains NNNNNNNNN for the sequence line and ######### for the quality line for each of the demultiplexed reads.

and

This also created 3 read and 1 index file, but the R2 i7 index UMI fastq only had 17 N's.

That is because the UMI read is now smaller than 35 bp and the sequence is masked with N's. You will need to add the --mask-short-adapter-reads 0 option to your bcl2fastq run to unmask the bases in the UMI file. You will get basecalls and qualities back after this step.


genomax, adding the --mask-short-adapter-reads 0 option to the bcl2fastq commands in the original post worked perfectly. I would not have thought of using that option. Thank you so much!
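Putting the pieces from this thread together, the full invocation can be assembled as an argument list. This is a sketch, not a verified command for any particular run: the paths are the placeholders from the original post, the two spelling corrections noted in the thread (BaseCalls, --ignore-missing-bcls) are applied, and `bcl2fastq_argv` is a hypothetical helper name.

```python
import shlex


def bcl2fastq_argv(run_dir, out_dir, bases_mask, sample_sheet=None):
    """Assemble a bcl2fastq argument list for the index+UMI layout."""
    sample_sheet = sample_sheet or f"{run_dir}/SampleSheet.csv"
    return [
        "bcl2fastq",
        "--input-dir", f"{run_dir}/Data/Intensities/BaseCalls",
        "--runfolder-dir", run_dir,
        "--output-dir", out_dir,
        "--sample-sheet", sample_sheet,
        "--use-bases-mask", bases_mask,
        "--create-fastq-for-index-reads",
        # Keep short reads (here the 9bp UMI) from being masked with N's.
        "--mask-short-adapter-reads", "0",
        "--barcode-mismatches", "1",
        "--ignore-missing-bcls",
        "--no-lane-splitting",
    ]


argv = bcl2fastq_argv(
    "/pathtoruns/hiseq/runs/IlluminaRunxGEN",  # placeholder path from the post
    "/outputdirectorypath",                    # placeholder path from the post
    "Y76,I8Y9,I8,Y76",
)
print(shlex.join(argv))
```

The list can be passed to `subprocess.run(argv, check=True)` once the paths point at a real run folder.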


Having similar issues with bcl2fastq as the original poster. Our setup: RunInfo.xml has 3 reads: read 1, 120 cycles, 'N' index; read 2, 8 cycles, 'N' index; read 3, 8 cycles, 'Y' index. This reflects 120bp single-end sequencing with an 8bp random i7 index (which serves as our UMI) and an 8bp i5 index. The SampleSheet.csv has the 8bp i5 index sequences and nothing in the i7 index columns. We used --use-bases-mask Y120,Y8,I8, --create-fastq-for-index-reads, and --mask-short-adapter-reads 0. We are able to get the fastq outputs for R1, R2 (i7), and I1 (i5), but the sequences in R2 were all N's instead of actual sequences. Any guidance will be much appreciated!


In Illumina sequencing, if index reads are present, they are always sequenced between the R1/R2 reads, i.e. Read 1 --> Index 1 --> Index 2 --> Read 2. It is not possible to do Read 1 --> Read 2 --> Index 1.

If your run setup was 120 x 8 x 8 bp, you will have to set Read 2 as an index (even though it may not be) in your RunInfo.xml file. That would be the only way you can get sequence for all reads using the --create-fastq-for-index-reads option. You will then rename the I1 file to R2 and then rename the I2 file to I1.
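The renaming step is order-sensitive: the I1 file has to move to R2 before the I2 file can take over the I1 name. The sketch below only plans the renames and refuses to overwrite an existing file. It assumes bcl2fastq-style names ending in `_I1_001.fastq.gz` or `_I2_001.fastq.gz`; `plan_renames` is a hypothetical helper name.

```python
import re

# Matches the read tag and the trailing chunk number of a bcl2fastq file name.
_TAG = re.compile(r"_(I1|I2)_(\d{3}\.fastq(?:\.gz)?)$")
_NEW_TAG = {"I1": "R2", "I2": "I1"}


def plan_renames(filenames):
    """Return [(old, new), ...] in an order that is safe to apply."""
    existing = set(filenames)
    plan = []
    # Handle I1 first so the I1 name is free before I2 moves into it.
    for tag in ("I1", "I2"):
        for name in sorted(filenames):
            match = _TAG.search(name)
            if not match or match.group(1) != tag:
                continue
            new = name[:match.start()] + f"_{_NEW_TAG[tag]}_{match.group(2)}"
            if new in existing:
                raise FileExistsError(f"{new} already exists; refusing to overwrite")
            existing.discard(name)
            existing.add(new)
            plan.append((name, new))
    return plan


files = [
    "Undetermined_S0_R1_001.fastq.gz",
    "Undetermined_S0_I1_001.fastq.gz",
    "Undetermined_S0_I2_001.fastq.gz",
]
for old, new in plan_renames(files):
    print(old, "->", new)
# Undetermined_S0_I1_001.fastq.gz -> Undetermined_S0_R2_001.fastq.gz
# Undetermined_S0_I2_001.fastq.gz -> Undetermined_S0_I1_001.fastq.gz
```

Applying the plan with `os.rename(old, new)` in the returned order performs the swap. The read headers inside the files are not touched by this sketch.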


Just to clarify: would that then require changing our --use-bases-mask option to Y120,I8,I8,Y* (the last Y* not having a read in our RunInfo.xml)? If so, the issue we are facing is that our i7 is completely random (NNNNNNNN; 8 N's), which does not seem to work when treating it as an index. Our current sample sheet only has i5 barcodes (corresponding to different samples) and no i7 index specified. We found that specifying NNNNNNNN as the i7 barcode in our sample sheet results in the program literally looking for N's, as opposed to accepting the random 8bp at the i7 index position (which is what we actually want).

Thank you!


Your best bet is to let all reads go to "Undetermined" pool and collect the three files. You can do that by providing N's for indexes with a single dummy entry in your SampleSheet.csv. You will also need to change RunInfo.xml to: read 1 120 cycles 'N'; read 2 8 cycles "Y" index; read 3 8 cycles "Y" index.

Then rename your files like I suggested above (you may need to rename fastq read headers to match) and then use something like demuxbyname.sh from BBMap suite or deML to demux your data.

8 months ago
szutre ▴ 10

I usually remove Ns from the sample sheet and then run this command

$ bcl2fastq -i Runfolder/Data/Intensities/BaseCalls -R Runfolder -o output_directory --sample-sheet Runfolder/SampleSheet.csv -r number_of_threads -p number_of_threads -w number_of_threads --no-lane-splitting --mask-short-adapter-reads 0 --use-bases-mask Y*,I8Y9,I8,Y*

Original sample sheet:

Sample_ID,Sample_Name,Description,index,I7_Index_ID,index2,I5_Index_ID,Sample_Project
Sample1,Sample1,,TACCGAGGNNNNNNNNN,TACCGAGGNNNNNNNNN,AGTTCAGG,AGTTCAGG,BRCA1
Sample2,Sample1,,CGTTAGAANNNNNNNNN,CGTTAGAANNNNNNNNN,GACCTGAA,GACCTGAA,BRCA1

Formatted sample sheet:

Sample_ID,Sample_Name,Description,index,I7_Index_ID,index2,I5_Index_ID,Sample_Project
Sample1,Sample1,,TACCGAGG,TACCGAGG,AGTTCAGG,AGTTCAGG,BRCA1
Sample2,Sample1,,CGTTAGAA,CGTTAGAA,GACCTGAA,GACCTGAA,BRCA1

Our RunInfo.xml file looks like this:

<Reads>
<Read Number="1" NumCycles="147" IsIndexedRead="N" />
<Read Number="2" NumCycles="17" IsIndexedRead="Y" />
<Read Number="3" NumCycles="8" IsIndexedRead="Y" />
<Read Number="4" NumCycles="146" IsIndexedRead="N" />
</Reads>


This is what original poster was doing.


In the original post the person was not removing the Ns from the samplesheet and that's why he/she was getting Ns in the UMI file.


3 months ago

I have a work around as follows: Edit the RunInfo.xml near the top where the reads are defined for the run to create a new 'read' for your UMI sequence e.g.

<Reads>
<Read Number="1" NumCycles="76" IsIndexedRead="N" />
<Read Number="2" NumCycles="17" IsIndexedRead="Y" />
<Read Number="3" NumCycles="8" IsIndexedRead="Y" />
<Read Number="4" NumCycles="76" IsIndexedRead="N" />
</Reads>

becomes:

<Reads>
<Read Number="1" NumCycles="76" IsIndexedRead="N" />
<Read Number="2" NumCycles="8" IsIndexedRead="Y" />
<Read Number="3" NumCycles="9" IsIndexedRead="N" />
<Read Number="4" NumCycles="8" IsIndexedRead="Y" />
<Read Number="5" NumCycles="76" IsIndexedRead="N" />
</Reads>

Just include the 8 bases of the I7 index sequence in the sample sheet, don't put any 'N's in for the UMI sequence or bcl2fastq won't run. then:

$ nohup bcl2fastq --minimum-trimmed-read-length 8 --mask-short-adapter-reads 8

will create 3 FASTQ (R1, R2 and R3) where R2 is the 9 base UMI sequence and R3 is sequence read 2. I5 and I7 index sequences will be contained within R1 and R3 FASTQ as usual.
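The RunInfo.xml edit in this workaround can be scripted rather than done by hand. The sketch below splits one combined index+UMI read into an index read followed by a non-index UMI read and renumbers the rest. `split_index_umi_read` is a hypothetical helper name, the example XML is reduced to the `<Reads>` block, and any extra attributes on the `<Read>` elements of a real file would be dropped by this simple version, so it should be run on a copy.

```python
import xml.etree.ElementTree as ET

# Reduced example: only the <Reads> block of a RunInfo.xml.
EXAMPLE_READS = """<Reads>
<Read Number="1" NumCycles="76" IsIndexedRead="N" />
<Read Number="2" NumCycles="17" IsIndexedRead="Y" />
<Read Number="3" NumCycles="8" IsIndexedRead="Y" />
<Read Number="4" NumCycles="76" IsIndexedRead="N" />
</Reads>"""


def split_index_umi_read(runinfo_xml, read_number=2, index_len=8):
    """Split read `read_number` into an index read plus a UMI read."""
    root = ET.fromstring(runinfo_xml)
    reads_parent = next(root.iter("Reads"))
    old_reads = sorted(reads_parent.findall("Read"),
                       key=lambda r: int(r.get("Number")))
    new_reads = []
    for read in old_reads:
        cycles = int(read.get("NumCycles"))
        if int(read.get("Number")) == read_number:
            umi_len = cycles - index_len
            if umi_len <= 0:
                raise ValueError("read is not longer than the index")
            new_reads.append((index_len, "Y"))  # index bases
            new_reads.append((umi_len, "N"))    # UMI as a normal read
        else:
            new_reads.append((cycles, read.get("IsIndexedRead")))
    # Replace the old <Read> elements with the renumbered ones.
    for read in old_reads:
        reads_parent.remove(read)
    for number, (cycles, is_index) in enumerate(new_reads, start=1):
        ET.SubElement(reads_parent, "Read", Number=str(number),
                      NumCycles=str(cycles), IsIndexedRead=is_index)
    return ET.tostring(root, encoding="unicode")


updated = split_index_umi_read(EXAMPLE_READS)
for read in ET.fromstring(updated).iter("Read"):
    print(read.get("Number"), read.get("NumCycles"), read.get("IsIndexedRead"))
# 1 76 N
# 2 8 Y
# 3 9 N
# 4 8 Y
# 5 76 N
```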