I have RNA-Seq data from an Illumina HiSeq run using IDT xGen Dual Index UMI Adapters, where the UMI is appended to the i7 index. Typically, I would use the Picard-based pipeline that IDT recommends. However, I was hoping to use bcl2fastq (v126.96.36.199) to produce FASTQ files directly.
Below is the molecular biology for the xGen Dual Index UMI library construct. The key thing to note is that the 9bp UMI comes after the 8bp i7 index. We use the dual 8bp i5 and i7 indexes to help identify index hopping among the reads. The UMI helps distinguish biological duplicates from PCR duplicates.
More information about the xGen Dual Index UMI adapter library construction can be found at the IDT link: https://www.idtdna.com/pages/products/next-generation-sequencing/adapters/xgen-dual-index-umi-adapters-tech-access
The RunInfo.xml has 4 reads: read 1 with 76 cycles (index 'N'), read 2 with 17 cycles (index 'Y'), read 3 with 8 cycles (index 'Y'), and read 4 with 76 cycles (index 'N'). This reflects 76bp paired-end sequencing, with the 8bp i7 index followed by the 9bp random UMI (17 cycles in total) and the 8bp i5 index. The SampleSheet.csv has the corresponding 8bp i7 and 8bp i5 index sequences.
After reading a number of Biostars posts and other resources, I found the bcl2fastq options that would generate 5 FASTQ files: 1 for each index, 1 for the UMI, and 1 for each of the paired-end reads, by setting --use-base-mask Y76,I8Y9,I8,Y9 --create-fastq-for-index-reads. The bcl2fastq options I used are posted below.
    bcl2fastq --input-dir /pathtoruns/hiseq/runs/IlluminaRunxGEN/Data/Intensities/BaseCalls \
        --runfolder-dir /pathtoruns/hiseq/runs/IlluminaRunxGEN \
        --output-dir /outputdirectorypath \
        --sample-sheet /pathtoruns/hiseq/runs/IlluminaRunxGEN/SampleSheet.csv \
        --barcode-mismatches 1 \
        --with-failed-reads \
        --use-base-mask Y76,I8Y9,I8,Y9 \
        --create-fastq-for-index-reads \
        --ignore-missing-bcl \
        --no-lane-splitting
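To sanity-check the mask against the RunInfo.xml read lengths, here is a minimal Python sketch that tallies how many cycles each comma-separated mask segment covers. The helper names (mask_cycles, mismatched_reads) are my own and are not part of bcl2fastq. It only handles explicit Y/I/N letters with optional counts, not the * wildcard, and it says nothing about how bcl2fastq itself treats a mismatch.

```python
import re

# One mask token: a letter (Y = read, I = index, N = skip) plus an optional count.
TOKEN = re.compile(r"([YIN])(\d*)", re.IGNORECASE)
SEGMENT = re.compile(r"(?:[YIN]\d*)+", re.IGNORECASE)


def mask_cycles(base_mask):
    """Return the number of cycles covered by each segment of a base mask."""
    counts = []
    for segment in base_mask.split(","):
        if not SEGMENT.fullmatch(segment):
            # Wildcards such as 'Y*' are deliberately not handled in this sketch.
            raise ValueError(f"unsupported mask segment: {segment!r}")
        # A letter without a count stands for a single cycle.
        counts.append(sum(int(n) if n else 1 for _, n in TOKEN.findall(segment)))
    return counts


def mismatched_reads(base_mask, run_cycles):
    """List (read number, mask cycles, RunInfo cycles) where the two differ."""
    pairs = zip(mask_cycles(base_mask), run_cycles)
    return [(i, m, r) for i, (m, r) in enumerate(pairs, start=1) if m != r]


# Cycle counts as listed in my RunInfo.xml.
run_cycles = [76, 17, 8, 76]
mismatches = mismatched_reads("Y76,I8Y9,I8,Y9", run_cycles)
print(mask_cycles("Y76,I8Y9,I8,Y9"), mismatches)
```

For the mask in the command above, the segments cover 76, 17, 8 and 9 cycles, so reads 1 to 3 line up with RunInfo.xml and only read 4 is shorter than the 76 cycles listed there.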
However, the FASTQ file for the UMI contains only NNNNNNNNN on the sequence line and ######### on the quality line for every demultiplexed read. The other 4 FASTQ files look great. I have tried many different --use-base-mask options, with and without altering the RunInfo.xml and SampleSheet.csv, to try to get the UMI sequence, but I could use help figuring out what I am doing wrong.
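For anyone who wants to check their own output the same way, this is roughly how I count the all-N records. It is a minimal Python sketch that assumes standard 4-line FASTQ records; the function names are mine, and the demo uses two made-up in-memory records rather than my real file.

```python
import gzip
import io


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def count_all_n(handle):
    """Return (total records, records whose sequence is entirely N)."""
    total = 0
    all_n = 0
    for line_number, line in enumerate(handle):
        # In a 4-line FASTQ record the sequence is the second line.
        if line_number % 4 == 1:
            sequence = line.strip().upper()
            total += 1
            if sequence and set(sequence) == {"N"}:
                all_n += 1
    return total, all_n


# Demo with two illustrative records: one all-N UMI and one real-looking UMI.
demo = io.StringIO(
    "@read1\nNNNNNNNNN\n+\n#########\n"
    "@read2\nACGTACGTA\n+\nIIIIIIIII\n"
)
demo_counts = count_all_n(demo)
print(demo_counts)
```

On a real file, the handle would come from open_fastq with the path to the UMI FASTQ. In my case every record counts as all-N.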
Thank you in advance.