I have RNA-Seq data from an Illumina HiSeq run using IDT xGen Dual Index UMI Adapters, where the UMI is added to the i7 index read. Typically, I would use the Picard-based pipeline recommended by IDT. However, I was hoping to use bcl2fastq (v2.17.1.14) to produce FASTQ files directly.
Below is the molecular biology for the xGen Dual Index UMI library construct. The key thing to note is that the 9 bp UMI comes after the 8 bp i7 index. We use the dual 8 bp i5 and i7 indexes to help identify index hopping among the reads. The UMI helps distinguish biological duplicates from PCR duplicates.
More information about the xGen Dual Index UMI adapter library construction can be found at the IDT link below. https://www.idtdna.com/pages/products/next-generation-sequencing/adapters/xgen-dual-index-umi-adapters-tech-access
The RunInfo.xml has 4 reads: read 1 with 76 cycles (index 'N'), read 2 with 17 cycles (index 'Y'), read 3 with 8 cycles (index 'Y'), and read 4 with 76 cycles (index 'N'). This reflects 76 bp paired-end sequencing, with the 8 bp i7 index followed by a 9 bp random UMI, and an 8 bp i5 index. The SampleSheet.csv has the corresponding 8 bp i7 and 8 bp i5 index sequences.
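For anyone wanting to double-check the read layout before choosing a base mask, here is a minimal sketch that lists the reads declared in a RunInfo.xml. The function name list_reads is made up for illustration, and it assumes each Read element carries its attributes in the usual order (Number, NumCycles, IsIndexedRead) on a single line.

```shell
# Print "read_number cycles is_index" for every <Read .../> element
# found in a RunInfo.xml supplied on stdin.
# Assumption: attributes appear in the order Number, NumCycles, IsIndexedRead.
list_reads() {
  grep -o '<Read [^>]*>' |
    sed -E 's/.*Number="([0-9]+)".*NumCycles="([0-9]+)".*IsIndexedRead="([YN])".*/\1 \2 \3/'
}

# Hypothetical usage (adjust the path to your run folder):
# list_reads < /pathtoruns/hiseq/runs/IlluminaRunxGEN/RunInfo.xml
```

For the run described here, this should report four reads of 76, 17, 8, and 76 cycles, with reads 2 and 3 flagged as index reads.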
After reading a number of Biostars posts and other resources, I found the bcl2fastq options that should generate 5 FASTQ files (1 for each index, 1 for the UMI, and 1 for each of the paired-end reads) by setting --use-base-mask Y76,I8Y9,I8,Y9 --create-fastq-for-index-reads. The bcl2fastq options I used are posted below.
bcl2fastq --input-dir /pathtoruns/hiseq/runs/IlluminaRunxGEN/Data/Intensities/BaseCells \
--runfolder-dir /pathtoruns/hiseq/runs/IlluminaRunxGEN \
--output-dir /outputdirectorypath \
--sample-sheet /pathtoruns/hiseq/runs/IlluminaRunxGEN/SampleSheet.csv \
--barcode-mismatches 1 \
--with-failed-reads \
--use-base-mask Y76,I8Y9,I8,Y9 \
--create-fastq-for-index-reads \
--ignore-missing-bcl \
--no-lane-splitting
However, the FASTQ for the UMI contains only NNNNNNNNN on the sequence line and ######### on the quality line for each of the demultiplexed reads. The other 4 FASTQ files look great. I have tried many different --use-base-mask options, with and without altering the RunInfo.xml and SampleSheet.csv, to try to get the UMI sequence, but I could use help figuring out what I am doing wrong.
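A quick way to quantify the symptom is to count how many records in the UMI FASTQ have a sequence made up entirely of N's. This is only a sketch: count_all_n is a hypothetical helper name, the file name in the usage comment is a placeholder, and it assumes standard 4-line FASTQ records.

```shell
# Read an uncompressed FASTQ on stdin and print "all_N_records total_records".
# Assumption: every record is exactly 4 lines (no wrapped sequence lines).
count_all_n() {
  awk 'NR % 4 == 2 { total++; if ($0 ~ /^N+$/) masked++ }
       END { printf "%d %d\n", masked, total }'
}

# Hypothetical usage (the file name is a placeholder):
# gzip -dc Sample_S1_R2_001.fastq.gz | count_all_n
```

If the two numbers are equal, every UMI read has been masked rather than just a low-quality subset.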
Thank you in advance.
Write your whole bcl2fastq command line. There might be other options you can add to fix this.
Could you post a few sequences and a mini diagram of what you expect like [Adapter] [UMI] [SEQ] .. ?
I don't think there's an easier way to do this, since Illumina UMIs are on read 1, not the index.
AFAIK bcl2fastq only handles UMI's that are in main R1/R2 reads. This is a past thread that has some options you can consider: Can Illumina bcl2fastq use only one index for demultiplexing dual index sequencing data?
genomax: Thank you for pointing out your post. I had come across it, but it did not resolve my issue of bcl2fastq writing 9 N's instead of the sequence in the UMI files. The options I use for bcl2fastq do everything correctly except for the R2 FASTQ containing the UMI read. The R1 and R4 76 bp read FASTQ files are correct, and the I1 and I2 8 bp read FASTQ files are correct.
Using the following --use-bases-mask options with the bcl2fastq command and the original RunInfo.xml and SampleSheet.csv creates the same 5 FASTQ files: Y,IIIIIIIIYYYYYYYYY,IIIIIIII,Y or Y,I8Y,I8,Y*. It is as though bcl2fastq only permits the index sequence to be written as output, even if there are read cycles beyond the bases used for the index.
With that, I created an altered SampleSheet.csv in which the i7 index had the 8 bp sample index followed by 9 N's, since Illumina BaseSpace notes that N's match any base. Using the 'UMI Altered i7 Index' SampleSheet with --use-bases-mask Y76,I8Y9,I8,Y76 failed to complete because the lengths did not match. Using the 'UMI Altered i7 Index' SampleSheet with --use-bases-mask Y76,Y*,I8,Y76 created 3 read files and 1 index file, but the R2 i7 index + UMI FASTQ contained only 17 N's. This may be because the RunInfo.xml still had R2 flagged as an index read.
Then I created an altered SampleSheet that had only the i5 index, and I altered the RunInfo.xml so that read 2 was 17 cycles with index 'N', thinking that bcl2fastq would treat these as read cycles. I set --use-bases-mask Y76,Y17,I8,Y76. This also created 3 read files and 1 index file, but the R2 i7 index + UMI FASTQ again contained only 17 N's.
I am open to any additional suggestions for getting the UMI sequence using bcl2fastq or another method. I would rather not use the Picard approach suggested by IDT, which requires generating a BAM file and then converting that to FASTQ. Again, thank you for your time and help.
Hello, I'm curious how your sample sheet was set up for bcl2fastq. This is also the first time I've come across UMIs when demultiplexing. I left my RunInfo.xml alone for a paired-end NextSeq run: read 1 is 71 cycles, read 2 is 17 cycles (i7 index + UMI), read 3 is 8 cycles for the i5 index, and read 4 is 71 cycles.
I had left the 9-base NNNNNNNNN in the sample sheet, and I was wondering whether I should not have done that. Thank you.
Are you sure you are using Illumina's UMIs? See this post from above in this thread: C: bcl2fastq with xGen Dual Index UMI Adapters to produce 3 read and 2 index fastqs
I believe so. I've never dealt with UMIs before, but the scientist/bioinformatics person gave me this resource to get the indexes: https://www.idtdna.com/pages/products/next-generation-sequencing/adapters/xgen-dual-index-umi-adapters-tech-access
That is what the original poster of this thread was using. So you can use their description from the original question along with the
--mask-short-adapter-reads 0
option to get separate files for the reads and the UMI.
Thank you so much! This helped me greatly, after I found the correct UMI indexes, that is!
Typos in original post:
/BaseCells --> /BaseCalls
--use-base-mask --> --use-bases-mask
--ignore-missing-bcl --> --ignore-missing-bcls
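Putting the thread together, here is a sketch of the command with the three typos above corrected and --mask-short-adapter-reads 0 added. As far as I understand the bcl2fastq2 documentation, that option defaults to 22, so a 9-base read such as the UMI is replaced entirely with N's unless the threshold is lowered. Two things here are assumptions rather than something confirmed in the thread: the final mask element is written as Y76 (read 4 is 76 cycles, while the original post had Y9), and the paths are the placeholders from the original question. The wrapper name print_bcl2fastq_cmd is made up for illustration.

```shell
# Placeholders copied from the original post; adjust to your run.
RUN_DIR=/pathtoruns/hiseq/runs/IlluminaRunxGEN
OUT_DIR=/outputdirectorypath

# Dry run: print the full command for review instead of executing it.
# To actually run it, remove the leading "echo" inside the function.
print_bcl2fastq_cmd() {
  echo bcl2fastq \
    --input-dir "$RUN_DIR/Data/Intensities/BaseCalls" \
    --runfolder-dir "$RUN_DIR" \
    --output-dir "$OUT_DIR" \
    --sample-sheet "$RUN_DIR/SampleSheet.csv" \
    --barcode-mismatches 1 \
    --with-failed-reads \
    --use-bases-mask Y76,I8Y9,I8,Y76 \
    --create-fastq-for-index-reads \
    --mask-short-adapter-reads 0 \
    --ignore-missing-bcls \
    --no-lane-splitting
}

print_bcl2fastq_cmd
```

With this mask, the 9 UMI cycles after the i7 index are written as their own read file (R2), alongside R1, R3, and the two index files.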