Question: bcl2fastq with xGen Dual Index UMI Adapters to produce 3 read and 2 index fastqs
segrossk0 wrote:

I have RNA-Seq data from an Illumina HiSeq run using IDT xGen Dual Index UMI Adapters, where the UMI is added on the i7 index. Typically, I would use the Picard-based pipeline recommended by IDT. However, I was hoping to use bcl2fastq (v2.17.1.14) to produce fastq files directly.

In the xGen dual index with UMI library construct, the key thing to note is that the 9 bp UMI comes directly after the 8 bp i7 index. We use the dual 8 bp i5 and i7 indexes to help identify index hopping among the reads. The UMI helps distinguish biological duplicates from PCR duplicates.

More information about the xGen Dual Index UMI adapter library construction can be found at the IDT link below. https://www.idtdna.com/pages/products/next-generation-sequencing/adapters/xgen-dual-index-umi-adapters-tech-access

The RunInfo.xml has 4 reads: read 1, 76 cycles, index 'N'; read 2, 17 cycles, index 'Y'; read 3, 8 cycles, index 'Y'; and read 4, 76 cycles, index 'N'. This reflects 76 bp paired-end sequencing, with the 8 bp i7 index followed by the 9 bp random UMI, and the 8 bp i5 index. The SampleSheet.csv has the corresponding 8 bp i7 and 8 bp i5 index sequences.

After reading a number of Biostar posts and other resources, I found the bcl2fastq options that should generate 5 fastq files: 1 for each index, 1 for the UMI, and 1 for each of the paired-end reads, by setting --use-bases-mask Y76,I8Y9,I8,Y76 --create-fastq-for-index-reads. The bcl2fastq options I used are posted below.

bcl2fastq --input-dir /pathtoruns/hiseq/runs/IlluminaRunxGEN/Data/Intensities/BaseCalls \
               --runfolder-dir /pathtoruns/hiseq/runs/IlluminaRunxGEN \
               --output-dir /outputdirectorypath \
               --sample-sheet /pathtoruns/hiseq/runs/IlluminaRunxGEN/SampleSheet.csv \
               --barcode-mismatches 1 \
               --with-failed-reads \
               --use-bases-mask Y76,I8Y9,I8,Y76 \
               --create-fastq-for-index-reads \
               --ignore-missing-bcls \
               --no-lane-splitting

However, the fastq for the UMI only contains NNNNNNNNN for the sequence line and ######### for the quality line for each of the demultiplexed reads. All the other 4 fastq files look great. I have tried many different --use-bases-mask options, with and without altering the RunInfo.xml and SampleSheet.csv, to try to get the UMI sequence, but I could use help figuring out what I am doing wrong.
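
To see how widespread the problem described above is, one option is to count how many records in the UMI fastq have a sequence made up entirely of N. This is only a minimal sketch: the function name `count_masked_reads` and the file name in the usage comment are hypothetical, and it assumes a standard 4-line-per-record fastq.

```shell
# Count records in a fastq stream whose sequence line consists only of N.
# Prints "<all_N_records> <total_records>".
count_masked_reads() {
    awk 'NR % 4 == 2 {
             total++
             if ($0 ~ /^N+$/) masked++
         }
         END { printf "%d %d\n", masked, total }'
}

# Usage (the file name is a hypothetical example):
#   zcat Sample1_S1_R2_001.fastq.gz | count_masked_reads
```

If the two numbers are equal, every UMI read was masked, which points at a demultiplexing option rather than at poor base calls.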

Thank you in advance.

rna-seq • 222 views
modified 4 weeks ago by swbarnes26.7k • written 4 weeks ago by segrossk0

Write your whole bcl2fastq command line. There might be other options you can add to fix this.

written 4 weeks ago by swbarnes26.7k

Could you post a few sequences and a mini diagram of what you expect like [Adapter] [UMI] [SEQ] .. ?

written 4 weeks ago by Gabriel R.2.6k

I don't think there's an easier way to do this, since Illumina UMIs are on read 1, not the index.

written 4 weeks ago by Devon Ryan92k

AFAIK bcl2fastq only handles UMIs that are in the main R1/R2 reads. This is a past thread that has some options you can consider: Can Illumina bcl2fastq use only one index for demultiplexing dual index sequencing data?

written 4 weeks ago by genomax73k

genomax: Thank you for pointing out your post. I had come across it, but it did not resolve my issue of bcl2fastq writing 9 N's instead of the sequence in the UMI files. The options that I use for bcl2fastq do everything correctly except for the R2 fastq containing the UMI read. The R1 and R4 76 bp read fastq files are correct, and the I1 and I2 8 bp read fastq files are correct.

Using either of the following --use-bases-mask options with the bcl2fastq command and the original RunInfo.xml and SampleSheet.csv creates the same 5 fastqs: Y*,IIIIIIIIYYYYYYYYY,IIIIIIII,Y* or Y*,I8Y*,I8,Y*. It is as though bcl2fastq only writes out the index sequence, even when there are read cycles beyond the bases used for the index.

With that, I created an altered SampleSheet.csv in which the i7 index had the 8 bp sample index followed by 9 N's, since Illumina Basespace notes that N's match any base. Using the 'UMI Altered i7 Index' SampleSheet with --use-bases-mask Y76,I8Y9,I8,Y76 failed to complete because the lengths did not match. Using the 'UMI Altered i7 Index' SampleSheet with --use-bases-mask Y76,Y*,I8,Y76 created 3 read files and 1 index file, but the R2 i7 index UMI fastq only had 17 N's. This may be because the RunInfo.xml still had R2 as an index.

Then, I created an altered SampleSheet that only had the i5 index, and I altered the RunInfo.xml so that read 2 (17 cycles) was marked 'N' for index, thinking that bcl2fastq would treat it as read cycles. I set --use-bases-mask Y76,Y17,I8,Y76. This also created 3 read files and 1 index file, but the R2 i7 index UMI fastq only had 17 N's.
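
Each mask tried above has to account for exactly the cycles listed in RunInfo.xml (76, 17, 8 and 76 here). A quick way to sanity-check a candidate mask before launching a run is to expand it into per-read cycle counts. The helper below is a rough sketch, not part of bcl2fastq: `mask_cycles` is a hypothetical name, and it deliberately does not handle the `*` wildcard.

```shell
# Expand a bases mask such as "Y76,I8Y9,I8,Y76" into the number of cycles
# each read consumes. Supports a letter (Y, I or N) followed by an optional
# count, and runs of repeated letters (IIIIIIII). The "*" wildcard is NOT
# supported and is reported as an error.
mask_cycles() {
    printf '%s\n' "$1" | tr ',' '\n' | awk '
        {
            seg = toupper($0); cycles = 0
            while (match(seg, /^[YIN][0-9]*/)) {
                n = substr(seg, 2, RLENGTH - 1)      # digits after the letter, if any
                cycles += (n == "" ? 1 : n + 0)      # a bare letter counts as one cycle
                seg = substr(seg, RLENGTH + 1)
            }
            if (seg != "") {
                print "unsupported mask segment: " $0 > "/dev/stderr"
                bad = 1
                exit 1
            }
            out = out (NR > 1 ? "," : "") cycles
        }
        END { if (!bad && out != "") print out }'
}

mask_cycles 'Y76,I8Y9,I8,Y76'
```

The printed counts can be compared by eye with the NumCycles values in RunInfo.xml. A mask whose last segment is Y9, for example, expands to 76,17,8,9, which would not cover a 76-cycle read 4.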

I am open to any additional suggestions for getting the UMI sequence using bcl2fastq or another method. I would rather not use the Picard approach suggested by IDT, which requires generating a bam file and then converting that to fastq. Again, thank you for your time and help.

written 4 weeks ago by segrossk0
genomax73k wrote:

Since Illumina Basespace notes that N's match any base

BaseSpace is not the same as a local bcl2fastq2 demux. Local bcl2fastq does not expand N to match any base.

However, the fastq for the UMI only contains NNNNNNNNN for the sequence line and ######### for the quality line for each of the demultiplexed reads.

and

This also created 3 read and 1 index file, but the R2 i7 index UMI fastq only had 17 N's.

That is because the UMI read is shorter than 35 bp, so the sequence is masked with N's. You will need to add the --mask-short-adapter-reads 0 option to your bcl2fastq run to unmask the bases in the UMI file. You will get basecalls and qualities back after this step.
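
Put together with the options from the question, the invocation might look like the sketch below. The paths are the placeholders from the original post, `RUN_DIR`, `OUT_DIR` and `umi_demux_cmd` are hypothetical names, and the final Y76 assumes read 4 is 76 cycles as described for RunInfo.xml. The function only prints the command so it can be reviewed first.

```shell
# Placeholder paths taken from the question; adjust to the real run folder.
RUN_DIR=/pathtoruns/hiseq/runs/IlluminaRunxGEN
OUT_DIR=/outputdirectorypath

# Print the bcl2fastq command line (dry run). Remove "echo" to execute it.
umi_demux_cmd() {
    echo bcl2fastq \
        --input-dir "$RUN_DIR/Data/Intensities/BaseCalls" \
        --runfolder-dir "$RUN_DIR" \
        --output-dir "$OUT_DIR" \
        --sample-sheet "$RUN_DIR/SampleSheet.csv" \
        --barcode-mismatches 1 \
        --with-failed-reads \
        --use-bases-mask Y76,I8Y9,I8,Y76 \
        --create-fastq-for-index-reads \
        --mask-short-adapter-reads 0 \
        --ignore-missing-bcls \
        --no-lane-splitting
}

umi_demux_cmd
```

The only change relative to the options in the question is the added --mask-short-adapter-reads 0, which stops the short UMI read from being replaced with N's.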

modified 4 weeks ago • written 4 weeks ago by genomax73k

genomax, adding the --mask-short-adapter-reads 0 option to the bcl2fastq commands in the original post worked perfectly. I would not have thought of using that option. Thank you so much!

written 4 weeks ago by segrossk0