Question

Concatenate raw bcl files and demultiplex samples with UMIs

0

3.9 years ago

sr41489 • 0

Hello all,

I'm fairly new to bioinformatics, so please bear with me as I try to articulate an issue I'm facing at my lab. I recently did an RNA-seq run on Illumina's NextSeq platform. I set the run up as follows: R1: 146, Index1+UMI: 17, Index2: 8, R2: 146.

Given those parameters, the run resulted in 317 bcl files (.bcl.bgzf) per lane which makes sense with the parameters above (146 + 17 + 8 + 146 = 317).
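As a quick sanity check on that arithmetic, here is a minimal Python sketch that maps each read segment to its sequencing cycles. The segment names and lengths come from the run setup above; the helper name `cycle_ranges` is just illustrative.

```python
from itertools import accumulate

# Read layout as given in the post: (segment name, number of cycles).
segments = [("R1", 146), ("Index1+UMI", 17), ("Index2", 8), ("R2", 146)]

def cycle_ranges(segments):
    """Return (name, first_cycle, last_cycle) using 1-based inclusive cycles."""
    ends = list(accumulate(length for _, length in segments))
    starts = [1] + [end + 1 for end in ends[:-1]]
    return [(name, s, e) for (name, _), s, e in zip(segments, starts, ends)]

ranges = cycle_ranges(segments)
total_cycles = ranges[-1][2]

for name, first, last in ranges:
    print(f"{name}: cycles {first}-{last}")
print("total cycles:", total_cycles)
```

The total matches the 317 per-cycle bcl files seen in each lane, and the ranges show which cycles hold the index and UMI bases.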

Now, what I'd like to do from here is demultiplex each barcode, convert to FASTQ or BAM (either one will work for me), and proceed into enrichment analysis with BaseSpace.

If I'm mistaken on any of these operations, please advise. I have found many resources on which task to complete first (e.g. concat bcls --> demux --> UMI extraction, and various combinations of each task) and so I'm just a bit confused on how to go about doing this. Any advice is appreciated and I can try to clarify anything further if necessary.

Thank you!

RNA-Seq UMIs demultiplex • 2.5k views

ADD COMMENT • link updated 3.9 years ago by GenoMax 141k • written 3.9 years ago by sr41489 • 0

1

Please ask your sequence provider to demultiplex the files using bcl2fastq. This would be the easiest way. If you are planning to use BaseSpace, you can ask them to transfer your run there directly for demultiplexing.

Note: It is not as simple as concat bcls and demux. You need access to the full flowcell folder for bcl2fastq to demultiplex.

ADD REPLY • link 3.9 years ago by GenoMax 141k

0

Thank you for this explanation. I have access to the full flow cell folder for this, but now I have another question that might make things easier. I found that I have "undetermined" FASTQ files. Would it be easier to demultiplex these vs. going through each bcl? If so, what would you suggest I do from there? Thank you again, I appreciate your help.

ADD REPLY • link 3.9 years ago by sr41489 • 0

1

I found that I have "undetermined" FASTQ files. Would it be easier to demultiplex these vs. going through each bcl?

Yes. It is not easier than getting your provider to do the demultiplexing, but it is possible. Do you know the index combinations you expect? Index1 is tricky: you will only know part of that read, because the rest is the UMI and therefore variable. You could use demuxbyname.sh from the BBMap suite, or use deML, which is mentioned in the answer here: A: Demultiplexing Illumina data

ADD REPLY • link 3.9 years ago by GenoMax 141k

0

Unfortunately, getting this information from our usual pipeline will take a few weeks as they prioritize clinical samples (rightfully so) and I'm on the R&D side of things. With that, my boss wanted me to learn how to demultiplex and process the raw data so we can make optimizations to our workflow quicker.

Anyway, I do have the index combos I'd expect from this run, but let me show you an example of the headers from R1 & R2:

$ zcat Undetermined_S0_L001_R1_001.fastq.gz | head -n2
1:N:0:ACACGGTTTCAGAAAGT+NGGTTCGA
GGTGCNATCTTCACCAAAGCCTATCAACNAATGGTGCTAGATGCAGTGANATTAANNNACTTGNNGATTTTNNNGAATGGAACAAATGGTTNNNCTNNNNNNANNNNNNNNNANAGGGTTGNTACTTGCCATNCTCCTTNTGGTAA

$ zcat Undetermined_S0_L001_R2_001.fastq.gz | head -n2
2:N:0:ACACGGTTTCAGAAAGT+NGGTTCGA
AGCCCTGCTGTCTGGGTGGTTCTGACTCTTCAGGGGAGACCCAACATTATGAATTTTACTGAGTAGCCTCTCAAGATCTGGAAGCTTCTNTNGAAGCTNTNNNAATTNNNAGANTNNNTCAGNNNCAANNNNGAGNNNNTCTANNN

Now, I could tell that in the header line 1:N:0:ACACGGTTTCAGAAAGT+NGGTTCGA, the first 8 nt (ACACGGTT) are index1, the next 9 nt (TCAGAAAGT) are the UMI, and the 8 nt after the + (NGGTTCGA) are index2. The same pattern is evident in R2's header.
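To make that layout concrete, here is a minimal Python sketch that splits the barcode field of such a header into its three parts. It assumes index1 is the first 8 nt of the 17 nt index read and the UMI is the remaining 9 nt; the names `split_index_field` and `IndexFields` are hypothetical, not part of any existing tool.

```python
from typing import NamedTuple

INDEX1_LEN = 8  # assumption: index1 is the first 8 nt of the 17 nt index read
UMI_LEN = 9     # per the post: the UMI is 9 nt

class IndexFields(NamedTuple):
    index1: str
    umi: str
    index2: str

def split_index_field(header: str) -> IndexFields:
    """Split the last colon-separated field of a FASTQ header into its parts."""
    barcode_field = header.strip().rsplit(":", 1)[-1]
    first, index2 = barcode_field.split("+")
    if len(first) != INDEX1_LEN + UMI_LEN:
        raise ValueError(f"unexpected index read length: {len(first)}")
    return IndexFields(first[:INDEX1_LEN], first[INDEX1_LEN:], index2)

fields = split_index_field("1:N:0:ACACGGTTTCAGAAAGT+NGGTTCGA")
print(fields)
```

The same function works on a full header line, since it only looks at the text after the last colon.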

Does this information make things easier for demultiplexing? Again, my apologies for my lack of knowledge on this, I really appreciate your help though.

ADD REPLY • link updated 3.9 years ago by GenoMax 141k • written 3.9 years ago by sr41489 • 0

1

What makes this tricky is the presence of UMI.

Did you take a look at the deML software I mentioned above?

Also see my answer in Demultiplex Illumina run using custom index configuration for inspiration. It may be possible to demux using substring=t option using that command line. If you can post a longer list (4 sequences each) of example reads then I can take a look at what may work. Please post full fastq headers (you seem to have redacted parts). They are fine to post since they are anonymous.

ADD REPLY • link 3.9 years ago by GenoMax 141k

0

Ah, my apologies, I have been doing both wet lab work and this, so I haven't had a deep read into this post but I will in just a moment. I've copied over the first 4 sequences from that command I posted above:

$ zless Undetermined_S0_L001_R1_001.fastq.gz
@NB500919:17:HMG2CBGXF:1:11101:12616:1037 1:N:0:ACACGGTTTCAGAAAGT+NGGTTCGA
GGTGCNATCTTCACCAAAGCCTATCAACNAATGGTGCTAGATGCAGTGANATTAANNNACTTGNNGATTTTNNNGAATGGAACAAATGGTTNNNCTNNNNNNANNNNNNNNNANAGGGTTGNTACTTGCCATNCTCCTTNTGGTAA
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEEEE#EEEAE###EEEEE##EEEEEE###AEEEEEEEEEEEEEEEE###AE######A#########E#EEEEEEE#EEEEEEEAEA#EEEAEE#<AEEEA
@NB500919:17:HMG2CBGXF:1:11101:19019:1037 1:N:0:AGTCTCACTCGAATTTT+NGGGGGGG
CTCATNGCTGTCCTTCAGGGTCTTCCTGNAATGCAGTGGTGCTTACGCTNCACCANNNAAGCANNAAACCTGNNGTATGAAGCCAGACCTCNNNGGNNNNNNTNNNNNNANNGNATGATCANACCTTTGAATNATTCTANTTTTTA
+
AAAAA#EEE6EEEA<EAEEEEEEEAEAA#EEEEE/EAAEE/EEE/EEEA#6EE/E###E/EEA##///6/EE##AEEAAE/A/<EE/EE/A###/A######E######E##A#EEEE</6#EE/A/<//E/#E/EEA<#AE<EE/
@NB500919:17:HMG2CBGXF:1:11101:2152:1037 1:N:0:AACGTGGAAATGCATCG+NCGATGTA
GCGCCNTTCTCCGCGTCGGGGCGGCCCGNAGCGCGGTGGCGCGGCGCGGNAGGGGNNNTCTGGNNCGTCCTNNNCCACCATGGCCAAACCANNNAGNNNNNNTNNNNNNNNNANGGAGAAGNTTAAGATTCTNTTGGGANTGGGAA
+
AAAAA#AEEEEEEE/AEEEEEEEEEEEE#EAEEEEEAAEEEEEAEEEEE#AEEEE###EEEEE##AEEEEE###EA/EE/EEAE<E//EEE###EA######<#########E#<E<EEEE#E/EEEAA</E#EEEEEE#EEEEEA
@NB500919:17:HMG2CBGXF:1:11101:13218:1037 1:N:0:CACGTTGTCAAACTTAT+NCACCTCA
GTGTCNAGACAAGAGGTCAAACCTAGAANGCTCACAAAAGTCAGATTTANTAAAANNNTGGAANNAAATTTGNNGTATTACCTTTCGTGGTNNNAGNNNNNNGNNNNNNCNNGNTGAAAACNTTGGCTTACTNGGAGCCNTAATTC
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEEEE#EEEEE###EEEE/##EEEEEEE##EEEEEEEEEEEEEEEEE###AE######<######/##/#EEEEEEE#EEEEEEEEAE#/EEEEA#EEEEE<

Thank you again for your help. I'm going to start reading through those posts now.

ADD REPLY • link updated 3.9 years ago by GenoMax 141k • written 3.9 years ago by sr41489 • 0

1

I did a bit of testing with BBMap. Unfortunately demuxbyname.sh (and substring mode) won't work in this case because of the UMI in Index 1. If deML does not work then you may need to write some custom code to do this demultiplexing. Were you following an established protocol or did you come up with this yourself?
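For illustration only, custom code for this could look roughly like the Python sketch below. It assumes index1 is the first 8 nt of the index read with the 9 nt UMI after it, that the expected index pairs are known, and that one mismatch per index is tolerated (an `N` counts as a mismatch). The `SAMPLES` table is a made-up sample sheet, and all function names are hypothetical. The UMI is appended to the read name with an underscore, similar to the convention UMI-tools uses.

```python
import io
from itertools import islice

INDEX1_LEN = 8  # assumption: index1 is the first 8 nt, the UMI is the rest

# Hypothetical sample sheet: (index1, index2) -> sample name.
SAMPLES = {("ACACGGTT", "AGGTTCGA"): "sampleA"}

def fastq_records(handle):
    """Yield (header, sequence, plus, quality) tuples from a FASTQ text stream."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        yield tuple(record)

def hamming(a, b):
    """Count mismatching positions; a length difference counts as mismatches."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def assign(header, samples=SAMPLES, max_mismatch=1):
    """Return (sample name or None, UMI) for one FASTQ header line."""
    barcode = header.rsplit(":", 1)[-1]
    first, index2 = barcode.split("+")
    index1, umi = first[:INDEX1_LEN], first[INDEX1_LEN:]
    hits = [name for (i1, i2), name in samples.items()
            if hamming(index1, i1) <= max_mismatch
            and hamming(index2, i2) <= max_mismatch]
    # Only accept an unambiguous match.
    return (hits[0] if len(hits) == 1 else None), umi

def demultiplex(handle, samples=SAMPLES):
    """Group records by sample and append the UMI to each read name."""
    out = {}
    for header, seq, plus, qual in fastq_records(handle):
        sample, umi = assign(header, samples)
        name, _, comment = header.partition(" ")
        tagged = f"{name}_{umi} {comment}".rstrip()
        out.setdefault(sample or "unassigned", []).append((tagged, seq, plus, qual))
    return out

demo = io.StringIO(
    "@NB500919:17:HMG2CBGXF:1:11101:12616:1037 1:N:0:ACACGGTTTCAGAAAGT+NGGTTCGA\n"
    "GGTGC\n+\nAAAAA\n"
)
result = demultiplex(demo)
print({sample: len(reads) for sample, reads in result.items()})
```

This version keeps everything in memory to stay short. For real data you would read the gzipped R1 and R2 files in lockstep and write each sample's reads straight to its own output file.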

ADD REPLY • link 3.9 years ago by GenoMax 141k

0

As far as the library prep methods, these are what I've been optimizing, so I'm incorporating UMIs to remove PCR duplicates as my off-target rate has been very high after doing target enrichment. As far as the bioinformatic methods go, I've been finding various resources on demultiplexing with UMIs present. I've got 2 types of potential inputs: raw bcl files and these "undetermined" fastq files. I tried bcl2fastq last night and that failed unfortunately (I made sure the sample sheet had the appropriate setting info for UMI detection):

(base) -bash-4.2$ nohup /usr/local/bin/bcl2fastq --runfolder-dir /home/sraj/runfolder --output-dir /home/sraj/runfolder/Data/Intensities/BaseCalls
nohup: ignoring input and appending output to ‘nohup.out’
Killed

Anyway, I'll try the deML workflow and continue researching my options if this doesn't work. Thank you so much for your help, I really appreciate it!

ADD REPLY • link 3.9 years ago by sr41489 • 0

0

AFAIK bcl2fastq can't deal with UMIs in index sequences. UMIs need to be in read 1 or read 2, so you can't use bcl2fastq here.

ADD REPLY • link 3.9 years ago by GenoMax 141k

score 0 · Answer 1 · 2020-06-12

I recently came across a recommended method for demultiplexing samples made with this kit from IDT, which provides these adapters. The method requires a reference genome for the organism. It creates BAM files in which special ZA tags capture the UMIs after the samples are demultiplexed. You need access to the full raw data folder for the Illumina flowcell.