Question

Extract UMIs using UMI-tools

2

23 months ago

I.Kim ▴ 40

Hi guys,

This is the first time we have used UMIs for RNA-seq, so I am not sure how to handle them. I would really appreciate your help.

We used the SMARTer Stranded Total RNA-Seq Kit v3 for library preparation. The manual states:

'The first 8 nt of the second sequencing read (Read 2) are UMIs (dark purple) followed by 3 nucleotides of UMI-linker (shown as NNN) and 3 nucleotides derived from the Pico v3 SMART UMI Adapter (shown as XXX).'

[Figure from the kit manual showing the Read 2 structure: 8 nt UMI, 3 nt UMI linker (NNN), 3 nt adapter (XXX)]

I would like to try the following (8 nt UMI only):

umi_tools extract -I pair.1.fastq.gz --bc-pattern=NNNNNNNN \
  --read2-in=pair.2.fastq.gz --stdout=processed.1.fastq.gz \
  --read2-out=processed.2.fastq.gz

Is this correct, or should I also include the linker and adapter (14 nt in total)?

umi_tools extract -I pair.1.fastq.gz --bc-pattern=NNNNNNNNNNNNNN \
  --read2-in=pair.2.fastq.gz --stdout=processed.1.fastq.gz \
  --read2-out=processed.2.fastq.gz

Please help me figure this out. Thanks a lot.

Kim

P.S. Here are the first records of our Read 2 FASTQ file:

@NB551656:25:H3N2YBGXK:1:11101:17925:1178 2:N:0:4
GTCATGAACGAGTCAGGCCAAGGGCATCAATTGCCCGTCACCGGAAGGCGCATTCTACGTCTACCCGTCCTGCGCC
+
AAAAAEEE////A/EEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEAEAE
@NB551656:25:H3N2YBGXK:1:11101:6227:1179 2:N:0:4
GTGGGTTCCTTTGGTCTTGTTGCGTACCTGGAGAACGGAAGAGCGTCGTGTAGGGAAATAGTGTAAGTCCAAGTGT
+
AAAAAEEA/A///EE//EE/EE/AEEE<E//E/E/AE/EE//E6/EE/E/EE/E<E<E/E/E/EAEE</6AE///A
@NB551656:25:H3N2YBGXK:1:11101:14235:1180 2:N:0:4
ACTAAGCGGTGGGGTGATCGCCGAGAGCAAAGGTAAGGCTAAGAAAGGAAGACCAGGTTGGAGCCTTGAGAAAAAT

UMI UMI-tools • 5.1k views

ADD COMMENT • link updated 16 months ago by Matthias Zepper 4.5k • written 23 months ago by I.Kim ▴ 40

1

TakaraBio tells you what to do with these in their user manual (page 25):

trim 8 nt UMIs + 3 nt UMI linker + 3 nt from Pico v3 SMART UMI Adapter from Read2 prior to mapping.

They also appear to make some software available to do this here: https://www.takarabio.com/products/next-generation-sequencing/bioinformatics-tools/cogent-ngs-analysis-pipeline

ADD REPLY • link 23 months ago by GenoMax 141k

0

Thanks a lot for your reply. I want to keep the UMIs (8 nt) but trim the linker (3 nt) and the adapter (3 nt). The Cogent software is not the best option for us because we have already bought other software. Is there an option in umi_tools to keep the 8 nt UMI but trim the following 6 nt?

ADD REPLY • link 23 months ago by I.Kim ▴ 40

score 1 · Answer 1 · 2022-05-04

Using your example, you only need to process Read 2, since according to the Takara Bio instructions that is the read that contains the UMI.

$ more test.fq
@NB551656:25:H3N2YBGXK:1:11101:17925:1178 2:N:0:4
GTCATGAACGAGTCAGGCCAAGGGCATCAATTGCCCGTCACCGGAAGGCGCATTCTACGTCTACCCGTCCTGCGCC
+
AAAAAEEE////A/EEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEAEAE
@NB551656:25:H3N2YBGXK:1:11101:6227:1179 2:N:0:4
GTGGGTTCCTTTGGTCTTGTTGCGTACCTGGAGAACGGAAGAGCGTCGTGTAGGGAAATAGTGTAAGTCCAAGTGT
+
AAAAAEEA/A///EE//EE/EE/AEEE<E//E/E/AE/EE//E6/EE/E/EE/E<E<E/E/E/EAEE</6AE///A

$ umi_tools extract --stdin=test.fq --bc-pattern=NNNNNNNN --log=processed.log --stdout=processed.fastq

$ more processed.fastq 
@NB551656:25:H3N2YBGXK:1:11101:17925:1178_GTCATGAA 2:N:0:4
CGAGTCAGGCCAAGGGCATCAATTGCCCGTCACCGGAAGGCGCATTCTACGTCTACCCGTCCTGCGCC
+
////A/EEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEAEAE
@NB551656:25:H3N2YBGXK:1:11101:6227:1179_GTGGGTTC 2:N:0:4
CTTTGGTCTTGTTGCGTACCTGGAGAACGGAAGAGCGTCGTGTAGGGAAATAGTGTAAGTCCAAGTGT
+
/A///EE//EE/EE/AEEE<E//E/E/AE/EE//E6/EE/E/EE/E<E<E/E/E/EAEE</6AE///A

After this, you can hard-trim 6 bp from the front of the processed reads to remove the UMI linker and the adapter bases, using reformat.sh from the BBMap suite.

$ reformat.sh -Xmx2g in=processed.fastq out=out.fq forcetrimleft=6

$ more out.fq

@NB551656:25:H3N2YBGXK:1:11101:17925:1178_GTCATGAA 2:N:0:4
AGGCCAAGGGCATCAATTGCCCGTCACCGGAAGGCGCATTCTACGTCTACCCGTCCTGCGCC
+
EEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEAEAE
@NB551656:25:H3N2YBGXK:1:11101:6227:1179_GTGGGTTC 2:N:0:4
TCTTGTTGCGTACCTGGAGAACGGAAGAGCGTCGTGTAGGGAAATAGTGTAAGTCCAAGTGT
+
E//EE/EE/AEEE<E//E/E/AE/EE//E6/EE/E/EE/E<E<E/E/E/EAEE</6AE///A
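
If BBMap is not at hand, the same hard trim can be approximated with plain awk. This is only a sketch: the function name `trim_left` is made up for illustration, and it assumes strict four-line FASTQ records (no wrapped sequence lines).

```shell
# Hypothetical helper: remove the first N characters from the sequence and
# quality lines of every FASTQ record read from stdin.
# Assumes 4-line records, so sequence and quality are the even-numbered lines.
trim_left() {
  awk -v n="$1" 'NR % 2 == 0 { print substr($0, n + 1); next } { print }'
}
```

It would be used as `trim_left 6 < processed.fastq > out.fq`; headers and the `+` separator lines pass through unchanged.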

score 1 · Answer 2 · 2022-05-04

I am not familiar with that particular kit, but you are almost there. The UMI is in your second read, however, so with --bc-pattern the UMI would be looked for in the wrong read!

As the UMI-tools manual describes, in string mode you can use the letters N, C and X to describe the composition of the barcode. Bases marked N (UMI) or C (cell barcode) are extracted and added to the read name, and the corresponding base qualities are removed from the read along with them. So --extract-method=string with --bc-pattern2=NNNNNNNN should work.
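
Put together for paired-end input, the call could look roughly like the sketch below. The wrapper name `extract_read2_umi` and the output file names are illustrative assumptions; the flags themselves are standard `umi_tools extract` options. In paired-end mode the UMI is appended to the names of both mates, so Read 1 and Read 2 stay in sync.

```shell
# Hypothetical wrapper (name and output file names are illustrative only).
# $1 = Read 1 FASTQ, $2 = Read 2 FASTQ,
# $3 = optional Read 2 pattern (defaults to an 8 nt UMI).
extract_read2_umi() {
  umi_tools extract \
    --extract-method=string \
    --bc-pattern2="${3:-NNNNNNNN}" \
    --stdin="$1" \
    --read2-in="$2" \
    --stdout=processed.1.fastq.gz \
    --read2-out=processed.2.fastq.gz \
    --log=processed.log
}
```

For the files in the question this would be `extract_read2_umi pair.1.fastq.gz pair.2.fastq.gz`. With the default pattern, the 6 linker and adapter bases remain at the start of Read 2.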

An even better option might be --bc-pattern2=NNNNNNNNCCCCCC, which treats the 3-base UMI linker and the 3-base adapter sequence as a cell barcode and moves them to the header as well. This trims the linker away so that it doesn't interfere with mapping later; denoting those bases with X would instead reattach them to the read, so they would not be removed. Treating the linker as a cell barcode also helps you verify that you are trimming the UMI correctly, because that sequence should always be the same. By visually inspecting a few read headers afterwards, you will know whether everything worked as expected.
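
Whether those six bases really are constant can be checked on the raw reads before choosing a pattern. The sketch below tallies positions 9 to 14 of each Read 2 sequence; the function name `linker_tally` is hypothetical, and it assumes four-line FASTQ records with the UMI starting at the first base.

```shell
# Hypothetical helper: count how often each 6-mer occurs at positions 9-14
# (directly after the 8 nt UMI) in FASTQ records read from stdin.
linker_tally() {
  awk 'NR % 4 == 2 { print substr($0, 9, 6) }' | sort | uniq -c | sort -k1,1nr
}
```

A quick look at a subset is usually enough, for example `zcat pair.2.fastq.gz | head -n 400000 | linker_tally | head`. One dominant 6-mer would support a fixed linker; a flat distribution would suggest the bases vary between reads.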

If the first base of Read 2 is not always the first UMI base, which can happen if part of the adapter sequence is still in the read, use regex mode instead. In that case, the 3' end of the i7 adapter as well as the linker sequence can be used to locate your UMI.

I couldn't find the linker sequence (you should be able to spot it easily in your reads), but the i7 adapter should be GATCGGAAGAGCACACGTCTGAACTCCAGTCAC-[index i7]-ATCTCGTATGCCGTCTTCTGCTTG. A possible regex to locate your UMI could therefore look like --bc-pattern2="(?P<discard_1>CGTCTTCTGCTTG){s<=1}(?P<umi_1>.{8})(?P<discard_2>REPLACE WITH SEQUENCE OF THE LINKER)".

You may need to adapt this depending on how much of the adapter is still present in the reads. It might also be possible to use only the linker sequence as the anchor and skip the discard_1 group entirely. Good luck with your project!
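
If the UMI does start at the first base of Read 2, regex mode can also be used purely positionally, keeping the 8 nt UMI and discarding the next 6 nt without knowing the linker sequence. The following is a sketch under that assumption; the wrapper name `extract_read2_umi_regex` and the output file names are illustrative, not part of UMI-tools.

```shell
# Hypothetical wrapper (name and output file names are illustrative only).
# $1 = Read 1 FASTQ, $2 = Read 2 FASTQ.
# umi_1 captures the first 8 bases as the UMI; discard_1 drops the next 6.
extract_read2_umi_regex() {
  umi_tools extract \
    --extract-method=regex \
    --bc-pattern2='^(?P<umi_1>.{8})(?P<discard_1>.{6}).*' \
    --stdin="$1" \
    --read2-in="$2" \
    --stdout=processed.1.fastq.gz \
    --read2-out=processed.2.fastq.gz \
    --log=processed.log
}
```

Called as `extract_read2_umi_regex pair.1.fastq.gz pair.2.fastq.gz`, this should move only the UMI to the read names, so the discarded bases do not appear in the header.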

PS: I would appreciate it if you could post the six bases that make up the NNNXXX part in the figure above, as a future reference in case we start working with the same kit. Thanks!

score 1 · Answer 3 · 2022-11-28

1

16 months ago

Ömer An ▴ 260

You can also refer to the nf-core/rnaseq pipeline documentation here: https://nf-co.re/rnaseq/3.9/usage#unique-molecular-identifiers-umi

ADD COMMENT • link 16 months ago by Ömer An ▴ 260

0

In all fairness, I have to admit that this documentation didn't exist when the question was posted, and the pipeline didn't support --bc-pattern2 before version 3.9. I added it afterwards and used this particular kit as the example because of this question ;-)

ADD REPLY • link 16 months ago by Matthias Zepper 4.5k