Question: Removing UMI with UMI tools?
1
gravatar for simplyphage
7 months ago by
simplyphage10
simplyphage10 wrote:

I have miRNA sequences where the 3' adapter is flanked with 12N(random) UMI sequence+RT primer+Universal adapter.

I have tried using regex function in UMI tools but I am not sure if it will work because of the sequence pattern I have.

Please let me know if I have any chance of processing it with UMI tools.

Bold - 3' Adapter  ||   Italics - RT primer   ||  In between both of them -UMI Sequence ||  After RT primer UNiversal adapter.
NGGGGTGTGACAATGGTGTTT**AACTGTAGGCACCATCAAT**ATACCCTATGCT*AGATCGGAAGAGCACACGTCT*GAA
NTCTGACATGTGTGCG**AACTGTAGGCACCATCAAT**ACGACGACCCTA*AGATCGGAAGAGCACACGTCT*GAACTCCA
ADD COMMENTlink modified 7 months ago • written 7 months ago by simplyphage10

What does a selection of your reads look like? Can you post 3 or 4?

ADD REPLYlink written 7 months ago by genomax70k

Sure:

@SN526:340:HTLWLBCXX:1:1103:1846:1930 1:N:0:ATCACG
NCTGAAAGGTGCTCTTGCCAATCTTCAATCTAGCCAACTGTAGGCACCATCAATTGAGACCCTCTTAGATCGGAAG
+
#<DDDIIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIHHIIIIIIIIIIG
@SN526:340:HTLWLBCXX:1:1103:2401:1817 1:N:0:ATCACG
NAGAAAGATTGAACTGTAGGCACCATCAATTAAGCCGTCGTGAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
+
#<DDDIIIIIIIIHIIIIIIIIIIIIIIIIIIIHIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGF
@SN526:340:HTLWLBCXX:1:1103:2267:1834 1:N:0:ATCACG
NGCCGGTTAGCTCAGAACTGTAGGCACCATCAATAACCCGCTTACCAGATCGGAAGAGCACACGTCTGAACTCCAG
+
#<<@DHHHHHHIHIIIIIIIIIIIIIIIGIHIIHIIIHIIIGIHIIIIIIIHHHHDHEHECGHIIGHFIHHIIHHG
@SN526:340:HTLWLBCXX:1:1103:2513:1946 1:N:0:ATCACG
NCCGCGGGAACTGTAGGCACCATCAATATACTGACTACTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATC
+
#<<<@CHC?CEH<<<<@FHFCGEGHHH?<FHFEHIIIIFHHIHIE=CGCEHHHHHHDFEHCHHCCHHIEHIHHHE@
@SN526:340:HTLWLBCXX:1:1103:2705:1971 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTAACTGTAGGCACCATCAATAGTGATGAACAGAGATCGGAAGAGCACACGTCTGAA
+
#<DDDIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIHHIIIIIIIIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:2879:1843 1:N:0:ATCACG
NCGCTGTGATGATGAACTGTAGGCACCATCAATCATGCCGTGTTAAGATCGGAAGAGCACACGTCTGAACTCCAG
+
#<<DDIIIIIIIIIIGHHHIIIGIIIIIIIIIIIIIIIIIHIIIIIIIHIIIIIIHHIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:3227:1874 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTAACTGTAGGCACCATCAATTAGTTACAAATGAGATCGGAAGAGCACACGTCTGAA
+
#<DDDIIIHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ADD REPLYlink modified 7 months ago by genomax70k • written 7 months ago by simplyphage10

Tagging: i.sudbery

ADD REPLYlink written 7 months ago by genomax70k

Thank you @genomax and @i.sudbery for your inputs.

ADD REPLYlink written 7 months ago by simplyphage10
3
gravatar for genomax
7 months ago by
genomax70k
United States
genomax70k wrote:

You could trim the RT-primer and all sequence to the end of the read, using bbduk.sh literal=AGATCGGAAGAGCACACGTCT ktrim=r k=15 minlen=0 option. That would ensure that your UMI's will now be at the 3-end most end. Then you should be able to use the --3prime option from umi_tools.

Final solution based on thread below:

bbduk.sh in=your_file.fq.gz out=stdout.fq literal=AGATCGGAAGAGCACACGTCT ktrim=r k=15 minlen=0 | umi_tools extract --extract-method=string --bc-pattern=NNNNNNNNNNNN --3-prime | bbduk.sh in=stdin.fq out=final.fq literal=AACTGTAGGCACCATCAAT ktrim=r k=15 minlen=0
ADD COMMENTlink modified 7 months ago • written 7 months ago by genomax70k

Yeah, that seems rational. I am relatively new to the sequencing world. So, Thank you so much. I will give it a try. Just a quick question. Can i also do the same thing with trim galore as well?

ADD REPLYlink modified 7 months ago • written 7 months ago by simplyphage10

See if this does what you need. I can move to an answer once you test:

bbduk.sh in=your_file.fq.gz out=stdout.fq literal=AGATCGGAAGAGCACACGTCT ktrim=r k=15 minlen=0 | umi_tools extract --extract-method=string --bc-pattern=NNNNNNNNNNNN --3-prime

results in

##  INFO Starting barcode extraction
@SN526:340:HTLWLBCXX:1:1103:1846:1930_TTAGATCGGAAG 1:N:0:ATCACG
NCTGAAAGGTGCTCTTGCCAATCTTCAATCTAGCCAACTGTAGGCACCATCAATTGAGACCCTC
+
!<DDDIIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIH
@SN526:340:HTLWLBCXX:1:1103:2401:1817_TAAGCCGTCGTG 1:N:0:ATCACG
NAGAAAGATTGAACTGTAGGCACCATCAAT
+
!<DDDIIIIIIIIHIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:2267:1834_AACCCGCTTACC 1:N:0:ATCACG
NGCCGGTTAGCTCAGAACTGTAGGCACCATCAAT
+
!<<@DHHHHHHIHIIIIIIIIIIIIIIIGIHIIH
@SN526:340:HTLWLBCXX:1:1103:2513:1946_ATACTGACTACT 1:N:0:ATCACG
NCCGCGGGAACTGTAGGCACCATCAAT
+
!<<<@CHC?CEH<<<<@FHFCGEGHHH
@SN526:340:HTLWLBCXX:1:1103:2705:1971_AGTGATGAACAG 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTAACTGTAGGCACCATCAAT
+
!<DDDIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:2879:1843_CATGCCGTGTTA 1:N:0:ATCACG
NCGCTGTGATGATGAACTGTAGGCACCATCAAT
+
!<<DDIIIIIIIIIIGHHHIIIGIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:3227:1874_TAGTTACAAATG 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTAACTGTAGGCACCATCAAT
+
!<DDDIIIHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

This does not remove the reads that don't have the right adapter (for example read 1 in your test data above). You may need to do two step bbduk.sh to remove those.

ADD REPLYlink modified 7 months ago • written 7 months ago by genomax70k

Sorry, i did try to run it on my laptop but it turned off..due to lack of memory or something i believe. I will get back to you once i get access to the cluster.

ADD REPLYlink written 7 months ago by simplyphage10

Coudn't get access to the cluster..but i tried the same thing with Cutadapt and Trim_trim galore.

But there's a problem now...I noticed that some of the sequences (Example:The second sequence) don't have the RT primer attached. How should i proceed with these type of sequences?

SN526:340:HTLWLBCXX:1:1103:1684:1854 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTGTAACTGTAGGCACCATCAATAACCTTCATTC
+
#<DDDIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:1846:1930 1:N:0:ATCACG
NCTGAAAGGTGCTCTTGCCAATCTTCAATCTAGCCAACTGTAGGCACCATCAATTGAGACCCTCTT
+
#<DDDIIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIHHI
@SN526:340:HTLWLBCXX:1:1103:2401:1817 1:N:0:ATCACG
NAGAAAGATTGAACTGTAGGCACCATCAATTAAGCCGTCGTG
+
#<DDDIIIIIIIIHIIIIIIIIIIIIIIIIIIIHIIIIIIIH
@SN526:340:HTLWLBCXX:1:1103:2267:1834 1:N:0:ATCACG
NGCCGGTTAGCTCAGAACTGTAGGCACCATCAATAACCCGCTTACC
+
#<<@DHHHHHHIHIIIIIIIIIIIIIIIGIHIIHIIIHIIIGIHII
@SN526:340:HTLWLBCXX:1:1103:2513:1946 1:N:0:ATCACG
NCCGCGGGAACTGTAGGCACCATCAATATACTGACTACT
+
#<<<@CHC?CEH<<<<@FHFCGEGHHH?<FHFEHIIIIF
@SN526:340:HTLWLBCXX:1:1103:2705:1971 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTAACTGTAGGCACCATCAATAGTGATGAACAG
+
#<DDDIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIHHI
@SN526:340:HTLWLBCXX:1:1103:2879:1843 1:N:0:ATCACG
NCGCTGTGATGATGAACTGTAGGCACCATCAATCATGCCGTGTTA
+
#<<DDIIIIIIIIIIGHHHIIIGIIIIIIIIIIIIIIIIIHIIII
@SN526:340:HTLWLBCXX:1:1103:3227:1874 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTAACTGTAGGCACCATCAATTAGTTACAAATG
+
#<DDDIIIHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:3147:1888 1:N:0:ATCACG
NAAGGCTGTGCTGACCATCGATAACTGTAGGCACCATCAATGCAACATCAACG
+
#<DDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:3141:1981 1:N:0:ATCACG
NTGAAGAAATAGAACTGTAGGCACCATCAATGACGACCGCCCT
+
#<<DDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:3713:1818 1:N:0:ATCACG
NCCGGGGAAACTGTAGGCACCATCAATCATACTAAAACT
ADD REPLYlink modified 7 months ago by genomax70k • written 7 months ago by simplyphage10

I noticed that some of the sequences (Example:The second sequence) don't have the RT primer attached.

Thinking off the top of my head. You could set maxlength=N-1 in bbduk.sh options to weed out those reads that don't get trimmed at all (which presumably lack RT primer). Note: Where N is length of your sequencing.

ADD REPLYlink modified 7 months ago • written 7 months ago by genomax70k

Yeah, we will discard those. Going through the data 96% of the seq. contain RT primer, which is workable for us. Thanks again

ADD REPLYlink written 7 months ago by simplyphage10
1

Just to complete my solution you will need another bbduk.sh pass to remove the 3'-adapter.

bbduk.sh in=your_file.fq.gz out=stdout.fq literal=AGATCGGAAGAGCACACGTCT ktrim=r k=15 minlen=0 | umi_tools extract --extract-method=string --bc-pattern=NNNNNNNNNNNN --3-prime | bbduk.sh in=stdin.fq out=final.fq literal=AACTGTAGGCACCATCAAT ktrim=r k=15 minlen=0
ADD REPLYlink modified 7 months ago • written 7 months ago by genomax70k
3
gravatar for i.sudbery
7 months ago by
i.sudbery5.2k
Sheffield, UK
i.sudbery5.2k wrote:

I'm a little confused as to your read structure, but what I think your saying is:

  1. The read starts with a variable sequence that represents the miRNA sequence.
  2. This is followed by an 3' adaptor of sequence AACTGTAGGCACCATCAAT
  3. There is then a 12 base UMI
  4. Then an RT primer, of sequence AGATCGGAAGAGCACACGTCT
  5. Finally the universal adaptor of unknown length.

In the below I'm assuming that you wish to keep 1; discard 2, 4, 5 and extract 3 as the UMI. Let me know if thats not the case.

Its definitely possible with umi-tools. The UMI-tools regex read specification make heavy use of "named capture groups". In a regular expression, a "capture group" is a part of the pattern you want to capture and use again later. To define a part of a patter as a group, you enclose these parts of the pattern parentheses, e.g. ATA(.+)GCG caputres any sequence between ATA and GCG. Capture groups are usually reffered to by number, but it is possible to give then names. Thus ATA(?P<captured_group>.+)GCG will capture the same group as before, but assign it the name "captured_group".

In UMI-tools, you basically build a regex with named capture groups. Groups named discard_n are discarded and groups named cell_n or umi_n are added to the read name. Sequence not captured by a group is left on the read.

  • Because your sequence of interset is on the 5' end of the read, your regex wants to start with any number of bases that want to be retained on the read .+
  • Next you want to find and discard the 3' adaptor: (?P<discard_1>AACTGTAGGCACCATCAAT)
  • You then want to capture and keep the next 12 characters as the UMI (?P<umi_1>.{12})
  • Finally you want to match as discard the RT primer, and anythin that comes after it (?P<discard_2>AGATCGGAAGAGCACACGTCT.+) (the .+ basically means any number of any characters)

As the start of the read is not matched by the regex, it is left alone.

Thus, your regex will want to look something like:

.+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})(?P<discard_2>AGATCGGAAGAGCACACGTCT.+)

This will only keep reads that have perfect matches to the RT primer and adaptor sequence. If you wish to allow some mismatches, you can do that by using the {s<=n} syntax. For example to allow one mismatch:

.+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=1}(?P<umi_1>.{12})(?P<discard_2>AGATCGGAAGAGCACACGTCT.+){s<=1}

Assuming that your data is single ended, the final umi_tools command without allowing mismatches might look like:

umi_tools extract --stdin=unprocessed_reads.fa.gz --stdout=processed_reads.fa.gz --extract-method=regex --bc-pattern='.+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})(?P<discard_2>AGATCGGAAGAGCACACGTCT.+)'
ADD COMMENTlink modified 7 months ago • written 7 months ago by i.sudbery5.2k

No, No. 2 is supposed to be the 3 primer Adapter. and No. 4 is supposed to be the RT primer RT Primer is followed by the Universal adapter, which varies in length.

If I remove 2 and 4 and extract UMI, I am still left with the Universal adapter, which varies in length.

ADD REPLYlink written 7 months ago by simplyphage10

You need to remove universal adapter using conventional scanning/trimming. Best option for you is to look for reads that contain RT primer and only process those further.

ADD REPLYlink written 7 months ago by genomax70k

Yeah, that seems to be the only viable option.

ADD REPLYlink written 7 months ago by simplyphage10

Answer updated to reflect this.

ADD REPLYlink written 7 months ago by i.sudbery5.2k

I gave it a try, but the no. of regex matches i got is incredibly low !

2019-01-15 18:59:01,497 INFO Input Reads: 5442010 2019-01-15 18:59:01,497 INFO regex does not match read1: 5438362 2019-01-15 18:59:01,497 INFO regex matches read1: 3648 2019-01-15 18:59:01,497 INFO Reads output: 364

ADD REPLYlink written 7 months ago by simplyphage10
1

My bad, the search for the barcode anchors at the 5' end of the read, so you need to specify that you want to accept any number of any base at the start of the read before the 3' adaptor.

ADD REPLYlink modified 7 months ago • written 7 months ago by i.sudbery5.2k

Final Command that worked for us:

umi_tools extract --stdin=BU243_S1_L001_R1_001.fastq.gz --log=withonesubs2.log --stdout=processed_reads_3.fa.gz --extract-method=regex --bc-pattern='.+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})(?P<discard_2>.+)'
ADD REPLYlink modified 7 months ago by genomax70k • written 7 months ago by simplyphage10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1514 users visited in the last hour