I'm a little confused as to your read structure, but what I think your saying is:
- The read starts with a variable sequence that represents the miRNA sequence.
- This is followed by an 3' adaptor of sequence
AACTGTAGGCACCATCAAT
- There is then a 12 base UMI
- Then an RT primer, of sequence
AGATCGGAAGAGCACACGTCT
- Finally the universal adaptor of unknown length.
In the below I'm assuming that you wish to keep 1; discard 2, 4, 5 and extract 3 as the UMI. Let me know if thats not the case.
Its definitely possible with umi-tools. The UMI-tools regex read specification make heavy use of "named capture groups". In a regular expression, a "capture group" is a part of the pattern you want to capture and use again later. To define a part of a patter as a group, you enclose these parts of the pattern parentheses, e.g. ATA(.+)GCG caputres any sequence between ATA and GCG. Capture groups are usually reffered to by number, but it is possible to give then names. Thus ATA(?P<captured_group>.+)GCG will capture the same group as before, but assign it the name "captured_group".
In UMI-tools, you basically build a regex with named capture groups. Groups named discard_n are discarded and groups named cell_n or umi_n are added to the read name. Sequence not captured by a group is left on the read.
- Because your sequence of interset is on the 5' end of the read, your regex wants to start with any number of bases that want to be retained on the read
.+
- Next you want to find and discard the 3' adaptor:
(?P<discard_1>AACTGTAGGCACCATCAAT)
- You then want to capture and keep the next 12 characters as the UMI
(?P<umi_1>.{12})
- Finally you want to match as discard the RT primer, and anythin that comes after it
(?P<discard_2>AGATCGGAAGAGCACACGTCT.+) (the .+ basically means any number of any characters)
As the start of the read is not matched by the regex, it is left alone.
Thus, your regex will want to look something like:
.+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})(?P<discard_2>AGATCGGAAGAGCACACGTCT.+)
This will only keep reads that have perfect matches to the RT primer and adaptor sequence. If you wish to allow some mismatches, you can do that by using the {s<=n} syntax. For example to allow one mismatch:
.+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=1}(?P<umi_1>.{12})(?P<discard_2>AGATCGGAAGAGCACACGTCT.+){s<=1}
Assuming that your data is single ended, the final umi_tools command without allowing mismatches might look like:
umi_tools extract --stdin=unprocessed_reads.fa.gz --stdout=processed_reads.fa.gz --extract-method=regex --bc-pattern='.+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})(?P<discard_2>AGATCGGAAGAGCACACGTCT.+)'
What does a selection of your reads look like? Can you post 3 or 4?
Sure:
Tagging: i.sudbery
Thank you @genomax and @i.sudbery for your inputs.