I'm a little confused as to your read structure, but what I think your saying is:
- The read starts with a variable sequence that represents the miRNA sequence.
- This is followed by an 3' adaptor of sequence
AACTGTAGGCACCATCAAT
- There is then a 12 base UMI
- Then an RT primer, of sequence
AGATCGGAAGAGCACACGTCT
- Finally the universal adaptor of unknown length.
In the below I'm assuming that you wish to keep 1; discard 2, 4, 5 and extract 3 as the UMI. Let me know if thats not the case.
Its definitely possible with umi-tools
. The UMI-tools regex read specification make heavy use of "named capture groups". In a regular expression, a "capture group" is a part of the pattern you want to capture and use again later. To define a part of a patter as a group, you enclose these parts of the pattern parentheses, e.g. ATA(.+)GCG
caputres any sequence between ATA
and GCG
. Capture groups are usually reffered to by number, but it is possible to give then names. Thus ATA(?P<captured_group>.+)GCG
will capture the same group as before, but assign it the name "captured_group".
In UMI-tools
, you basically build a regex with named capture groups. Groups named discard_n
are discarded and groups named cell_n
or umi_n
are added to the read name. Sequence not captured by a group is left on the read.
- Because your sequence of interset is on the 5' end of the read, your regex wants to start with any number of bases that want to be retained on the read
.+
- Next you want to find and discard the 3' adaptor:
(?P<discard_1>AACTGTAGGCACCATCAAT)
- You then want to capture and keep the next 12 characters as the UMI
(?P<umi_1>.{12})
- Finally you want to match as discard the RT primer, and anythin that comes after it
(?P<discard_2>AGATCGGAAGAGCACACGTCT.+)
(the .+
basically means any number of any characters)
As the start of the read is not matched by the regex, it is left alone.
Thus, your regex will want to look something like:
.+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})(?P<discard_2>AGATCGGAAGAGCACACGTCT.+)
This will only keep reads that have perfect matches to the RT primer and adaptor sequence. If you wish to allow some mismatches, you can do that by using the {s<=n}
syntax. For example to allow one mismatch:
.+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=1}(?P<umi_1>.{12})(?P<discard_2>AGATCGGAAGAGCACACGTCT.+){s<=1}
Assuming that your data is single ended, the final umi_tools
command without allowing mismatches might look like:
umi_tools extract --stdin=unprocessed_reads.fa.gz --stdout=processed_reads.fa.gz --extract-method=regex --bc-pattern='.+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})(?P<discard_2>AGATCGGAAGAGCACACGTCT.+)'
What does a selection of your reads look like? Can you post 3 or 4?
Sure:
Tagging: i.sudbery
Thank you @genomax and @i.sudbery for your inputs.