Hi folks,
I have reads with a known construct from which I want to extract some subsequence. As an example, this sequence represents what I'm looking at:
aaaaaaaaaaaaaaaabbbbbbbbbbbbbbccccccccccddddddddeeeeeeeeeeeeeeeeeeee
where the 'a' bases are nanopore adapter sequence, the 'b' bases are another adapter sequence, and 'c' and 'd' are the bases I want to extract. The 'c' sequences should come from a set of known sequences, while the 'd' sequences are of known length but unknown content. 'e' should be either a polyA or polyT sequence of unknown length.
I have tried analyzing this by using porechop to trim the nanopore adapter 'a', skewer (with high error tolerance) to trim 'b', then searching for a polyA or polyT sequence in order to locate 'c' and 'd' and match 'c' to the known sequences.
This whole thing seems very heuristic, and if my polyA tract has a base error near the interface with the 'd' sequence, the boundary is likely to be off by a base or two.
I feel as though my determination of the 'c' and 'd' sequence would be better if I had a way to look for a sequence using a regex to match the known sequence surrounding my sequence of interest, but also with error tolerance.
What sort of approach would you try?
I would recommend that you try bbduk.sh in filter mode, with the sequence you expect supplied in the literal= option. Here is a guide to get you started.
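If you would rather prototype the "regex with error tolerance" idea directly, here is a minimal sketch in plain Python. It is not how bbduk.sh works internally; it uses a standard semi-global edit-distance scan (Sellers' algorithm) to find where the 'b' adapter ends, then picks the best-matching known 'c' right after it and takes the next d_len bases as 'd'. The function names, the adapter and 'c' sequences, and the default tolerance of 2 errors are all made up for illustration, so substitute your own.

```python
def best_fuzzy_end(pattern, text, max_errors):
    """Return (end, errors) for the best approximate occurrence of
    pattern in text, or None if no occurrence has <= max_errors edits.
    `end` is the index just past the match; ties keep the earliest end."""
    prev = list(range(len(pattern) + 1))  # distances against empty text
    best = None
    for j, ch in enumerate(text, start=1):
        cur = [0]  # a match may start anywhere in text, so this row is free
        for i, p in enumerate(pattern, start=1):
            cur.append(min(prev[i - 1] + (p != ch),  # match or substitution
                           prev[i] + 1,              # extra base in the read
                           cur[i - 1] + 1))          # base missing from the read
        if cur[-1] <= max_errors and (best is None or cur[-1] < best[1]):
            best = (j, cur[-1])
        prev = cur
    return best


def extract_c_and_d(read, b_adapter, known_c, d_len, max_errors=2):
    """Return (c, d), or None if 'b' or 'c' cannot be placed.
    c is reported as the known sequence, not the raw read bases."""
    hit = best_fuzzy_end(b_adapter, read, max_errors)
    if hit is None:
        return None
    b_end = hit[0]
    best = None  # (errors, known c, index in read just past c)
    for c in known_c:
        # Small window after 'b'; the slack absorbs an off-by-one at the b/c junction.
        window = read[b_end:b_end + len(c) + max_errors]
        m = best_fuzzy_end(c, window, max_errors)
        if m is not None and (best is None or m[1] < best[0]):
            best = (m[1], c, b_end + m[0])
    if best is None:
        return None
    _, c, c_end = best
    d = read[c_end:c_end + d_len]
    return (c, d) if len(d) == d_len else None


# Hypothetical read: 'a' + 'b' (with one substitution) + 'c' + 'd' + polyA
read = "CATGCATGCATG" + "GGTTCCTAGGTT" + "ACGTACGTAC" + "GATTACAG" + "A" * 20
print(extract_c_and_d(read, "GGTTCCAAGGTT", ["ACGTACGTAC", "TTGACCATGA"], 8))
# -> ('ACGTACGTAC', 'GATTACAG')
```

Because 'd' is anchored on the 'b' and 'c' side rather than on the polyA/polyT tail, an error at the start of the tail no longer shifts the 'd' boundary. You could still check that the bases following 'd' look like polyA or polyT as a sanity filter.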