Nanopore technology is great for generating full-length reads, but the downside is a high error rate. Still, I need to look for the presence of a specific 20 nt sequence (or a significant substring of it) at the beginning of my reads. By visual inspection it seems to be present in my reads (based on the TTGAG at the end of the sequence), but I can see lots of errors and the sequence is often truncated (~8-12 nt recovered):
native sequence : GGTTTAATTACCCAAGTTTGAG
read 1 CCGCAAGTTTTGAG
read 2 CCGCTGACTTGAG
read 3 ACACAAGTTTGAG
read 4 CACTGATTTGAG
The nanopore error rate should be around 10-20%, but I think the fact that the sequence usually sits at one end of the read and contains a lot of repetition (CCC and TTT) is also making things worse (just an intuition, though!).
I know BBduk can be used to identify a sequence while allowing a certain tolerance, but I'm looking for something more flexible, like a Python library that I could use within my own code.
I know some tools (like Porechop) are used to identify and trim nanopore adapters from reads, so I'm basically looking for something that would do the same, minus the trimming part. I looked into the Porechop code, but it relies on C++ and I'm too much of a novice at coding to really understand how it's done:
Porechop uses SeqAn to perform its alignments in C++. This library is very flexible, but not as fast as some alternatives, such as Edlib.
Then I tried to look directly into SeqAn, and after a while I found a Python wrapper (seqanpy), but I didn't manage to install it despite following the instructions.
I also looked into libraries for fuzzy string matching (fuzzywuzzy), but I don't think they would work for finding substrings. If I compare my sequence against a whole 1 kb read, the similarity will probably come out at ~0%.
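To make the distinction concrete: what seems to be needed is approximate substring (semi-global) matching, where the pattern must align end to end but may start anywhere in the read at no cost. Below is a minimal pure-Python sketch of that idea using the standard dynamic-programming approach (often attributed to Sellers). The function names `best_infix_match` and `has_leader`, the 9 nt suffix used as the pattern, and the `max_errors` and `window` defaults are all illustrative assumptions, not anything taken from Porechop or SeqAn.

```python
def best_infix_match(pattern, text):
    """Return (edit_distance, end) for the best approximate occurrence
    of `pattern` anywhere inside `text`.

    Unlike a global comparison, the first DP row is all zeros, so the
    match may begin at any position in `text` without penalty.
    """
    prev = [0] * (len(text) + 1)  # empty pattern matches anywhere for free
    for i, p in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)  # i pattern bases against empty text
        for j, t in enumerate(text, start=1):
            curr[j] = min(
                prev[j - 1] + (p != t),  # match or substitution
                prev[j] + 1,             # pattern base missing from the read
                curr[j - 1] + 1,         # extra base in the read
            )
        prev = curr
    # Smallest distance over all possible end positions (first one on ties).
    best_end = min(range(len(text) + 1), key=prev.__getitem__)
    return prev[best_end], best_end


def has_leader(read, pattern="AAGTTTGAG", max_errors=2, window=50):
    """Hypothetical helper: is `pattern` found, with at most `max_errors`
    edits, within the first `window` bases of `read`?"""
    distance, _ = best_infix_match(pattern, read[:window])
    return distance <= max_errors


# Reads 1 and 3 from the question, searched with the last 9 nt of the
# native sequence (an assumed choice, since the 5' part is often truncated).
print(best_infix_match("AAGTTTGAG", "CCGCAAGTTTTGAG"))
print(best_infix_match("AAGTTTGAG", "ACACAAGTTTGAG"))
```

This runs in time proportional to pattern length times window length, which is fine for a short pattern and a short window but slow in pure Python over millions of reads. Edlib, named in the Porechop quote above, provides an infix alignment mode of this kind and may be a faster option; the thresholds here would need tuning on real data.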
In the end, I got lost going from one GitHub repository to another, so I thought it would be wise to come here and ask for guidance. Has anyone ever tried to do something like this? Or does anyone know of a tool that could do the trick? At the moment I don't really know where to look for an answer, but if someone can put me on the right path, at least I won't keep wasting my time like I have been recently.
Any thoughts, strategies, or comments are greatly appreciated! Thanks in advance.
Is the 20nt string either present or absent, or are you doing some barcode matching in which multiple strings are used?
Either present or absent, yes. It's a sequence that is often trans-spliced onto the 5' end of the mRNAs.