Hello,

I am working with R and Bioconductor packages. I have two large sets of sequences, such as:

```
A <- c("TATTCCTAGGTTCGCCT", "TTCCCGTTGCCCAGTGA")  # ... length ~500
B <- c("CTCCACACCAAAGCATC", "AACTGTGAGATTAATCT")  # ... length ~10,000,000
```

What I ultimately want to know is which sequences in B match each sequence in A with at most 5 mismatches, e.g. something like:

```
res$A1
5, 5000,8 000 000...
res$A2
3005, 7560,5 003 542...
```

I could use loops or one of the "apply" functions, but that takes ages.

I also looked at PDict, matchPDict and vwhichPDict. They are much more efficient, but my sequences are too short: PDict would not let me set the max.mismatch parameter to 5.

As the sequences in A and B all have exactly the same length, I do not need searches or alignments. I probably just need to count the mismatches directly, position by position. But I cannot find a way of doing that efficiently.

Any ideas please?

Many thanks

Are you allowing insertions/deletions or only substitutions?

Hello! Only substitutions.
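
With substitutions only and all sequences of equal length, the mismatch count between two sequences is simply their Hamming distance, so no alignment is needed. Below is a minimal base-R sketch of that idea, not a tuned implementation. The function names `seqs_to_raw_matrix` and `hits_within` are hypothetical, and the code assumes every sequence has the same width and contains only single-byte characters (A, C, G, T, ...). It stores B once as a raw matrix and then, for each sequence in A, accumulates mismatches one position at a time, vectorised over all of B.

```r
# Hamming-distance sketch in base R (function names are hypothetical).
# Assumption: all sequences in A and B have the same width and use
# single-byte characters.

seqs_to_raw_matrix <- function(seqs) {
  width <- nchar(seqs[1])
  stopifnot(all(nchar(seqs) == width))
  # One row per sequence, one column per position.
  matrix(charToRaw(paste(seqs, collapse = "")), ncol = width, byrow = TRUE)
}

hits_within <- function(A, B, max_mismatch = 5L) {
  B_mat <- seqs_to_raw_matrix(B)
  stopifnot(all(nchar(A) == ncol(B_mat)))
  res <- lapply(A, function(a) {
    a_raw <- charToRaw(a)
    mismatches <- integer(nrow(B_mat))
    # Add up mismatches position by position; each step compares one
    # column of B against one character of the query.
    for (j in seq_along(a_raw)) {
      mismatches <- mismatches + (B_mat[, j] != a_raw[j])
    }
    # Indices of the sequences in B within the allowed distance.
    which(mismatches <= max_mismatch)
  })
  names(res) <- paste0("A", seq_along(A))
  res
}
```

Calling `res <- hits_within(A, B, 5L)` would return a list where `res$A1` holds the indices of the sequences in B within 5 mismatches of the first sequence in A, and so on. For 10,000,000 sequences of width 17 the raw matrix alone takes about 170 MB, and building it needs a few temporary copies of similar size, so on a memory-limited machine it may be worth processing B in chunks and offsetting the returned indices.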