I am working with R and Bioconductor packages. I have two big sets of sequences such as:
A=c("TATTCCTAGGTTCGCCT", "TTCCCGTTGCCCAGTGA"....) # length ~ 500
B=c("CTCCACACCAAAGCATC", "AACTGTGAGATTAATCT") # length ~10 000 000
What I ultimately want to know is which sequences from B match each sequence of A with at most 5 mismatches, e.g. something like:
res$A1 5, 5000, 8 000 000...
res$A2 3005, 7560, 5 003 542...
I could use loops or some "apply" function, but that is taking ages.
I also looked at PDict, matchPDict and vwhichPDict. They are much more efficient, but my sequences are too short: PDict would not let me set the max.mismatch parameter to 5.
As the sequences from A and B are exactly of the same length, I do not need searches or alignments. I probably just need to calculate the number of mismatches directly. But I cannot find a way of doing it really efficiently and quickly.
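To make the idea concrete, here is a rough base-R sketch of what I mean by counting mismatches directly. The helper names to_codes and hamming_hits are made up for illustration, the B vector is toy data (two of its entries are invented so that something matches), and it assumes every sequence has the same length. It avoids a loop over B by comparing one sequence of A against a whole matrix at once, but I have not checked how it behaves on 10 000 000 sequences; I assume B would have to be processed in chunks to fit in memory.

```r
# Toy data standing in for the real vectors (A ~ 500, B ~ 10 000 000).
# The last two entries of B are invented for this illustration.
A <- c("TATTCCTAGGTTCGCCT", "TTCCCGTTGCCCAGTGA")
B <- c("CTCCACACCAAAGCATC", "AACTGTGAGATTAATCT",
       "TATTCCTAGGTTCGCCA", "TTCCCGTTGCCCAGTGA")

# Hypothetical helper: encode each sequence as a column of integer
# character codes, giving a (sequence length) x (number of sequences) matrix.
# Assumes all sequences have the same length as the first one.
to_codes <- function(x) {
  vapply(x, utf8ToInt, integer(nchar(x[1])), USE.NAMES = FALSE)
}

# Hypothetical helper: for each sequence of A, return the indices of the
# sequences of B that differ from it at no more than max_mismatch positions.
hamming_hits <- function(A, B, max_mismatch = 5L) {
  a_codes <- to_codes(A)
  b_codes <- to_codes(B)
  res <- lapply(seq_along(A), function(i) {
    # The vector a_codes[, i] is recycled down every column of b_codes,
    # so each column sum is the number of mismatches for one sequence of B.
    mismatches <- colSums(b_codes != a_codes[, i])
    which(mismatches <= max_mismatch)
  })
  names(res) <- paste0("A", seq_along(A))
  res
}

res <- hamming_hits(A, B)
```

Here res$A1 and res$A2 hold the positions in B of the sequences within 5 mismatches of each sequence of A, which is the shape of result I am after.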
Any ideas please?