complex pattern search in whole genomes
4.8 years ago

Hi, I am looking to find all occurrences of a nucleotide pattern in a multi-FASTA genome, allowing up to n mismatches in one part of the pattern and up to m mismatches in the rest.

For example:

AGCAGCATAGCAGCAAGCAGT[up to 4 mismatches]GCAGACGCA[up to 2 mismatches]

Does anyone know how to search for this type of complex pattern? An existing tool or a Perl/Python script would do. Support for ambiguity symbols such as N and R is also needed.

Thanks for posting your answers.

genome sequence alignment • 640 views

Hi, I'm not aware of any tool that does this out of the box (I would be happy to be corrected).

I have 2 suggestions:

1) Implement a fast scoring function and slide it along the sequence, window by window

  • advantages: your own rules (regarding N, R and other ambiguity codes)
  • disadvantages: more coding, slower (even with cPython)
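A rough sketch of what option 1 could look like, in case it helps. This is not an existing tool: the names `IUPAC`, `mismatches` and `search` are made up for illustration. It assumes only substitutions count (no indels), that ambiguity codes appear in the pattern rather than in the genome (a genome N counts as a mismatch against A, C, G or T), and that only the forward strand is scanned.

```python
# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatches(pattern, window):
    """Count positions where the window base is not allowed by the pattern code."""
    return sum(
        base not in IUPAC.get(code, code)
        for code, base in zip(pattern, window)
    )


def search(seq, parts):
    """Yield (start, per_part_mismatch_counts) for every passing window.

    parts is a list of (sub_pattern, max_mismatches) tuples, matched
    back to back with no gap between them (an assumption of this sketch).
    """
    seq = seq.upper()
    total = sum(len(sub) for sub, _ in parts)
    for start in range(len(seq) - total + 1):
        pos = start
        counts = []
        for sub, limit in parts:
            n = mismatches(sub, seq[pos:pos + len(sub)])
            if n > limit:
                break  # this part fails, so skip the window
            counts.append(n)
            pos += len(sub)
        else:
            # No break: every part stayed within its mismatch limit.
            yield start, counts


parts = [("AGCAGCATAGCAGCAAGCAGT", 4), ("GCAGACGCA", 2)]
for start, counts in search("CCAGCAGCATAGCAGCAAGCAGTGCAGACGCACC", parts):
    print(start, counts)
```

For a multi-FASTA file you would call `search` once per record, and once more on the reverse complement if both strands matter. Pure Python like this will be slow on large genomes, which is the disadvantage mentioned above.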

2) Use a regex library that implements approximate (fuzzy) matching, e.g. https://pypi.org/project/regex/

  • advantages: very little code (a super simple script), fast
  • disadvantages: no built-in handling of ambiguity codes (N, R, ...)
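A minimal sketch of option 2, assuming the third-party `regex` package is installed (`pip install regex`; it is not the standard-library `re`). It relies on that package's fuzzy-matching syntax, where `{s<=4}` after a group allows up to 4 substitutions within the group. The helper names `build_fuzzy_pattern` and `find_fuzzy` are hypothetical, and limits are assumed to be at least 1.

```python
try:
    import regex  # third-party package, not the stdlib "re"
except ImportError:
    regex = None


def build_fuzzy_pattern(parts):
    """Turn [(sub_pattern, max_substitutions), ...] into one fuzzy regex string."""
    # Doubled braces are literal braces inside an f-string.
    return "".join(f"(?:{sub}){{s<={limit}}}" for sub, limit in parts)


def find_fuzzy(seq, parts):
    """Return (start, matched_text) for every hit, including overlapping ones."""
    if regex is None:
        raise RuntimeError("the 'regex' package is required: pip install regex")
    pattern = build_fuzzy_pattern(parts)
    return [
        (m.start(), m.group())
        for m in regex.finditer(pattern, seq, overlapped=True)
    ]


parts = [("AGCAGCATAGCAGCAAGCAGT", 4), ("GCAGACGCA", 2)]
print(build_fuzzy_pattern(parts))
# (?:AGCAGCATAGCAGCAAGCAGT){s<=4}(?:GCAGACGCA){s<=2}
```

Only substitutions are allowed here (`s`), so every hit has the same length as the pattern. As with option 1, this scans the forward strand only, so the reverse complement needs a second pass.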
