How to find sequence patterns in genome?
3
0
6.0 years ago
Parham ★ 1.6k

Hi,

I want to find a sequence pattern in a genome. For example, take the pattern (G4N(1-10))5, which means four guanines followed by 1 to 10 bases of any kind (A, T, G, or C), with this whole unit repeated five times.

I have a FASTA file of the organism I work with, and I have basic knowledge of Python and regex. Is there a package or library that does this, or should I write all the code myself? Initially I only want to know how many times the pattern occurs in the reference sequence, but later it would be useful to know the start and stop positions as well.

pattern genome • 3.3k views
2
6.0 years ago

Here is a simple template script that prints the coordinates of the matching pattern in BED format. It finds the pattern only once. You could explore it further.

https://gist.github.com/gouthamatla/066f3607b5f96012b4dc
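For readers who cannot open the gist, the general idea can be sketched in a few lines of Python. This is not the gist's code, only a minimal illustration under some assumptions: the pattern is the (G4N(1-10))5 example from the question, the helper names `read_fasta` and `find_hits` are made up for this sketch, only the forward strand is scanned, and matches are non-overlapping.

```python
import re

# Assumed pattern from the question: (G4 N(1-10)) repeated 5 times.
# IGNORECASE lets the pattern also match soft-masked (lowercase) sequence.
G4_PATTERN = re.compile(r"(?:G{4}[ACGT]{1,10}){5}", re.IGNORECASE)


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the sequence name.
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_hits(lines, pattern=G4_PATTERN):
    """Yield (name, start, end) per non-overlapping forward-strand match.

    Coordinates are 0-based and half-open, which is the BED convention.
    """
    for name, seq in read_fasta(lines):
        for match in pattern.finditer(seq):
            yield name, match.start(), match.end()
```

With a file on disk, something like `for chrom, start, end in find_hits(open("genome.fa")): print(chrom, start, end, sep="\t")` would print BED-style lines, where `genome.fa` is a placeholder name. Counting the hits answers the "how many" part of the question. Note that the final N(1-10) is greedy, so each match may extend up to 10 bases past the last G run.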

0

Thanks for sharing it.

1
6.0 years ago

FIMO seems to fit; it can take a regex or a weight matrix (WM) as input.

0

I cannot figure out how it uses regex. With every regex I try, the motif turns red, which means it is not accepted. I am trying to make a motif of three Gs, then up to 20 nucleotides of anything (ACGT), and then 4 Gs again.

But it seems I cannot write something like G{3}N{1,20}G{3}. Do you know what I am missing?

1

Maybe nothing. I think FIMO doesn't support extended POSIX expressions.

Dreg from EMBOSS is a command-line program that supports PCRE expressions; otherwise Perl, Python, and PHP all provide extended regular expressions. Here is a web server.

Note that dreg and standard PCRE do not know about ambiguity codes, so you have to write [ACGT] if you want to match any nucleotide, or [ACGTNYRW ...] or simply . if your sequence itself contains ambiguity codes.
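If you would rather keep writing motifs with IUPAC codes and still use a plain regex engine, one option is to translate the codes into character classes first. The following is only a sketch: `iupac_to_regex` is a hypothetical helper name, the table is the standard IUPAC nucleotide code set, and any character that is not an IUPAC letter (braces, digits, commas) is passed through unchanged, so it only suits simple motifs.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to the bases they cover.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(motif):
    """Expand IUPAC letters in a motif into regex character classes."""
    parts = []
    for ch in motif:
        bases = IUPAC.get(ch.upper())
        if bases is None:
            parts.append(ch)            # regex syntax such as {1,20}
        elif len(bases) == 1:
            parts.append(bases)         # unambiguous base, keep as is
        else:
            parts.append("[" + bases + "]")
    return "".join(parts)


pattern = iupac_to_regex("G{3}N{1,20}G{3}")
print(pattern)  # G{3}[ACGT]{1,20}G{3}
```

The resulting string can be passed to `re.compile` or to any PCRE-based tool. It matches unambiguous bases in the sequence only; if the sequence itself contains ambiguity codes, the character classes would need to be widened as described above.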

0

Dreg is great. I used the command line version and it does what I need. Thanks for recommending it.

0

Regexes are explicit: 'N' doesn't mean any nucleotide, it means the character N. Try '[ACGT]' instead.
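A quick way to see the difference is to run both versions on a toy sequence with Python's `re` module. The sequence here is made up for illustration.

```python
import re

seq = "TTGGGACGTGGGTT"  # toy sequence: GGG, four other bases, GGG

# 'N' is a literal character here, so nothing matches.
literal_n = re.findall(r"G{3}N{1,20}G{3}", seq)

# '[ACGT]' matches any one of the four bases.
any_base = re.findall(r"G{3}[ACGT]{1,20}G{3}", seq)

print(len(literal_n), len(any_base))  # 0 1
```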

0

Indeed, FIMO can interpret IUPAC ambiguity codes correctly, while PCRE-based programs do not. FIMO doesn't support PCRE's {1,20} repetition-range quantifiers, though.

0

No, I didn't use N as regex. According to the FIMO manual, N stands for any base. Cheers!

0
7 weeks ago
Wayne ▴ 680

The pattern matching tool offered by the Saccharomyces Genome Database (SGD) and other genome sites is based on PatMatch.

SGD has a nice, concise guide to the syntax for PatMatch patterns. PatMatch patterns allow N, X, or . to stand for any residue or base, and so are more familiar to biologists than regular expressions. PatMatch also allows IUPAC ambiguity codes.

You can run the PatMatch software yourself. I have a GitHub repository from which you can easily launch environments, served via the MyBinder.org service, with PatMatch already installed. The launched sessions include several notebooks demonstrating how to use it with any genome sequence you can provide, as well as how to combine PatMatch results with Python for downstream analysis. Go to my patmatch-binder repo, click on the launch binder badge, and work through the Jupyter notebooks once the session launches.