How to find sequence patterns in genome?
6.0 years ago
Parham ★ 1.6k

Hi,

I want to find a sequence pattern in a genome. For example, the pattern (G4N(1-10))5, which translates to 4 guanines followed by 1 to 10 bases of any kind (A, T, G or C), with this whole unit repeated 5 times.

I have a FASTA file of the organism that I work with, and I have basic knowledge of Python and regex. Is there a package or library that does this, or should I write the whole code myself? Initially I only want to know how many times the pattern occurs in the reference sequence, but later it will be useful to know the start and stop positions as well.

Thanks in advance for your help!

pattern genome • 3.2k views
6.0 years ago

Here is a simple template script that prints the coordinates of the matching pattern in BED format. It finds the pattern only once. You could explore it further.

https://gist.github.com/gouthamatla/066f3607b5f96012b4dc
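For readers who want something inline, here is a minimal sketch of the same idea. It is not the code from the gist, just an independent illustration: the helper names `read_fasta` and `pattern_to_bed` and the toy record `chrDemo` are made up for this example. It assumes a plain FASTA input, searches the forward strand only, and reports non-overlapping matches as 0-based, end-exclusive coordinates, which is the BED convention.

```python
import io
import re

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def pattern_to_bed(handle, pattern):
    """Return (chrom, start, end) tuples for every non-overlapping match."""
    regex = re.compile(pattern, re.IGNORECASE)
    rows = []
    for name, seq in read_fasta(handle):
        for match in regex.finditer(seq):
            # re match offsets are already 0-based and end-exclusive.
            rows.append((name, match.start(), match.end()))
    return rows

# Toy input; for a real genome, pass open("genome.fa") instead (hypothetical path).
demo = io.StringIO(">chrDemo toy record\nAAGGGGTT\nGGGGACGT\n")
for chrom, start, end in pattern_to_bed(demo, r"G{4}"):
    print(f"{chrom}\t{start}\t{end}")
```

Because the sequence lines of each record are joined before searching, a match that spans a line break in the FASTA file is still found.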


Thanks for sharing it.

6.0 years ago

FIMO seems to fit; it can take a regex or a weight matrix (WM) as input.


I cannot figure out how it uses regex. With every regex I try, the motif turns red, which means it is not accepted. I am trying to make a motif of three Gs, then up to 20 nucleotides of any kind (ACGT), and then 4 Gs again.

But it seems I cannot write something like G{3}N{1,20}G{3}. Do you know what I am missing?


Maybe nothing. I think FIMO doesn't support extended POSIX expressions.

dreg from EMBOSS supports PCRE expressions as a command-line program; otherwise Perl, Python and PHP all provide extended regular expressions. Here is a web server.

Note that dreg and standard PCRE do not know about ambiguity codes, so you have to write [ACGT] if you want to match all nucleotides, or [ACGTNYRW ...] or simply . if your sequence itself contains ambiguity codes.
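To make the difference concrete, here is a small Python sketch using the pattern from the original question, (G4N(1-10))5. The names `G4_STRICT`, `G4_LOOSE`, `find_spans` and the toy sequence are invented for this illustration. It assumes an uppercase sequence on the forward strand only; `re.finditer` reports non-overlapping matches and the quantifiers are greedy, so overlapping occurrences are not counted separately.

```python
import re

# (G4N(1-10))5 with the spacer N spelled out as an explicit character class.
G4_STRICT = re.compile(r"(?:G{4}[ACGT]{1,10}){5}")
# Same pattern with "." as the spacer, which also accepts ambiguity codes such as N.
G4_LOOSE = re.compile(r"(?:G{4}.{1,10}){5}")

def find_spans(regex, seq):
    """Return (start, end) for each non-overlapping match, 0-based, end-exclusive."""
    return [(m.start(), m.end()) for m in regex.finditer(seq)]

# Toy sequence: five GGGG runs separated by "AC", followed by two literal N characters.
toy = "TT" + "GGGGAC" * 5 + "NNTT"

strict_hits = find_spans(G4_STRICT, toy)
loose_hits = find_spans(G4_LOOSE, toy)
print(len(strict_hits), strict_hits)  # 1 [(2, 32)]
print(len(loose_hits), loose_hits)    # 1 [(2, 36)]
```

The strict version stops at the literal N characters, while the "." version runs the last spacer through them, which is why the two reported end positions differ.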


Dreg is great. I used the command line version and it does what I need. Thanks for recommending it.


Regexes are explicit: 'N' doesn't mean any nucleotide, it means the character N. Try '[ACGT]' instead.


Indeed, FIMO can interpret IUPAC ambiguity codes correctly, while PCRE-based programs cannot. FIMO doesn't support the {1,20} occurrence-range syntax of PCRE, though.
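If you want both IUPAC codes and occurrence ranges in a PCRE-style tool, one option is to expand the IUPAC letters into character classes before compiling the pattern. The sketch below is only an illustration: `IUPAC` and `iupac_to_regex` are names made up here, and it assumes the motif consists of plain IUPAC letters plus quantifiers or parentheses, with no hand-written bracket classes already in it.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Expand IUPAC letters into character classes; leave other characters as they are."""
    return "".join(IUPAC.get(ch, ch) for ch in motif.upper())

# The motif from the comment above, written with N and a {1,20} range.
pattern = iupac_to_regex("G{3}N{1,20}G{3}")
print(pattern)  # G{3}[ACGT]{1,20}G{3}

hit = re.search(pattern, "AAGGGTTTTGGGAA")
print(hit.span())  # (2, 12)
```

Digits, braces and parentheses pass through unchanged, so the range quantifiers keep working after the expansion.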


No, I didn't use N as a regex character. According to the FIMO manual, N stands for any base. Cheers!

6 weeks ago
Wayne ▴ 640

The pattern matching tool offered by the Saccharomyces Genome Database (SGD) and other genome sites is based on PatMatch.

SGD has a nice, concise guide to the syntax for PatMatch patterns. PatMatch patterns allow N, X or . to stand for any residue or base, so they are more familiar to biologists than regular expressions. PatMatch also allows IUPAC ambiguity codes.

You can run the PatMatch software yourself, and I have a GitHub repository from which you can easily launch environments, served via the MyBinder.org service, with PatMatch already installed. The launched sessions include several notebooks demonstrating how to use it with any genome sequence you can provide, as well as how to combine PatMatch results with Python for downstream analysis. Go to my patmatch-binder repo, click on the launch binder badge, and work through the Jupyter notebooks once the session launches.
