Question: How to find sequence patterns in genome?
3.8 years ago by
Parham1.4k
Sweden
Parham1.4k wrote:

Hi,

I want to find a sequence pattern in a genome, for example (G4N(1-10))5: four guanines followed by 1 to 10 bases of any kind (A, T, G, or C), with this unit repeated five times.

I have a FASTA file for the organism I work with, and I have basic knowledge of Python and regex. Is there a package or library that does this, or should I write the code myself? Initially I only want to know how many times the pattern occurs in the reference sequence, but later it would be useful to know the start and stop positions as well.

Thanks for help in advance!

pattern genome • 2.1k views
ADD COMMENTlink modified 3.8 years ago by Michael Dondrup46k • written 3.8 years ago by Parham1.4k
2

Here is a simple template script that prints the coordinates of the matching pattern in BED format. It finds the pattern only once. You could explore it further.
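A minimal sketch of what such a script could look like (this is not the original script). It assumes a plain FASTA file, reads the pattern from the question as the regex (?:G{4}[ACGT]{1,10}){5}, and reports every non-overlapping match on the given strand. The names read_fasta, find_bed, and print_bed are made up for illustration.

```python
import re

# Regex reading of (G4N(1-10))5 from the question (an assumption about the intended motif).
G4_PATTERN = re.compile(r"(?:G{4}[ACGT]{1,10}){5}", re.IGNORECASE)


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_bed(lines, pattern=G4_PATTERN):
    """Yield (chrom, start, end) for each non-overlapping match, 0-based half-open."""
    for name, seq in read_fasta(lines):
        for match in pattern.finditer(seq):
            yield name, match.start(), match.end()


def print_bed(path, pattern=G4_PATTERN):
    """Print matches in a FASTA file as tab-separated BED lines."""
    with open(path) as handle:
        for chrom, start, end in find_bed(handle, pattern):
            print(f"{chrom}\t{start}\t{end}")
```

Calling print_bed("genome.fa") (a hypothetical file name) writes one BED line per hit, and len(list(find_bed(open("genome.fa")))) gives the count. Each sequence is held in memory as a single string, and only the strand present in the file is searched.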

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by geek_y10.0k

Thanks for sharing it. 

ADD REPLYlink written 3.8 years ago by Parham1.4k
3.8 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

FIMO seems to fit; it can use a regex or a weight matrix (WM) as input.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Michael Dondrup46k

I cannot figure out how it uses regex. With every regex function I try, the motif turns red, which means it is not accepted. I am trying to make a motif of three Gs, then up to 20 nucleotides of any kind (ACGT), and then 4 Gs again.
But it seems I cannot write something like G{3}N{1,20}G{3}. Do you know what I am missing?

ADD REPLYlink written 3.8 years ago by Parham1.4k
1

Maybe nothing; I think FIMO doesn't support extended POSIX expressions.

Dreg from EMBOSS is a command-line program that supports PCRE expressions; otherwise Perl, Python, and PHP all provide extended regular expressions. Here is a web server.

Note that dreg and standard PCRE do not know about ambiguity codes, so you have to write [ACGT] if you want to match all nucleotides, or [ACGTNYRW ...] or simply . if your sequence itself contains ambiguity codes.
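If you prefer to keep writing motifs with ambiguity codes, one option is to translate them into character classes before handing them to the regex engine. The sketch below is only an illustration: the helper name iupac_to_regex is made up, the table covers the standard IUPAC nucleotide codes for DNA, and it assumes an uppercase motif that does not already contain bracketed classes (letters inside [...] would be expanded too).

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Expand each IUPAC letter; quantifiers and parentheses pass through unchanged."""
    return "".join(IUPAC.get(ch, ch) for ch in motif)


# The motif from the comment above, now usable with Python's re module.
pattern = re.compile(iupac_to_regex("G{3}N{1,20}G{3}"))
match = pattern.search("AAGGGTTGGGA")
```

Here iupac_to_regex("G{3}N{1,20}G{3}") gives G{3}[ACGT]{1,20}G{3}, and match.span() covers the two G runs and the spacer between them.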

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Michael Dondrup46k

Dreg is great. I used the command line version and it does what I need. Thanks for recommending it. 

ADD REPLYlink written 3.8 years ago by Parham1.4k

Regexes are explicit: 'N' doesn't mean any nucleotide, it means the character N. Try '[ACGT]' instead.
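A quick way to see the difference in Python, as a small sketch with made-up example sequences:

```python
import re

literal_n = re.compile(r"G{3}N{1,20}G{3}")      # N is the literal character N
any_base = re.compile(r"G{3}[ACGT]{1,20}G{3}")  # spacer may be any of A, C, G, T

plain = "GGGATCGGG"   # ordinary bases between the G runs
masked = "GGGNNNGGG"  # literal N characters between the G runs

plain_with_n = literal_n.search(plain)       # None: there is no N in the sequence
plain_with_class = any_base.search(plain)    # matches the whole string
masked_with_n = literal_n.search(masked)     # matches, because the spacer really is N
masked_with_class = any_base.search(masked)  # None: N is not in [ACGT]
```

The last case is worth keeping in mind for assemblies with masked or unknown bases: [ACGT] will not match across a run of N characters.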

ADD REPLYlink written 3.8 years ago by harold.smith.tarheel4.5k

Indeed, FIMO can interpret IUPAC ambiguity codes correctly, while PCRE-based programs cannot. FIMO doesn't support the {1,20} occurrence-range syntax of PCRE, though.

ADD REPLYlink written 3.8 years ago by Michael Dondrup46k

No, I didn't use N as a regex. According to the FIMO manual, N stands for any base. Cheers!

ADD REPLYlink written 3.8 years ago by Parham1.4k