Question: 7N Motif Search Over The Genome
7.3 years ago by
PoGibas4.8k
Vilnius
PoGibas4.8k wrote:

I have very short words to search for (microRNA target sequences).

I want to test for enrichment of those motifs in my DNA sequence compared with many randomly simulated genome sequences of the same length. (I will convert RNA to DNA before the search.)

What approach or tool should I use for such a short motif search?

I have heard about Vmatch, but is there free software that does this?

Really looking forward to your answers and suggestions...

PS: A simple Perl script (search for motif x within y.fa) would also work fine.

motif • 2.8k views
ADD COMMENTlink modified 6.3 years ago • written 7.3 years ago by PoGibas4.8k
7.3 years ago by
Farhat2.9k
Pune, India
Farhat2.9k wrote:

You can use the following script for that. If you save it as patt_search.pl, the usage is: perl patt_search.pl fasta_file.fa AATTATA TATA ... You can give any number of motif sequences, and it recognizes IUPAC DNA ambiguity codes. The output looks a bit odd because I used it as input to another program; it looks like this:

{"chrX:6362554-6365728",{{"TAATTA"}, {260, 2466, 2875}}, {{"CCCCCCCC"}, {1412}}},
{"chrX:6379561-6405165",{{"TAATTA"}, {275, 776, 1048, 1226, 1722, 2753, 3585, 3644, 4951, 5084, 11164, 12712, 16259, 17695, 18211, 18574, 18745, 19204, 19838, 19859, 21405, 23529, 23740, 24372}}, {{"CCCCCCCC"}, {4536, 5673, 9148, 12449, 14132, 16375, 20132, 20140, 21463, 21471, 21975}}},

Each record contains the FASTA header, followed by each motif searched for and all the positions at which it was found within that sequence. The program can be downloaded from https://github.com/Farhat/patt_search

ETA: It can now handle more complex DNA patterns such as TTA{3,7}T and their reverse complements.
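
For readers who want to see the general idea without Perl, here is a minimal Python sketch of this kind of search. It is not the code of patt_search.pl; the function names (motif_to_regex, reverse_complement, read_fasta, find_motif) are made up for illustration, and reporting 1-based start positions is an assumption, since the thread does not state which convention the script uses. It expands IUPAC ambiguity codes into regex character classes and scans both the motif and its reverse complement.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement of every IUPAC code (e.g. R = A/G pairs with Y = T/C).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def motif_to_regex(motif):
    """Expand IUPAC codes in a plain motif (no quantifiers) to a regex."""
    return "".join(IUPAC[base] for base in motif.upper())


def reverse_complement(motif):
    """Reverse complement of a plain IUPAC motif."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_motif(seq, motif):
    """Return sorted start positions of the motif on either strand.

    Positions are 1-based (an assumption made for this sketch).
    """
    seq = seq.upper()
    hits = set()
    for m in {motif.upper(), reverse_complement(motif)}:
        # A zero-width lookahead also reports overlapping occurrences.
        pattern = re.compile("(?=" + motif_to_regex(m) + ")")
        hits.update(match.start() + 1 for match in pattern.finditer(seq))
    return sorted(hits)
```

To use it on a file, something like `for header, seq in read_fasta(open("y.fa")): print(header, find_motif(seq, "TAATTA"))` should be enough, where y.fa stands for your own FASTA file.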

ADD COMMENTlink modified 6.7 years ago • written 7.3 years ago by Farhat2.9k

Thanks! You saved me two days at least! :) Does it do rev/comp too?

ADD REPLYlink written 7.3 years ago by PoGibas4.8k

Yes, it will search for reverse complements too. You can also use IUPAC ambiguity codes and N to match any base.

ADD REPLYlink modified 6.7 years ago • written 7.3 years ago by Farhat2.9k

Does this code also find patterns like ACA{0,7}TG, so that ACAAAAAAATG, ACAATG, and ACAAAATG in the input are detected? And does N (for A, T, G, or C) also work?

As an extension, I would like to ask whether it can read a multi-FASTA file and report each header. That would be a great help, and I could get this done!

PS: I am not a Perl person yet ;) I would love to use the code just as it is and format the output to my needs (basically a BED file), if it works!
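
As a rough illustration of the BED idea, here is a small Python sketch that turns one record of the output format shown in the answer into BED-like rows. Everything about it is an assumption: the function name record_to_bed is hypothetical, the reported positions are taken to be 1-based starts within the sequence, the FASTA header is used as the name in the first column, and the end coordinate is only correct for plain motifs of fixed length.

```python
import re

# First quoted string in the record is the FASTA header.
HEADER = re.compile(r'^\{"([^"]+)"')
# Each hit group looks like {{"MOTIF"}, {pos, pos, ...}}.
HITS = re.compile(r'\{\{"([^"]+)"\},\s*\{([^}]*)\}\}')


def record_to_bed(line):
    """Convert one output record into (name, start, end, motif) rows.

    Assumes positions are 1-based starts, so BED start = position - 1.
    """
    header_match = HEADER.match(line.strip())
    if header_match is None:
        raise ValueError("unrecognised record: " + line)
    name = header_match.group(1)
    rows = []
    for motif, positions in HITS.findall(line):
        for pos in positions.split(","):
            if not pos.strip():
                continue  # motif listed with no hits
            start = int(pos) - 1
            rows.append((name, start, start + len(motif), motif))
    return rows
```

Each row can then be written out tab-separated, for example with `print("\t".join(map(str, row)))`.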

ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by k.nirmalraman1000

No, it will not work for general regular expressions. Expansion of N isn't supported yet, but that is a minor change. I'll edit the program to include it.

ADD REPLYlink written 6.7 years ago by Farhat2.9k

Thank you very much!!

Just for the record, DNA pattern matching with some advanced options is available here as part of the RSAT tools. However, it cannot be integrated into an analysis pipeline, which is what I would like. :)

ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by k.nirmalraman1000

I was actually hoping to extend this script a bit to find character repetitions like the ones I mentioned above, i.e., ACA\{0,7\}TG to find ACAAAAAAATG, ACAATG, and so on.

I added $patt =~ s/\d+/$&/g; to the replace_ambiguous subroutine, just before the return statement.

I also changed the reverse complement step to $revcomp =~ tr/ACGTacgt[]{}N/TGCAtgca][}{./; to accommodate the brackets [ ] and braces { }.

The problem is the pattern I end up searching for on the reverse strand.

E.g., input argument: CR\{7,10\}N\{5,8\}ATGC

Generated forward-strand lookup: C[AG]{7,10}[ACGT]{5,8}ATGC

Generated reverse-strand lookup: GCAT{8,5}[ACGT]{01,7}[CT]G

The reverse complement string is wrong: the quantifiers are reversed character by character and end up attached to the wrong bases. With my limited knowledge, I don't see an easy way to fix this. Could you help me with it?

ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by k.nirmalraman1000

This is indeed a bit more complicated, but it can be solved with regular expressions. You can download the modified program at https://github.com/Farhat/patt_search. You will have to enclose your patterns in quotes on the command line to prevent the shell from parsing the braces.
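
One possible way to handle the reverse strand, sketched here in Python rather than taken from the modified patt_search.pl: split the pattern into units of one IUPAC letter plus its optional {m,n} quantifier, reverse the order of the units, and complement only the letters, so each quantifier stays intact and attached to its own base. The names tokenize, forward_regex, and reverse_regex are hypothetical, and the sketch assumes patterns contain only IUPAC letters and brace quantifiers.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = dict(zip("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN"))

# One unit: an IUPAC letter, optionally followed by {m}, {m,} or {m,n}.
TOKEN = re.compile(r"([ACGTRYSWKMBDHVN])(\{\d+(?:,\d*)?\})?")


def tokenize(pattern):
    """Split a pattern into (letter, quantifier) pairs."""
    # Drop backslashes that were only there to escape braces for the shell.
    pattern = pattern.upper().replace("\\", "")
    tokens, pos = [], 0
    while pos < len(pattern):
        match = TOKEN.match(pattern, pos)
        if match is None:
            raise ValueError("cannot parse pattern at offset %d" % pos)
        tokens.append((match.group(1), match.group(2) or ""))
        pos = match.end()
    return tokens


def forward_regex(pattern):
    """Regex for the forward strand."""
    return "".join(IUPAC[base] + quant for base, quant in tokenize(pattern))


def reverse_regex(pattern):
    """Regex for the reverse strand: reverse the units, complement letters."""
    return "".join(
        IUPAC[COMPLEMENT[base]] + quant
        for base, quant in reversed(tokenize(pattern))
    )
```

With the example from the comment above, forward_regex("CR{7,10}N{5,8}ATGC") gives C[AG]{7,10}[ACGT]{5,8}ATGC and reverse_regex gives GCAT[ACGT]{5,8}[CT]{7,10}G.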

ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by Farhat2.9k