Question

Seed Search Of Pattern Sequence In A Given Fasta File Database

0

Entering edit mode

11.5 years ago

Varun Gupta ★ 1.3k

Hello Everyone

I have a fasta file of seed sequences. It looks like this

>hsa-miR-140-5p_1_8
AACCACTG
>hsa-miR-140-5p_1_7
ACCACTG
>hsa-miR-140-5p_1_6
CCACTG
>hsa-miR-140-5p_2_8
AACCACT
>hsa-miR-140-5p_2_7
ACCACT

This fasta file consists of miRNA seed sequences and the maximum seq length is 8 and minimum is 6. lets name it as seed.fa

Also I have created a fasta file of miRNA sequences lets say miRNA.fa

I would like to search this miRNA.fa file for seed sequences in file seed.fa

I built the index of the miRNA.fa using bowtie-build and then used this command for alignment

/apps1/bowtie -n 0 -a --sam --threads 2  /BOWTIE/index -f seed.fa > seed_out.fa.sam

I purposely used n as 0 because i don't want any mismatches in the seed itself(as they are very small).

Is there any other way i can do the seed search for file seed.fa for my miRNA.fa sequences

Let me know

Thanks

Varun

search • 4.1k views

ADD COMMENT • link updated 11.5 years ago by Pavel Senin ★ 1.9k • written 11.5 years ago by Varun Gupta ★ 1.3k

0

Entering edit mode

Why don;t you try SHRiMP2 aligner in miRNA alignment mode. They have also explained what seeds to use when aligning miRNAs. http://compbio.cs.toronto.edu/shrimp/README Ctrl+F "miRNA"

ADD REPLY • link 11.5 years ago by Ashutosh Pandey 12k

score 0 · Answer 1 · 2014-01-13

Since you don't want mismatches, it's quite easy to implement using regular expressions in any programming language, or even using a command line tool. It can even be done by using a substring identification function.

I'll do here a brief example in perl using index function (http://perldoc.perl.org/functions/) but it's quite straightforward in most programming languages. For instance:

if (index($mirnasequence, $seed) != -1) { print "$mirnasequence contains $seed sequence!\n"; ......do whatever necessary here.... }

And place this in a double loop (for all seeds, seach in all miRNAs)

If you don't want to code then the bowtie solution is quite easy and robust. Another option using minimal code is also using grep from command line and putting it in a loop to search seed matches in a miRNA text file.

Cheers,

IV

score 0 · Answer 2 · 2014-01-13

you can do this in R too, if you are after exact matches:

require(Biostrings)
require(reshape)
a = melt(readDNAStringSet("~/tmp/1.fa"))
b = melt(readDNAStringSet("~/tmp/2.fa"))
a$names=row.names(a)
b$names=row.names(b)
merge(a,b,by="value")

here is what data looks like:

> a
                      value              names
hsa-miR-140-5p_1_8 AACCACTG hsa-miR-140-5p_1_8
hsa-miR-140-5p_1_7  ACCACTG hsa-miR-140-5p_1_7
hsa-miR-140-5p_1_6   CCACTG hsa-miR-140-5p_1_6
hsa-miR-140-5p_2_8  AACCACT hsa-miR-140-5p_2_8
hsa-miR-140-5p_2_7   ACCACT hsa-miR-140-5p_2_7
> 
> b
                     value              names
hsa-miR-140-5p_1_7 ACCACTG hsa-miR-140-5p_1_7
hsa-miR-140-5p_1_6  CCACTG hsa-miR-140-5p_1_6

and the result:

> merge(a,b,by="value")
    value            names.x            names.y
1 ACCACTG hsa-miR-140-5p_1_7 hsa-miR-140-5p_1_7
2  CCACTG hsa-miR-140-5p_1_6 hsa-miR-140-5p_1_6

if the seed is say in the b data frame and you are looking for seeds hits in a dataset, it can be done like this:

res = apply(b, 1, function(x) {apply(a,1, function(z) grep(x[1],z[1]))})
unlist(res)

which yields

> unlist(res)
hsa-miR-140-5p_1_7.hsa-miR-140-5p_1_8 hsa-miR-140-5p_1_7.hsa-miR-140-5p_1_7 
                                    1                                     1 
hsa-miR-140-5p_1_6.hsa-miR-140-5p_1_8 hsa-miR-140-5p_1_6.hsa-miR-140-5p_1_7 
                                    1                                     1 
hsa-miR-140-5p_1_6.hsa-miR-140-5p_1_6 
                                    1