Question

Find similar sequences between different sources

0

9.7 years ago

sam ▴ 130

Hello,

I have sets of sequences from different sources: around 20 FASTA files, one per source, each containing around 1,000 sequences.

I'm interested in identifying similar sequences that appear in more than one FASTA file. In other words, I might find that sequence A appears in all 20 FASTA files, sequence B in only 10, and sequence C in just 2.

Are there any tools that can do this? If not, any ideas on how to tackle this problem efficiently?

Thanks,

sequence sequencing RNA-Seq • 2.3k views

updated 2.4 years ago by Ram 43k • written 9.7 years ago by sam ▴ 130

1

9.7 years ago

Alex Reynolds 35k

If the sequences are identical, you could use sequences as a hash table's keys and sequence-per-file counts as the hash table's values, incrementing the key's value only once per file.
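
A minimal Python sketch of that hash-table idea, for exact matches only. The helper names (`parse_fasta`, `count_files_per_sequence`) and the in-memory example data are illustrative assumptions, not part of any existing tool; in practice you would read each FASTA file from disk instead.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_files_per_sequence(fasta_texts):
    """Map each distinct sequence to the number of sources that contain it."""
    counts = defaultdict(int)
    for text in fasta_texts.values():
        # Collect into a set first so a sequence repeated within one
        # file increments its count only once for that file.
        seen_in_file = {seq.upper() for _, seq in parse_fasta(text)}
        for seq in seen_in_file:
            counts[seq] += 1
    return dict(counts)


# Hypothetical example data: file name -> FASTA content.
sources = {
    "a.fa": ">s1\nACGT\n>s2\nGGGG\n>s3\nACGT\n",
    "b.fa": ">x1\nAC\nGT\n>x2\nTTTT\n",
}
shared = count_files_per_sequence(sources)
```

Here `shared["ACGT"]` is 2 even though the sequence occurs twice in `a.fa` and is wrapped across two lines in `b.fa`.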

updated 2.4 years ago by Ram 43k • written 9.7 years ago by Alex Reynolds 35k

0

The sequences are not identical. Do you recommend setting a similarity threshold above which 2 sequences are treated as the same?

9.7 years ago by sam ▴ 130

0

Perhaps calculate the Levenshtein distance between pairs of sequences to build a distance matrix (or apply another distance metric). You might then apply a threshold whose stringency depends on the spread of distances in your matrix. If your population of strings is similar, the pool of distances will have low values and you'd perhaps want a stringent threshold. If the strings are disparate, the pool of distances will have larger values, and a more relaxed threshold could be applied to decide similarity. (BLAST will probably be the most efficient approach, not least because there are so many BLAST services out there to do it quickly.)
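
A rough Python sketch of the distance-matrix idea, assuming plain Levenshtein distance and a fixed cutoff. The function names and the `max_distance=2` threshold are illustrative assumptions; the cutoff should be chosen from the distances you actually observe. This pure-Python version compares every pair, so it is only practical for modest numbers of short sequences.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def distance_matrix(seqs):
    """Symmetric matrix of pairwise Levenshtein distances."""
    n = len(seqs)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(seqs[i], seqs[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


def similar_pairs(seqs, max_distance):
    """Index pairs (i, j), i < j, whose distance is within the threshold."""
    matrix = distance_matrix(seqs)
    n = len(seqs)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if matrix[i][j] <= max_distance]


# Hypothetical example sequences.
seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
matrix = distance_matrix(seqs)
pairs = similar_pairs(seqs, max_distance=2)
```

With these example sequences only the first two fall within the threshold, so `pairs` contains the single pair `(0, 1)`.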

9.7 years ago by Alex Reynolds 35k

0

9.7 years ago

Prakki Rama ★ 2.7k

Check whether this post suits your needs: "Create list of sequences present in multiple FASTA files".

9.7 years ago by Prakki Rama ★ 2.7k

score 4 · Accepted Answer · 2014-07-23

4

9.7 years ago

Pierre Lindenbaum 161k

BLAST each FASTA file against the others and compile the results?
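
One possible way to do the "compile" step, sketched in Python. It assumes the BLAST results are in the standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and, as a purely hypothetical convention, that each sequence ID was renamed to `source|id` before searching so the source file can be recovered from the ID. The identity and e-value cutoffs are placeholder assumptions to tune for your data.

```python
from collections import defaultdict


def compile_hits(tabular_text, min_identity=95.0, max_evalue=1e-10):
    """Map each query sequence to the sorted list of sources where it has a hit.

    Assumes 12-column tab-separated BLAST output and IDs of the
    hypothetical form 'source|id'.
    """
    sources_per_seq = defaultdict(set)
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, evalue = float(fields[2]), float(fields[10])
        if pident < min_identity or evalue > max_evalue:
            continue  # hit too weak to count as "similar"
        # A query is present in its own source and in the source of
        # every subject sequence it matches.
        sources_per_seq[qseqid].add(qseqid.split("|", 1)[0])
        sources_per_seq[qseqid].add(sseqid.split("|", 1)[0])
    return {seq: sorted(srcs) for seq, srcs in sources_per_seq.items()}


# Hypothetical tabular rows for illustration only.
rows = [
    "fileA|seq1\tfileB|seq7\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-50\t350",
    "fileA|seq1\tfileC|seq2\t97.0\t200\t6\t0\t1\t200\t1\t200\t1e-45\t330",
    "fileA|seq2\tfileB|seq9\t80.0\t150\t30\t0\t1\t150\t1\t150\t1e-5\t90",
]
result = compile_hits("\n".join(rows))
```

In this made-up example `fileA|seq1` is reported in three sources, while the weak `fileA|seq2` hit is filtered out. Counting the sources per sequence then gives the "appears in N files" figure the question asks for.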

9.7 years ago by Pierre Lindenbaum 161k