find similar sequences between different sources
9.7 years ago
sam ▴ 130

Hello,

I have several sets of sequences from different sources: around 20 FASTA files, one per source, each containing around 1000 sequences.

I'm interested in identifying sequences that are similar and appear in more than one FASTA file. In other words, I might find that sequence A appears in all 20 FASTA files, sequence B appears in only 10, and sequence C appears in only 2.

Are there any tools that could do this? If not, any ideas on how to tackle this problem efficiently?

Thanks,

sequence sequencing RNA-Seq • 2.3k views
9.7 years ago

BLAST each FASTA file against the others and compile the results?
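As a rough sketch of the "compile the results" step: the function below reads BLAST tabular output (`-outfmt 6`, where the first three columns are query ID, subject ID and percent identity) and records, for each query, the set of sources in which it has a hit. It assumes that every sequence ID was prefixed with its source name as `source|id` before searching; that naming convention, the function name and the 90% identity cutoff are illustrative assumptions, not something given in this thread.

```python
from collections import defaultdict


def sources_per_query(blast_lines, min_identity=90.0):
    """Map each query ID to the set of sources where it has a hit.

    Assumes BLAST tabular (-outfmt 6) lines and sequence IDs of the
    hypothetical form "source|id".
    """
    hits = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, identity = fields[0], fields[1], float(fields[2])
        if identity < min_identity:
            continue  # hit is below the chosen similarity cutoff
        hits[query].add(query.split("|", 1)[0])    # the query's own source
        hits[query].add(subject.split("|", 1)[0])  # the source it matched
    return dict(hits)
```

The number of files a sequence appears in is then `len(sources)` for each query. In practice you would probably also filter on alignment length or e-value, which are further columns in the same output.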

9.7 years ago

If the sequences are identical, you could use each sequence as a hash table key and the number of files containing it as the value, incrementing a key's value at most once per file.
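A minimal sketch of that idea in Python, for exact matches only. The function names and the small FASTA reader are illustrative assumptions (a library parser would work equally well), and sequences are upper-cased so that case differences do not count as mismatches.

```python
from collections import Counter


def read_fasta(path):
    """Yield each sequence (header dropped, lines joined) from a FASTA file."""
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    yield "".join(chunks)
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        yield "".join(chunks)


def count_files_per_sequence(paths):
    """Return a Counter mapping sequence -> number of files containing it."""
    counts = Counter()
    for path in paths:
        # A set keeps one copy of each sequence per file, so the count
        # goes up at most once per file.
        counts.update({seq.upper() for seq in read_fasta(path)})
    return counts
```

Sequences with a count above 1 appear in more than one file, e.g. `[s for s, n in counts.items() if n > 1]`.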


The sequences are not identical. Do you recommend setting a similarity threshold above which two sequences are treated as the same?


Perhaps calculate the Levenshtein distance between pairs of sequences to build a distance matrix (or apply another distance metric). You might then apply a threshold whose stringency depends on the spread of distances in your matrix. If your population of strings is similar, the pool of distances will have low values and you'd perhaps want a stringent threshold. If the strings are disparate, the pool of distances will have larger values, and a relaxed threshold could be applied to decide similarity. (BLAST will probably be the most efficient approach, not least because there are so many BLAST services out there to do it quickly.)
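To make the distance-matrix idea concrete, here is a small pure-Python sketch. The function names and the `max_distance` parameter are illustrative assumptions, and the threshold itself is something you would choose after looking at the distribution of distances, as described above.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_matrix(seqs):
    """Symmetric matrix of pairwise Levenshtein distances."""
    n = len(seqs)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levenshtein(seqs[i], seqs[j])
    return matrix


def similar_pairs(seqs, max_distance):
    """Index pairs (i, j), i < j, whose distance is within the threshold."""
    matrix = distance_matrix(seqs)
    return [(i, j)
            for i in range(len(seqs))
            for j in range(i + 1, len(seqs))
            if matrix[i][j] <= max_distance]
```

Note that this compares every pair: with roughly 20 files of 1000 sequences each, that is on the order of 2 x 10^8 comparisons, which is one reason BLAST is likely to be the more practical route.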

9.7 years ago
Prakki Rama ★ 2.7k
Check whether this post suits your needs: "Create list of sequences present in multiple FASTA files".