Question: find similar sequences between different sources
0
4.8 years ago by
sam130
United States
sam130 wrote:

Hello,

I have several sets of sequences from different sources: around 20 FASTA files, one per source, each containing around 1000 sequences.

I'm interested in identifying sequences that are similar and appear in more than one FASTA file. In other words, I might find that sequence A appears in all 20 FASTA files, sequence B appears in only 10, and sequence C appears in only 2.

Are there any tools that could do this? If not, any ideas how to tackle this problem in an efficient way?

Thanks,

sequencing rna-seq sequence • 1.4k views
modified 4.8 years ago by Prakki Rama • written 4.8 years ago by sam130
4
4.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

BLAST each FASTA file against every other one and compile the results?
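A minimal sketch of the "compile the results" step, assuming the pairwise searches were run with BLAST+ tabular output (`-outfmt 6`, whose default columns start with qseqid, sseqid, pident, length). The `source|name` ID convention, the function name `compile_hits`, and the identity and length cutoffs are all hypothetical choices for illustration, not something from this thread.

```python
from collections import defaultdict

def compile_hits(tabular_lines, min_identity=95.0, min_length=50):
    """Map each query ID to the sorted list of sources it was found in.

    Assumes sequence IDs look like 'source|name' (a hypothetical convention)
    and that lines follow the default BLAST+ -outfmt 6 column order.
    """
    sources = defaultdict(set)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        if pident < min_identity or length < min_length:
            continue  # hit is too weak to count as "the same sequence"
        q_source = qseqid.split("|", 1)[0]
        s_source = sseqid.split("|", 1)[0]
        # The query's own source counts too, plus the source of the hit.
        sources[qseqid].update((q_source, s_source))
    return {seq: sorted(found) for seq, found in sources.items()}

# Made-up example rows, for illustration only.
demo_rows = [
    "f1|A\tf2|X\t98.5\t120\t2\t0\t1\t120\t1\t120\t1e-50\t200",
    "f1|A\tf3|Y\t96.0\t110\t4\t0\t1\t110\t5\t114\t1e-40\t180",
    "f1|B\tf2|Z\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-10\t60",
]
hits = compile_hits(demo_rows)
```

With the made-up rows above, `f1|A` is reported in three sources, while `f1|B` is dropped because its only hit falls below the identity cutoff.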

written 4.8 years ago by Pierre Lindenbaum
1
4.8 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

If the sequences are identical, you could use sequences as a hash table's keys and sequence-per-file counts as the hash table's values, incrementing the key's value only once per file.
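A rough sketch of that hash-table idea, assuming exact matches (after upper-casing) and plain FASTA input. The helper names `read_fasta` and `count_files_per_sequence` and the in-memory demo data are made up for illustration; real files would be opened with `open(path)` instead of `StringIO`.

```python
from collections import defaultdict
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def count_files_per_sequence(named_handles):
    """Map each sequence to the number of files it appears in."""
    counts = defaultdict(int)
    for _name, handle in named_handles:
        # A set collapses duplicates within one file, so each file
        # increments a given sequence's count at most once.
        seen = {seq.upper() for _header, seq in read_fasta(handle)}
        for seq in seen:
            counts[seq] += 1
    return dict(counts)

# Tiny in-memory demo standing in for two FASTA files.
demo = [
    ("a.fasta", StringIO(">s1\nACGT\n>s2\nGGCC\n>s3\nACGT\n")),
    ("b.fasta", StringIO(">x\nacgt\n>y\nTTTT\n")),
]
file_counts = count_files_per_sequence(demo)
```

In the demo, `ACGT` occurs twice in the first file but is counted once for it, so its final count is 2 (one per file).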

written 4.8 years ago by Alex Reynolds

The sequences are not identical. Do you recommend setting a similarity threshold above which two sequences are treated as the same?

written 4.8 years ago by sam130

Perhaps calculate the Levenshtein distance between pairs of sequences to build a distance matrix (or apply another distance metric). You could then apply a threshold whose stringency depends on the spread of distances in your matrix. If your population of strings is similar, the pool of distances will have low values and you would probably want a stringent threshold. If the strings are disparate, the pool of distances will have larger values, and a more relaxed threshold could be applied to decide similarity. (BLAST will probably be the most efficient approach, not least because there are so many BLAST services out there to do it quickly.)
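To make the distance-plus-threshold idea concrete, here is a small sketch using a standard dynamic-programming Levenshtein distance. The function names, the `files` dictionary layout, and the threshold of 1 are illustrative assumptions. The all-pairs comparison is quadratic, so for roughly 20 x 1000 sequences this would be slow in pure Python.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def files_with_similar(query, files, max_distance):
    """Names of files holding at least one sequence within max_distance of query."""
    return sorted(
        name for name, seqs in files.items()
        if any(levenshtein(query, s) <= max_distance for s in seqs)
    )

# Hypothetical data: file name -> list of sequences.
files = {
    "a.fasta": ["ACGTACGT"],
    "b.fasta": ["ACGTACGA"],
    "c.fasta": ["TTTTTTTT"],
}
matches = files_with_similar("ACGTACGT", files, max_distance=1)
```

Here the query is found in `a.fasta` (exact) and `b.fasta` (one substitution) but not in `c.fasta`. Raising `max_distance` relaxes the threshold in the way described above.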

modified 4.8 years ago • written 4.8 years ago by Alex Reynolds
0
4.8 years ago by
Prakki Rama2.2k
Singapore
Prakki Rama2.2k wrote:
Check whether this post suits your needs: "Create list of sequences present in multiple FASTA files".
written 4.8 years ago by Prakki Rama