Question: Comparison between multiple fasta files in terms of presence or absence of certain sequences
gravatar for Bernhard.T.Werner
2.6 years ago by
Bernhard.T.Werner0 wrote:

Hallo folks.

I am trying to compare fasta files of predicted miRNAs with one another. This problem does not seem to be very hard for someone more skilled in bioinformatics than me but after hours of forum searches I could not solve the problem.

As an example I have 3 fasta files like:







From these files I would like to get a table that shows the presence or absence of each sequence. As you can see the same sequence could have a different ID. In the best case even if there is one mismatch between the sequences they should be counted as equal.

I am sorry to ask such a simple question but i couldnt figure it out by myself. So every help is welcome.

The closest solution i found so far is: Count Sequences In Multi Fasta File .

The script Kenosis provided does a part of the job:


use strict;
use warnings;
use File::Basename;
use Sort::Naturally;

local $/ = '>';
my ( %files, %query, %omit, @F );
$omit{$_}++ for ( $ARGV[0], basename $0);

while (<>) {
    $query{ $F[0] } = $F[1] if @F = split;

local $/ = "\n";
my @files = nsort @ARGV = grep !$omit{$_}, <*>;

while (<>) {
    next if /^>/;

print join( "\t", 'tag', 'seq', @files ), "\n";

for my $key ( nsort keys %query ) {
    my @seqCount;
    push @seqCount, $files{ $query{$key} }->{$_} || 0 for @files;
    print join( "\t", $key, $query{$key}, @seqCount ), "\n";


But unfortunately I am unable to adept it to my problem. Because the script compares the sequences by ID and does not allow mismatches it produces many duplicate sequences in the output.

rna-seq alignment sequence • 1.2k views
ADD COMMENTlink modified 2.6 years ago by WouterDeCoster43k • written 2.6 years ago by Bernhard.T.Werner0

Are these long sequences? How many sequences per file? Are they in single-line format, or multi-line?

ADD REPLYlink written 2.6 years ago by

The sequences are short <30bp and the files are in single line format.

ADD REPLYlink written 2.6 years ago by Bernhard.T.Werner0

When applying the 1 mismatch between sequences, does this count for length too?

ADD REPLYlink written 2.6 years ago by

Yes, this would be perfect.

ADD REPLYlink written 2.6 years ago by Bernhard.T.Werner0

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 2.6 years ago by WouterDeCoster43k

I usually like blastclust which is now a legacy software but this post recommending cd-hit might be useful How to cluster sequences present in multiple fasta files

ADD REPLYlink written 2.6 years ago by microfuge1.5k

Thx for the help. The cd-hit webserver seems to take forever for a small job and can only handle two files. So I dont think this will solve my problem

ADD REPLYlink written 2.6 years ago by Bernhard.T.Werner0
gravatar for
2.6 years ago by
Philadelphia, PA wrote:

Others may have command-line solutions. Here's one in python, untested, that I think should do what you need.

Fasta files:







First, make a file with all the unique sequences from each of your fasta files:

cat file1.fasta file2.fasta file3.fasta | grep -v '>' | sort | uniq > all.txt

Save this code as , and run as python > output.xls in the location that has the three fasta files, and the new file with all the sequences:

#!/usr/bin/env python
import regex
from collections import defaultdict
from itertools import izip

def open_fasta(file):
        """ Open fasta file, and store in dictionary, as {'sequence:header'} """
        s = {}
        with open(file, 'r') as fasta:
                for line in fasta:
                        if line.startswith('>'):
                                h = line.strip().split('>')[1]
                                n = next(fasta).strip()
                                s[n] = h
        return s

def hamming_distance(s1, s2):
    """ Count number of mismatched characters in equal length strings. """
    if len(s1) != len(s2): raise ValueError('string lengths do not match')
    return sum(a != b for a, b in izip(s1, s2))

def find_seqs(fh, p):
        """ Find sequences from 'all sequence list' that either match as identical sequence, or mismatch of 1, store as defaultdict [{sequence:[id1, id2, id3...]}] """
        seqs = open_fasta(fh)
        for i in seqs:
                if i in x:
                        match[i][p] = seqs[i]
                        for j in x:
                                if len(i) == len(j):
                                        if hamming_distance(i, j) <= 1:
                                                match[i][p] = seqs[i] + '+1'

# Open concated fasta file, containing unique sequences
with open('all.txt', 'r') as a:
        x = [line.strip() for line in a]

# empty defaultdict for reulsts
match = defaultdict(list)

for i in x:
        for n in range(3):

# Open each fasta, and search for sequences in 'all sequence list'
find_seqs('one.fasta', 0)
find_seqs('two.fasta', 1)
find_seqs('three.fasta', 2)

# Print results in tab delimited format
print 'Sequence\tFile1_ID\tFile2_ID\tFile3_ID'
for m in match:
        print m, '\t', '\t'.join(match[m])

Output from example above:

Sequence    File1_ID    File2_ID    File3_ID
AGTCGTCA    miRNA1  miRNA2  
TTGACT      miRNA1  miRNA5
AGTTCGTT    miRNA2      
AGTCGTGA    miRNA3      miRNA3
TTACATT             miRNA3
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by

Thank you for your help. The problem is it is not working properly. The output it produces is:

defaultdict(<type 'list'>, {'AGTCGTCA': ['miRNA1', 'miRNA2'], 'AGTTCGTT': ['miRNA2'], 'TTACATT': ['miRNA3'], 'TTGACT': ['miRNA1', 'miRNA5']})
Sequence    File1 ID    File2 ID    File3 ID

So all the ID are in the left most column.

I tried to change the script so it produces a csv-file with this result:

defaultdict(<type 'list'>, {'AGTCGTCA': ['miRNA1', 'miRNA2'], 'AGTTCGTT': ['miRNA2'], 'TTACATT': ['miRNA3'], 'TTGACT': ['miRNA1', 'miRNA5']})
Sequence,File1 ID,File2 ID,File3 ID

Thanks again for your efforts. But the script does not sort the IDs to the right column, or am I doing something wrong?

But dont get me wrong, this script is already helping a lot by filtering out nearly duplicate sequences. I only wish the IDs would be in a column for each file so I could figure out how many and which sequences are present in which file.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Bernhard.T.Werner0

@Bernhard I edited the code above, and added the example files I'm using. I am also now piping the output to a output.xls, which by the extension will open in excel. Let me know if this is in working order, and if so, please accept the answer.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by

Thank you so much! This is perfect^^

ADD REPLYlink written 2.6 years ago by Bernhard.T.Werner0
gravatar for TriS
2.6 years ago by
United States, Buffalo
TriS4.1k wrote:

two approaches you could try are in perl and unix perl: from previous question in biostar UNIX: AWK solution

ADD COMMENTlink written 2.6 years ago by TriS4.1k

These solutions are a bit over my head. I am not good at programing. I am able to use a script or to combine several scripts and programs. But unfortunately I only understand the simplest scripts. Thx for helping.

ADD REPLYlink written 2.6 years ago by Bernhard.T.Werner0

You can try using Sort in MS Excel.

ADD REPLYlink written 2.6 years ago by theobroma221.1k

Yes, I could do this. But I have files with hundreds of sequences and to compare them by hand takes ages. That is why I would like to find an automated solution.

ADD REPLYlink written 2.6 years ago by Bernhard.T.Werner0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1746 users visited in the last hour