Beginner Question: Possible Use Biojava Or Something Similar To Compare Two Sequences And Get Just A Percent Number Of Differences
1
2
Entering edit mode
12.3 years ago
User 2263 ▴ 30

I have been using biojava and was able to load fasta files.

Does it make sense or do you think bioljava or something similar does a comparison between two sequences and just gives a percent number of similarities? Does that even make sense?

Here is the code I have been using for alignment, copy-pasted from the BioJava3 Cookbook:

package org.biojava3.alignment;

import java.net.URL;

import org.biojava3.alignment.Alignments.PairwiseSequenceAlignerType;
import org.biojava3.alignment.template.SequencePair;
import org.biojava3.alignment.template.SubstitutionMatrix;
import org.biojava3.core.sequence.ProteinSequence;
import org.biojava3.core.sequence.compound.AminoAcidCompound;
import org.biojava3.core.sequence.io.FastaReaderHelper;

public static void main(String[] args) {
    String[] ids = new String[] {"Q21691", "Q21495", "O48771"};
    try {
        alignPairLocal(ids[0], ids[1]);
    } catch (Exception e){
        e.printStackTrace();
    }
}

private static void alignPairLocal(String id1, String id2) throws Exception {
    ProteinSequence s1 = getSequenceForId(id1), s2 = getSequenceForId(id2);
    SubstitutionMatrix<AminoAcidCompound> matrix = new SimpleSubstitutionMatrix<AminoAcidCompound>();
    SequencePair<ProteinSequence, AminoAcidCompound=""> pair = Alignments.getPairwiseAlignment(s1, s2,
            PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix);
    System.out.printf("%n%s vs %s%n%s", pair.getQuery().getAccession(), pair.getTarget().getAccession(), pair);
}

private static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
    URL uniprotFasta = new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtId));
    ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId);
    System.out.printf("id : %s %s%n%s%n", uniProtId, seq, seq.getOriginalHeader());
    return seq;
}
biojava biojava • 5.0k views
ADD COMMENT
1
Entering edit mode

Michael -- yes, BioJava 3 is where active development is happening now. The only downside is module coverage since it doesn't contain everything 1.8 did, but it already has quite a bit of functionality.

ADD REPLY
0
Entering edit mode

I think that makes perfectly sense to get the percent sequence identity. I don't have bioperl here, but isn't there a method like ' getPercentIdenity'?

ADD REPLY
0
Entering edit mode

Btw, biojava doesn't do the alignments itself you have to use an external program.

ADD REPLY
0
Entering edit mode

Michael, BioJava 3 does have support for global/local pairwise and multiple alignment: http://biojava.org/wiki/BioJava:CookBook3:PSA with examples in the Cookbook: http://biojava.org/wiki/BioJava:CookBook3.0

ADD REPLY
0
Entering edit mode

berlinbrowndev, you are looking for 'getNumIdenticals' or 'getNumSimilars' in the returned SequencePair http://www.biojava.org/docs/api/org/biojava3/alignment/template/SequencePair.html

ADD REPLY
0
Entering edit mode

ok, I was looking at biojava 1, Brad do you know if biojava3 is already 'in good shape' to be used?

ADD REPLY
0
Entering edit mode

Hey Berlin, just saw that you had copy pasted the code from the example in the cookbook, had you posted your sources, and pasted the complete code (missed the import part), you could have saved us some time here. Please remember next time, sorry no +1 here...

ADD REPLY
0
Entering edit mode

+1 anyway for bringing up the biojava3 topic

ADD REPLY
2
Entering edit mode
12.3 years ago
Michael 54k

Hi,

I think your question can be broken down into two questions:

  1. How to compute percent identity (PID) from a gapped local alignment and does it make sense?

  2. And how could this be done in BioJava?

Regarding 1: I would think it is hard to define a consistent way of defining the formula for this, because there are many possible ways to define it. I found this description of the problem, please have a look. This point I find most relevant:

PID is also strongly length dependent, so, the shorter a pair of sequences is, the higher the PID you might expect by chance.

Regarding 2: I played a bit with the new biojava, and the example seems to work nicely, here you have the methods you need.

Add the following to the method alignLocal:

double pid = 100 * pair.getNumIdenticals() / pair.getLength();
System.out.println("PID: " + pid);

I think dividing the number of matches by the total length of the alignment is a reasonable approach.


EDIT: Ok, ignore the biojava 1 example, because it doesn't work there anyways.

Regarding 2: Given you wish to use BLAST and parse the result, here is an example BLAST parser in the BioJava Cookbook. The following code fragment iterates over the hits of type SeqSimilaritySearchHit.

for (Iterator k = result.getHits().iterator(); k.hasNext(); ) {
              SeqSimilaritySearchHit hit =
                  (SeqSimilaritySearchHit)k.next();
              System.out.print("\tmatch: "+hit.getSubjectID());
              System.out.println("\te score: "+hit.getEValue());
              // maybe use the score instead?
              System.out.println("\te score: "+hit.getScore());

}

When you look at the API, there is no method getPercentIdentity, possibly for reasons in 1. Now, if you wanted to compute that yourself, while there are methods to get the query and subject start and end, there is no method that returns the number of exact matches or even the alignement text output as known from standard BLAST output.

Conclusion: Surprisingly, it is not possible to do this using BLAST and BioJava alone (you needed to parse the textual BLAST output). A possible alternative might be to use the Blast score.

ADD COMMENT

Login before adding your answer.

Traffic: 2523 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6