Question: Beginner Question: Possible Use Biojava Or Something Similar To Compare Two Sequences And Get Just A Percent Number Of Differences
2
gravatar for User 2263
7.3 years ago by
User 226330
User 226330 wrote:

I have been using biojava and was able to load fasta files.

Does it make sense or do you think bioljava or something similar does a comparison between two sequences and just gives a percent number of similarities? Does that even make sense?

Here is the code I have been using for alignment, copy-pasted from the BioJava3 Cookbook:

package org.biojava3.alignment;

import java.net.URL;

import org.biojava3.alignment.Alignments.PairwiseSequenceAlignerType;
import org.biojava3.alignment.template.SequencePair;
import org.biojava3.alignment.template.SubstitutionMatrix;
import org.biojava3.core.sequence.ProteinSequence;
import org.biojava3.core.sequence.compound.AminoAcidCompound;
import org.biojava3.core.sequence.io.FastaReaderHelper;

public static void main(String[] args) {
    String[] ids = new String[] {"Q21691", "Q21495", "O48771"};
    try {
        alignPairLocal(ids[0], ids[1]);
    } catch (Exception e){
        e.printStackTrace();
    }
}

private static void alignPairLocal(String id1, String id2) throws Exception {
    ProteinSequence s1 = getSequenceForId(id1), s2 = getSequenceForId(id2);
    SubstitutionMatrix<AminoAcidCompound> matrix = new SimpleSubstitutionMatrix<AminoAcidCompound>();
    SequencePair<ProteinSequence, AminoAcidCompound=""> pair = Alignments.getPairwiseAlignment(s1, s2,
            PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix);
    System.out.printf("%n%s vs %s%n%s", pair.getQuery().getAccession(), pair.getTarget().getAccession(), pair);
}

private static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
    URL uniprotFasta = new URL(String.format("http://www.uniprot.org/uniprot/%s.fasta", uniProtId));
    ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(uniprotFasta.openStream()).get(uniProtId);
    System.out.printf("id : %s %s%n%s%n", uniProtId, seq, seq.getOriginalHeader());
    return seq;
}
biojava • 3.8k views
ADD COMMENTlink modified 7.3 years ago by Michael Dondrup46k • written 7.3 years ago by User 226330
1

Michael -- yes, BioJava 3 is where active development is happening now. The only downside is module coverage since it doesn't contain everything 1.8 did, but it already has quite a bit of functionality.

ADD REPLYlink written 7.3 years ago by Brad Chapman9.4k

I think that makes perfectly sense to get the percent sequence identity. I don't have bioperl here, but isn't there a method like ' getPercentIdenity'?

ADD REPLYlink written 7.3 years ago by Michael Dondrup46k

Btw, biojava doesn't do the alignments itself you have to use an external program.

ADD REPLYlink written 7.3 years ago by Michael Dondrup46k

Michael, BioJava 3 does have support for global/local pairwise and multiple alignment: http://biojava.org/wiki/BioJava:CookBook3:PSA with examples in the Cookbook: http://biojava.org/wiki/BioJava:CookBook3.0

ADD REPLYlink written 7.3 years ago by Brad Chapman9.4k

berlinbrowndev, you are looking for 'getNumIdenticals' or 'getNumSimilars' in the returned SequencePair http://www.biojava.org/docs/api/org/biojava3/alignment/template/SequencePair.html

ADD REPLYlink written 7.3 years ago by Brad Chapman9.4k

ok, I was looking at biojava 1, Brad do you know if biojava3 is already 'in good shape' to be used?

ADD REPLYlink written 7.3 years ago by Michael Dondrup46k

Hey Berlin, just saw that you had copy pasted the code from the example in the cookbook, had you posted your sources, and pasted the complete code (missed the import part), you could have saved us some time here. Please remember next time, sorry no +1 here...

ADD REPLYlink written 7.3 years ago by Michael Dondrup46k

+1 anyway for bringing up the biojava3 topic

ADD REPLYlink written 7.3 years ago by Michael Dondrup46k
2
gravatar for Michael Dondrup
7.3 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

Hi,

I think your question can be broken down into two questions:

  1. How to compute percent identity (PID) from a gapped local alignment and does it make sense?

  2. And how could this be done in BioJava?

Regarding 1: I would think it is hard to define a consistent way of defining the formula for this, because there are many possible ways to define it. I found this description of the problem, please have a look. This point I find most relevant:

PID is also strongly length dependent, so, the shorter a pair of sequences is, the higher the PID you might expect by chance.

Regarding 2: I played a bit with the new biojava, and the example seems to work nicely, here you have the methods you need.

Add the following to the method alignLocal:

double pid = 100 * pair.getNumIdenticals() / pair.getLength();
System.out.println("PID: " + pid);

I think dividing the number of matches by the total length of the alignment is a reasonable approach.


EDIT: Ok, ignore the biojava 1 example, because it doesn't work there anyways.

Regarding 2: Given you wish to use BLAST and parse the result, here is an example BLAST parser in the BioJava Cookbook. The following code fragment iterates over the hits of type SeqSimilaritySearchHit.

for (Iterator k = result.getHits().iterator(); k.hasNext(); ) {
              SeqSimilaritySearchHit hit =
                  (SeqSimilaritySearchHit)k.next();
              System.out.print("\tmatch: "+hit.getSubjectID());
              System.out.println("\te score: "+hit.getEValue());
              // maybe use the score instead?
              System.out.println("\te score: "+hit.getScore());

}

When you look at the API, there is no method getPercentIdentity, possibly for reasons in 1. Now, if you wanted to compute that yourself, while there are methods to get the query and subject start and end, there is no method that returns the number of exact matches or even the alignement text output as known from standard BLAST output.

Conclusion: Surprisingly, it is not possible to do this using BLAST and BioJava alone (you needed to parse the textual BLAST output). A possible alternative might be to use the Blast score.

ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by Michael Dondrup46k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1604 users visited in the last hour