What Is A Good Way To Implement Getting A Consensus Sequence In Java?
11.1 years ago
Brandstaetter ▴ 270

I have the following problem:

  • I have two strings of DNA sequences (consisting of A, C, G and T) that differ in one or two positions.
  • Finding the differences is trivial, so let's ignore that part.
  • For each difference, I want to get the consensus symbol (e.g. M for A or C) that represents both possibilities.

I know I could just make a huge if-cascade but I guess that's not only ugly and hard to maintain, but also slow.

What is a fast, easy to maintain way to implement that? Some kind of lookup table perhaps, or a matrix for the combinations? Any code samples would be greatly appreciated. I would have used Biojava, but the current version I am already using does not offer that functionality (or I haven't found it yet...).

This question is also on Stack Overflow.

Update: snippet of the solution, thanks to Michael Dondrup (you can see the symmetry in the matrix, which makes error checking much easier):

enum lut {
     AA('A'), AC('M'), AG('R'), AT('W'), AR('R'), AY('H'), AK('D'), AM('M'), AS('V'), AW('W'), AB('N'), AD('D'), AH('H'), AV('V'), AN('N'),
     CA('M'), CC('C'), CG('S'), CT('Y'), CR('V'), CY('Y'), CK('B'), CM('M'), CS('S'), CW('H'), CB('B'), CD('N'), CH('H'), CV('V'), CN('N'),
     GA('R'), GC('S'), GG('G'), GT('K'), GR('R'), GY('B'), GK('K'), GM('V'), GS('S'), GW('D'), GB('B'), GD('D'), GH('N'), GV('V'), GN('N'),
     TA('W'), TC('Y'), TG('K'), TT('T'), TR('D'), TY('Y'), TK('K'), TM('H'), TS('B'), TW('W'), TB('B'), TD('D'), TH('H'), TV('N'), TN('N'),
     RA('R'), RC('V'), RG('R'), RT('D'), RR('R'), RY('N'), RK('D'), RM('V'), RS('V'), RW('D'), RB('N'), ...
     YA('H'), YC('Y'), YG('B'), YT('Y'), YR('N'), YY('Y'), YK('B'), YM('H'), YS('B'), YW('H'), YB('B'), ...
     KA('D'), KC('B'), KG('K'), KT('K'), KR('D'), KY('B'), KK('K'), KM('N'), KS('B'), KW('D'), KB('B'), ...
     MA('M'), MC('M'), MG('V'), MT('H'), MR('V'), MY('H'), MK('N'), MM('M'), MS('V'), MW('H'), MB('N'), ...
     SA('V'), SC('S'), SG('S'), ST('B'), SR('V'), SY('B'), SK('B'), SM('V'), SS('S'), SW('N'), SB('B'), ...
     WA('W'), WC('H'), WG('D'), WT('W'), WR('D'), WY('H'), WK('D'), WM('H'), WS('N'), WW('W'), WB('N'), ...
     BA('N'), BC('B'), BG('B'), BT('B'), BR('N'), BY('B'), BK('B'), BM('N'), BS('B'), BW('N'), BB('B'), ...
     DA('D'), ...
     HA('H'), ...
     VA('V'), ...
     NA('N'), NC('N'), NG('N'), NT('N'), NR('N'), NY('N'), NK('N'), NM('N'), NS('N'), NW('N'), NB('N'), ND('N'), NH('N'), NV('N'), NN('N');

     char consensusChar = 'X';

     lut(char c) {
         consensusChar = c;
     }

     char getConsensusChar() {
         return consensusChar;
     }
}


char getConsensus(char a, char b) {
    return lut.valueOf("" + a + b).getConsensusChar();
}
11.1 years ago

You can do this efficiently with bitwise operators. Treat each position of the consensus string as a 4-bit field with one bit per base, e.g. bit 1 for A, bit 2 for T, bit 3 for G and bit 4 for C. Use | to set the bits for every base seen at that position. At the end, use the numeric value of each bit field as an index into a lookup table of ambiguity codes.
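A minimal sketch of this idea, assuming the bit assignment from the example above (A=1, T=2, G=4, C=8). The class name `IupacBits`, the method names and the `CODES` table are hypothetical and only serve the illustration; they are not part of Biojava or of the poster's code.

```java
/** Sketch of a bit-field lookup for IUPAC nucleotide codes (all names are illustrative). */
final class IupacBits {
    // Index = 4-bit field with A=1, T=2, G=4, C=8; value = IUPAC symbol for that set of bases.
    // Index 0 (no base at all) has no IUPAC symbol, so it holds the placeholder '-'.
    private static final String CODES = "-ATWGRKDCMYHSVBN";

    /** Returns the 4-bit field for an IUPAC symbol, e.g. 'R' (A or G) gives 1 | 4 = 5. */
    static int toMask(char symbol) {
        int mask = CODES.indexOf(Character.toUpperCase(symbol));
        if (mask <= 0) { // -1: unknown symbol, 0: the placeholder itself
            throw new IllegalArgumentException("Not an IUPAC nucleotide code: " + symbol);
        }
        return mask;
    }

    /** Consensus of two symbols: OR the two bit fields, then map the result back to a symbol. */
    static char consensus(char a, char b) {
        return CODES.charAt(toMask(a) | toMask(b));
    }

    /** Position-wise consensus of two sequences of equal length. */
    static String consensus(String seq1, String seq2) {
        if (seq1.length() != seq2.length()) {
            throw new IllegalArgumentException("Sequences must have equal length");
        }
        StringBuilder out = new StringBuilder(seq1.length());
        for (int i = 0; i < seq1.length(); i++) {
            out.append(consensus(seq1.charAt(i), seq2.charAt(i)));
        }
        return out.toString();
    }
}
```

With this bit assignment, `IupacBits.consensus("ACGT", "ACTT")` gives `ACKT`. Because ambiguity codes are themselves just sets of bases, the same 16-character table also handles already-ambiguous input such as R and Y without any extra entries.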

11.1 years ago

There are a lot of more or less clean options:

Iterate over the strings by position; then, for each position where they differ, use one of the following.

Java 7 supports strings in switch statements, so you can concatenate the two differing characters into a string and use a construct like:

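 // substring(i, i + 1) extracts the single character at position i;
 // substring(i) would return the whole rest of the sequence
 String diff = seq1.substring(i, i + 1) + seq2.substring(i, i + 1);
 String out;
 switch (diff) {
     case "GT": out = "K";
                break;
     case "AT": out = "W";
                break;
 ...
 }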

Or:

You can make an enum type:

enum Ambigous {
        AA("A"), AT("W"), TA("W"), GT("K"); // ...
        String ambigousLetter = "";

        Ambigous(String s) {
            ambigousLetter = s;
        }

        String getAmbigousLetter() {
            return ambigousLetter;
        }
}
// substring(i, i + 1) yields the one-character string at position i
String out = Ambigous.valueOf(seq1.substring(i, i + 1) + seq2.substring(i, i + 1)).getAmbigousLetter();

Or: use a HashMap:

import java.util.HashMap;
...

HashMap<String, String> ambCodes = new HashMap<String, String>();
// fill the map:
ambCodes.put("AT", "W");

// later on:
String out = ambCodes.get("AT");
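As a fuller illustration of the map approach, here is a small sketch that only covers the four plain bases A, C, G and T. The class name `AmbiguityMap` and the key-ordering trick are assumptions added for this example, not something from the original answer: ordering the two characters before the lookup means that "AT" and "TA" share one entry, so the map needs six entries instead of twelve.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a map-based lookup for pairs of plain bases (all names are illustrative). */
final class AmbiguityMap {
    private static final Map<String, String> AMB_CODES = new HashMap<String, String>();

    static {
        // One entry per unordered pair of distinct bases, key in alphabetical order.
        AMB_CODES.put("AC", "M");
        AMB_CODES.put("AG", "R");
        AMB_CODES.put("AT", "W");
        AMB_CODES.put("CG", "S");
        AMB_CODES.put("CT", "Y");
        AMB_CODES.put("GT", "K");
    }

    /** Returns the ambiguity code for two bases; equal symbols are returned unchanged. */
    static String consensus(char a, char b) {
        if (a == b) {
            return String.valueOf(a);
        }
        // Put the smaller character first so that both orders hit the same key.
        String key = a < b ? "" + a + b : "" + b + a;
        String code = AMB_CODES.get(key);
        if (code == null) {
            throw new IllegalArgumentException("No ambiguity code for pair: " + key);
        }
        return code;
    }
}
```

Extending this to ambiguous input symbols (R, Y, ...) would need many more entries, which is where the full enum table or a bit-field lookup becomes more practical.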

I'll use the enum, because it is the cleanest one to maintain. Thanks!


I think that's the cleanest choice, too. Good luck.
