What Is A Good Way To Implement Getting A Consensus Sequence In Java?
2
2
11.1 years ago
Brandstaetter ▴ 270

I have the following problem:

• I have 2 strings of DNA sequences (consisting of ACGT) that differ in one or two positions.
• Finding the differences is trivial, so let's ignore that part.
• For each difference, I want to get the consensus symbol (e.g. M for A or C) that represents both possibilities.

I know I could just make a huge if-cascade but I guess that's not only ugly and hard to maintain, but also slow.

What is a fast, easy to maintain way to implement that? Some kind of lookup table perhaps, or a matrix for the combinations? Any code samples would be greatly appreciated. I would have used Biojava, but the current version I am already using does not offer that functionality (or I haven't found it yet...).

This question is also posted on Stack Overflow.

Update: snippet of the solution, thanks to Michael Dondrup (you can see the symmetry in the matrix, which makes error checking much easier):

// Each constant name is a pair of IUPAC symbols; its value is their consensus symbol.
enum lut {
    AA('A'), AC('M'), AG('R'), AT('W'), AR('R'), AY('H'), AK('D'), AM('M'), AS('V'), AW('W'), AB('N'), AD('D'), AH('H'), AV('V'), AN('N'),
    CA('M'), CC('C'), CG('S'), CT('Y'), CR('V'), CY('Y'), CK('B'), CM('M'), CS('S'), CW('H'), CB('B'), CD('N'), CH('H'), CV('V'), CN('N'),
    GA('R'), GC('S'), GG('G'), GT('K'), GR('R'), GY('B'), GK('K'), GM('V'), GS('S'), GW('D'), GB('B'), GD('D'), GH('N'), GV('V'), GN('N'),
    TA('W'), TC('Y'), TG('K'), TT('T'), TR('D'), TY('Y'), TK('K'), TM('H'), TS('B'), TW('W'), TB('B'), TD('D'), TH('H'), TV('N'), TN('N'),
    RA('R'), RC('V'), RG('R'), RT('D'), RR('R'), RY('N'), RK('D'), RM('V'), RS('V'), RW('D'), RB('N'), ...
    YA('H'), YC('Y'), YG('B'), YT('Y'), YR('N'), YY('Y'), YK('B'), YM('H'), YS('B'), YW('H'), YB('B'), ...
    KA('D'), KC('B'), KG('K'), KT('K'), KR('D'), KY('B'), KK('K'), KM('N'), KS('B'), KW('D'), KB('B'), ...
    MA('M'), MC('M'), MG('V'), MT('H'), MR('V'), MY('H'), MK('N'), MM('M'), MS('V'), MW('H'), MB('N'), ...
    SA('V'), SC('S'), SG('S'), ST('B'), SR('V'), SY('B'), SK('B'), SM('V'), SS('S'), SW('N'), SB('B'), ...
    WA('W'), WC('H'), WG('D'), WT('W'), WR('D'), WY('H'), WK('D'), WM('H'), WS('N'), WW('W'), WB('N'), ...
    BA('N'), BC('B'), BG('B'), BT('B'), BR('N'), BY('B'), BK('B'), BM('N'), BS('B'), BW('N'), BB('B'), ...
    DA('D'), ...
    HA('H'), ...
    VA('V'), ...
    NA('N'), NC('N'), NG('N'), NT('N'), NR('N'), NY('N'), NK('N'), NM('N'), NS('N'), NW('N'), NB('N'), ND('N'), NH('N'), NV('N'), NN('N');

    private final char consensusChar;

    lut(char c) {
        consensusChar = c;
    }

    char getConsensusChar() {
        return consensusChar;
    }
}

char getConsensus(char a, char b) {
    // valueOf throws IllegalArgumentException if the pair is not a constant of the enum
    return lut.valueOf("" + a + b).getConsensusChar();
}
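
The symmetry noted in the update can also be checked mechanically rather than by eye. The following is a minimal sketch of that idea, restricted to the four plain bases and using a 2D array ("a matrix for the combinations") in place of the enum. The class name `MatrixConsensus` and its methods are hypothetical, chosen only for illustration:

```java
final class MatrixConsensus {
    // Row and column order of the table below.
    private static final String BASES = "ACGT";

    // Consensus symbols for every ordered pair of plain bases.
    private static final char[][] TABLE = {
        //        A    C    G    T
        /* A */ {'A', 'M', 'R', 'W'},
        /* C */ {'M', 'C', 'S', 'Y'},
        /* G */ {'R', 'S', 'G', 'K'},
        /* T */ {'W', 'Y', 'K', 'T'},
    };

    // Looks up the consensus symbol of two plain bases.
    static char consensus(char a, char b) {
        int row = BASES.indexOf(a);
        int col = BASES.indexOf(b);
        if (row < 0 || col < 0) {
            throw new IllegalArgumentException("Expected A, C, G or T but got: " + a + ", " + b);
        }
        return TABLE[row][col];
    }

    // Returns true if TABLE[i][j] equals TABLE[j][i] for every pair,
    // which catches typos made while filling in the table by hand.
    static boolean isSymmetric() {
        for (int i = 0; i < TABLE.length; i++) {
            for (int j = 0; j < i; j++) {
                if (TABLE[i][j] != TABLE[j][i]) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

With this table, `MatrixConsensus.consensus('G', 'T')` evaluates to `'K'`, and `isSymmetric()` can be called once at startup or from a unit test.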

sequence java biojava • 2.9k views
1
11.1 years ago

You can make something efficient using bitwise operators. Consider your consensus string as an array of bit fields. Use | to set bits at e.g. pos 1 for A, pos 2 for T, pos 3 for G and pos 4 for C. At the end you can use the numeric values for each bit field to look up the correct ambiguity code in a bit map.
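
A minimal sketch of this bitwise idea, assuming the bit assignment suggested above (A = 1, T = 2, G = 4, C = 8). The class name `BitmaskConsensus`, its methods, and the use of a 16-character string as the lookup table are illustrative assumptions, not part of any library:

```java
final class BitmaskConsensus {
    // Index = bit mask (A=1, T=2, G=4, C=8), value = IUPAC code for that set of bases.
    // Index 0 is the empty set, which has no IUPAC code; '-' is only a placeholder.
    private static final String CODES = "-ATWGRKDCMYHSVBN";

    // Converts an IUPAC symbol (plain or ambiguous) into its bit mask.
    static int toMask(char symbol) {
        int mask = CODES.indexOf(Character.toUpperCase(symbol));
        if (mask <= 0) {
            throw new IllegalArgumentException("Not an IUPAC nucleotide code: " + symbol);
        }
        return mask;
    }

    // The consensus of two symbols is the union of their base sets: a bitwise OR.
    static char consensus(char a, char b) {
        return CODES.charAt(toMask(a) | toMask(b));
    }

    // Applies the per-symbol consensus to two sequences of equal length.
    static String consensus(String seq1, String seq2) {
        if (seq1.length() != seq2.length()) {
            throw new IllegalArgumentException("Sequences must have the same length");
        }
        StringBuilder out = new StringBuilder(seq1.length());
        for (int i = 0; i < seq1.length(); i++) {
            out.append(consensus(seq1.charAt(i), seq2.charAt(i)));
        }
        return out.toString();
    }
}
```

Because the OR of two masks is again a valid index, ambiguous inputs work without extra table entries: `consensus('R', 'Y')` gives `'N'`, and `consensus("ACGT", "ACTT")` gives `"ACKT"`.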

1
11.1 years ago

There are several more or less clean options.

Iterate over the strings by position; for each position where they differ, use one of the following.

Java 7 supports strings in switch statements, so you can concatenate the two differing characters into a string and use a construct like:

// substring(i, i + 1) takes only the character at position i
String diff = seq1.substring(i, i + 1) + seq2.substring(i, i + 1);
String out;
switch (diff) {
    case "GT":
        out = "K";
        break;
    case "AT":
        out = "W";
        break;
    ...
}


Or:

You can make an enum type:

enum Ambiguous {
    AA("A"), AT("W"), TA("W"), GT("K"); // ...

    private final String ambiguousLetter;

    Ambiguous(String s) {
        ambiguousLetter = s;
    }

    String getAmbiguousLetter() {
        return ambiguousLetter;
    }
}

// substring(i, i + 1) takes only the character at position i
String out = Ambiguous.valueOf(seq1.substring(i, i + 1) + seq2.substring(i, i + 1)).getAmbiguousLetter();


Or: use a HashMap:

import java.util.HashMap;
...

HashMap<String, String> ambCodes = new HashMap<String, String>();
// fill the map:
ambCodes.put( "AT", "W");

// later on:
ambCodes.get("AT"); // returns "W"
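
A slightly fuller sketch of the HashMap option, assuming the inputs contain only A, C, G and T. The class name `MapConsensus` and the trick of sorting the two characters, so that each unordered pair is stored only once, are illustrative choices rather than part of the original answer:

```java
import java.util.HashMap;
import java.util.Map;

final class MapConsensus {
    // Keys hold the two bases in alphabetical order, so "AT" also covers "TA".
    private static final Map<String, Character> AMB_CODES = new HashMap<>();
    static {
        AMB_CODES.put("AC", 'M');
        AMB_CODES.put("AG", 'R');
        AMB_CODES.put("AT", 'W');
        AMB_CODES.put("CG", 'S');
        AMB_CODES.put("CT", 'Y');
        AMB_CODES.put("GT", 'K');
    }

    // Returns the consensus symbol for two plain bases.
    // Identical symbols are returned unchanged, without validation.
    static char consensus(char a, char b) {
        if (a == b) {
            return a;
        }
        String key = a < b ? "" + a + b : "" + b + a;
        Character code = AMB_CODES.get(key);
        if (code == null) {
            throw new IllegalArgumentException("No ambiguity code for pair: " + key);
        }
        return code;
    }

    // Builds the consensus of two sequences of equal length, position by position.
    static String consensus(String seq1, String seq2) {
        if (seq1.length() != seq2.length()) {
            throw new IllegalArgumentException("Sequences must have the same length");
        }
        StringBuilder out = new StringBuilder(seq1.length());
        for (int i = 0; i < seq1.length(); i++) {
            out.append(consensus(seq1.charAt(i), seq2.charAt(i)));
        }
        return out.toString();
    }
}
```

For example, `MapConsensus.consensus("ACGT", "AGGA")` yields `"ASGW"`. Storing only sorted pairs halves the number of entries and removes the risk of the two orderings disagreeing.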

0

I'll use the enum, because it is the cleanest one to maintain. Thanks!

0

I think that's the cleanest choice, too. Good luck.