"The human immune system produces a vast variety of antibodies in order to respond to external stimuli. Next-generation sequencing technology allows researchers to obtain the sequences of all antibodies from a single person. Clustering these antibody sequences allows us to understand the lineage structure of all human antibodies. Researchers can then further monitor the global changes of the human antibody repertoire in response to the different medical conditions, and identify the true acting antibodies with the potential to cure the disease.
However, the number of antibody sequences from a single sample can go up to millions. Clustering at such a large scale poses a major computational challenge.
The goal of this contest is to optimize an existing algorithm for clustering antibody sequences, which computes a pairwise distance matrix and then performs hierarchical clustering to group sequences into clusters. This algorithm is implemented in Python as the provided clonify_contest.py script (will be posted in the MM forum)"
As it is a topic I'm partially involved in (antibody sequencing), I would like to share some basic thoughts on this question.
First, if I've got this right, they are going to verify the contestants' code against their own algorithm, which they treat as a sort of "gold standard":
Their algorithm computes a distance score based on:
Hamming distance between CDR3 regions
Variable/Joining segment match
Mismatches in Variable segment, i.e. somatic hypermutations
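To make the three ingredients concrete, here is a minimal sketch of how such a score could be put together. This is not the actual formula from clonify_contest.py: the function `toy_distance`, the record layout (`cdr3`, `v`, `j`, `mutations`) and all the weights are my own assumptions for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def toy_distance(ab1: dict, ab2: dict,
                 v_penalty: float = 10.0,
                 j_penalty: float = 5.0,
                 shm_weight: float = 0.5) -> float:
    """Illustrative score only; the weights and the formula are hypothetical."""
    # Assumption: CDR3s of different length are never co-clustered.
    if len(ab1["cdr3"]) != len(ab2["cdr3"]):
        return float("inf")
    # 1. Hamming distance between CDR3 regions.
    score = float(hamming(ab1["cdr3"], ab2["cdr3"]))
    # 2. Variable/Joining segment match (exact string match, allele included).
    if ab1["v"] != ab2["v"]:
        score += v_penalty
    if ab1["j"] != ab2["j"]:
        score += j_penalty
    # 3. V-segment mutations found in only one of the two antibodies.
    unshared = set(ab1["mutations"]) ^ set(ab2["mutations"])
    score += shm_weight * len(unshared)
    return score


a = {"cdr3": "CARDYW", "v": "IGHV5-1*01", "j": "IGHJ4*02",
     "mutations": ["A12G", "C40T"]}
b = {"cdr3": "CARDFW", "v": "IGHV5-1*01", "j": "IGHJ4*02",
     "mutations": ["A12G"]}
print(toy_distance(a, b))  # 1 CDR3 mismatch + 0.5 * 1 unshared mutation = 1.5
```

Note that the V segment is compared as a plain string here, so two alleles of the same gene are penalized exactly like two unrelated genes.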
The verification goes through all pairs of antibodies and counts the cases in which the contestant's algorithm and the gold standard algorithm agree, i.e. both assign the pair to the same cluster or both assign it to distinct clusters.
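As far as I can tell, this pairwise agreement is what is usually called the Rand index. A minimal sketch, assuming each clustering is given as a list of cluster labels in the same antibody order (the function name `pair_concordance` is mine, and the contest's exact scoring may differ):

```python
from itertools import combinations


def pair_concordance(reference: list, candidate: list) -> float:
    """Fraction of antibody pairs on which two clusterings agree."""
    if len(reference) != len(candidate):
        raise ValueError("Both clusterings must label the same antibodies")
    agree = total = 0
    for i, j in combinations(range(len(reference)), 2):
        same_in_ref = reference[i] == reference[j]
        same_in_cand = candidate[i] == candidate[j]
        # Concordant: co-clustered in both, or separated in both.
        agree += same_in_ref == same_in_cand
        total += 1
    return agree / total if total else 1.0


gold = [0, 0, 1, 1]
mine = [5, 5, 5, 7]
print(pair_concordance(gold, mine))  # 3 of 6 pairs agree -> 0.5
```

The loop is quadratic in the number of sequences, so for millions of antibodies one would rather count pairs from a contingency table of the two labelings; the sketch only shows the definition.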
I have some points regarding the biological principles behind the gold standard algorithm:
The major point is that a biologically meaningful co-clustering should reflect shared antigen specificity. The only way to establish this would be to sort cells with several different tetramers and then pool the sorted populations. Antibodies that are captured by the same tetramer should be co-clustered (the "biological gold standard").
It appears that they score V-segment mismatches and hypermutations separately. Yet two distinct V segments could actually be closer to each other than heavily hypermutated sequences of the same V segment. This is especially important since they resolve V segments down to the allele level (i.e. IGHV5-1*01, *02, ...).
The hypermutations are recorded as nucleotide substitutions, while the Hamming distance is computed between CDR3 amino acid sequences. I think that either synonymous hypermutations should not be counted (this would require providing the reference sequences) or CDR3 nucleotide sequences should be used.
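To illustrate why the reference sequence is needed, here is a rough sketch that splits substitutions into synonymous and non-synonymous ones. It assumes an in-frame, gap-free, uppercase alignment of the germline and observed sequences, and it classifies per codon, which is a simplification when a codon carries several substitutions. The function name `count_substitutions` is hypothetical.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (x + y + z for x in BASES for y in BASES for z in BASES), AMINO))


def count_substitutions(germline: str, observed: str) -> tuple:
    """Return (synonymous, non_synonymous) nucleotide substitution counts."""
    if len(germline) != len(observed) or len(germline) % 3:
        raise ValueError("Need equal-length, in-frame sequences")
    syn = nonsyn = 0
    for k in range(0, len(germline), 3):
        ref_codon, obs_codon = germline[k:k + 3], observed[k:k + 3]
        n_diff = sum(x != y for x, y in zip(ref_codon, obs_codon))
        if n_diff == 0:
            continue
        # Simplification: all differences in a codon share one label.
        if CODON_TABLE[ref_codon] == CODON_TABLE[obs_codon]:
            syn += n_diff
        else:
            nonsyn += n_diff
    return syn, nonsyn


# GCT -> GCC keeps Ala (synonymous); AAA -> AGA turns Lys into Arg.
print(count_substitutions("GCTAAA", "GCCAGA"))  # (1, 1)
```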
There is also the open question of which positions in the CDR3 region determine specificity (i.e. directly interact with the antigen) and which could be mutated without losing specificity. For example, see this review: http://www.nature.com/nri/journal/v6/n12/full/nri1977.html. So I'm not sure that unweighted Hamming distance is the best way to measure CDR3 similarity.
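Just to show what a weighted alternative could look like, here is a small sketch of a position-weighted Hamming distance. The weighting scheme (down-weighting a few positions at each end of the CDR3) is purely hypothetical and is not derived from the review above; both function names are mine.

```python
def weighted_hamming(a: str, b: str, weights: list) -> float:
    """Sum of per-position weights over the mismatching positions."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("Sequences and weights must have the same length")
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def flank_weights(length: int, flank: int = 2,
                  flank_weight: float = 0.2) -> list:
    """Hypothetical scheme: positions near either end count less."""
    return [flank_weight if i < flank or i >= length - flank else 1.0
            for i in range(length)]


w = flank_weights(6)  # [0.2, 0.2, 1.0, 1.0, 0.2, 0.2]
# Mismatches at index 2 (weight 1.0) and index 4 (weight 0.2).
print(weighted_hamming("CARDYW", "CAKDFW", w))
```

With all weights set to 1.0 this reduces to the ordinary Hamming distance, so the unweighted score is just one special case of the scheme.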