I have assembled a C. elegans genome from raw DNA-seq reads and ended up with a repeat-masked FASTA file of scaffolds. I then aligned a random EST sequence onto the scaffolds using BLAST, so I have a plain-text or XML file with the alignments.
I want to continue following "A beginner’s guide to eukaryotic genome annotation" by Mark Yandell and Daniel Ence, which says the following about processing BLAST results:
"... the remaining data are sometimes clustered to identify overlapping alignments and predictions. Clustering has two purposes. First, it groups diverse computational results into a single cluster of data, all supporting the same gene. Second, it identifies and purges redundant evidence; highly expressed genes, for example, may be supported by hundreds if not thousands of identical ESTs."
I can only imagine the two aforementioned cases as the same case. Multiple ESTs aligned onto a specific gene are overlapping results that could be clustered together. What else could the first case ("diverse results all supporting the same gene") refer to? Isn't it the same thing?
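To make the question concrete, here is a minimal Python sketch of how I currently picture the clustering step. It is only my own illustration, not something taken from the guide: the `Hit` tuple, the function names `cluster_hits` and `purge_redundant`, and the example coordinates are all made up, and I am assuming that "redundant" means hits covering exactly the same span.

```python
from collections import namedtuple

# Hypothetical record for one alignment; the field names are my own choice.
Hit = namedtuple("Hit", ["scaffold", "start", "end", "query"])


def cluster_hits(hits):
    """Group hits that lie on the same scaffold and overlap each other."""
    clusters = []
    current_end = None  # rightmost coordinate covered by the open cluster
    for hit in sorted(hits, key=lambda h: (h.scaffold, h.start, h.end)):
        last = clusters[-1] if clusters else None
        if last and last[0].scaffold == hit.scaffold and hit.start <= current_end:
            # Overlaps the open cluster: extend it.
            last.append(hit)
            current_end = max(current_end, hit.end)
        else:
            # New scaffold or a gap: start a new cluster.
            clusters.append([hit])
            current_end = hit.end
    return clusters


def purge_redundant(cluster):
    """Keep one hit per distinct (start, end) span (my assumed notion of redundancy)."""
    seen = set()
    kept = []
    for hit in cluster:
        span = (hit.start, hit.end)
        if span not in seen:
            seen.add(span)
            kept.append(hit)
    return kept


# Made-up example coordinates, not real BLAST output.
hits = [
    Hit("scaffold_1", 100, 500, "EST_a"),
    Hit("scaffold_1", 100, 500, "EST_b"),
    Hit("scaffold_1", 450, 900, "EST_c"),
    Hit("scaffold_1", 2000, 2400, "EST_d"),
    Hit("scaffold_2", 150, 300, "EST_e"),
]
clusters = cluster_hits(hits)
nonredundant = [purge_redundant(c) for c in clusters]
```

In this sketch both steps fall out of the same overlap computation, which is why the two purposes look identical to me.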