In a plant genome project I have a draft assembly (>500 Mbp, >500k contigs). A number of the contigs are no doubt bacterial in origin.
There are at least 3 peaks in the GC content distribution (40%: my plant; 50%: the largest contig; 65-70%: another group).
Blastn takes ages, and there is no point in rerunning it every time we change the assembler parameters even slightly. So while sooner rather than later I will have to split the 454 sff files into my_plant vs not_my_plant, I will still need a faster method of assigning contigs to the not_my_plant group.
In metagenomics this is often done by calculating k-mer frequencies; see e.g. TETRA (no longer supported): http://www.megx.net/tetra/ (the manual describes the algorithm).
Do you use any program for fast clustering/classification of sequences from, say, 150 bp to 1 Mbp using k-mer frequencies?
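To make concrete what I mean by k-mer frequencies, here is a minimal sketch. It is not TETRA's algorithm (as far as I understand, TETRA works with z-scores of tetranucleotide over- and under-representation); it only computes plain normalized tetranucleotide frequencies with reverse complements merged, and compares two profiles by Pearson correlation. All function names are my own and purely illustrative.

```python
from collections import Counter
from itertools import product
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID = frozenset("ACGT")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Strand-independent form: the smaller of a k-mer and its reverse complement."""
    return min(kmer, revcomp(kmer))


def canonical_kmers(k=4):
    """Sorted list of all canonical k-mers (136 of them for k=4)."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def kmer_profile(seq, k=4):
    """Normalized canonical k-mer frequencies; k-mers with N or other symbols are skipped."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if VALID.issuperset(kmer):
            counts[canonical(kmer)] += 1
    keys = canonical_kmers(k)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(keys)
    return [counts[key] / total for key in keys]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)
```

The idea would be to build one profile from contigs I trust to be plant, then flag contigs whose profile correlates poorly with it. Pure Python like this is far too slow for >500k contigs, and very short contigs (around 150 bp) contain too few tetramers for a stable profile, which is why I am asking about existing tools.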