I am mapping a large number of ESTs from other species to a draft of a novel genome. These EST sequences are contaminated with vectors, adaptors and ribosomal RNA, and are plagued by stretches of low-complexity sequence artifacts, etc. I have already run SeqClean, but I am bewildered by its ability to leave intact, e.g., the first 100 nucleotides of a sequence even though they match a bunch of known vectors at 98-99% identity. The same goes for things that are easy to spot (ribosomal sequences and low complexity). I assumed that such artifacts would not be mapped easily, but somehow GMAP manages to map them anyway in --tolerant mode.
I have not benchmarked it yet, but if memory serves me right, in the fairly distant past I was getting more reliable output using pregap4 from the Staden package. There is also a new tool called SeqTrim. Has anybody used it already? Can you recommend anything else?
EDIT: With the default sequence library, SeqTrim is stricter than SeqClean (5539 vs 6522 non-zero-length sequences left out of 6693). Using the same EST set and the same GMAP settings, I get 10464 vs 12992 cDNA_matches (after an extra step of removing ribosomal RNA sequences). SeqTrim sometimes cuts a reasonable-looking EST (e.g. GT153378.1) down to zero length. On the other hand, it kicks out rRNA quite well (just one EST missed vs 52 missed by SeqClean).
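To put the retention numbers from the EDIT on a common scale, here is a minimal Python sketch that turns them into percentages. The counts are copied from the comparison above; the function and variable names are my own and are not part of SeqTrim, SeqClean or GMAP.

```python
# Counts taken from the EDIT above; all names here are illustrative only.
TOTAL_ESTS = 6693
RETAINED = {"SeqTrim": 5539, "SeqClean": 6522}   # non-zero-length sequences left
RRNA_MISSED = {"SeqTrim": 1, "SeqClean": 52}     # rRNA ESTs that slipped through


def summarize(total, retained, rrna_missed):
    """Return per-tool counts of kept/dropped sequences and the kept percentage."""
    summary = {}
    for tool, kept in retained.items():
        summary[tool] = {
            "kept": kept,
            "dropped": total - kept,
            "kept_pct": round(100.0 * kept / total, 1),
            "rrna_missed": rrna_missed[tool],
        }
    return summary


summary = summarize(TOTAL_ESTS, RETAINED, RRNA_MISSED)
for tool, row in summary.items():
    print(
        f"{tool}: kept {row['kept']} ({row['kept_pct']}%), "
        f"dropped {row['dropped']}, rRNA missed {row['rrna_missed']}"
    )
```

So SeqTrim reduces roughly one in six ESTs to zero length, while SeqClean drops fewer than 3% but lets far more rRNA through.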