EST Cleanup: SeqClean Alternatives?
13.4 years ago
Darked89 4.6k

I am mapping a large number of ESTs from other species to a draft of a novel genome. These EST sequences are contaminated with vectors, adaptors and ribosomal RNA, and are plagued by stretches of low-complexity sequence artifacts, etc. I have already run SeqClean, but I am bewildered by its ability to leave, e.g., the first 100 nucleotides intact even though they match a bunch of known vectors at 98-99% identity. The same goes for things that are easy to spot (ribosomal sequences and low complexity). I assumed that such artifacts would not be easily mapped, but somehow GMAP manages to map them anyway in --tolerant mode.

I have not benchmarked it yet, but in the fairly distant past, if memory serves me right, I was getting more reliable output using pregap4 from the Staden package. There is also a new tool called SeqTrim. Has anybody used that one already? Can you recommend anything else?

EDIT: With the default sequence library, SeqTrim is stricter than SeqClean (5539 vs 6522 non-zero-length sequences out of 6693). Using the same EST set and the same GMAP settings, I get 10464 vs 12992 cDNA_matches (after an extra step of removing ribosomal RNA sequences). SeqTrim does sometimes cut a reasonable-looking EST (e.g. GT153378.1) to zero length. On the other hand, it kicks out rRNA quite well (just one EST missed vs 52 missed by SeqClean).

sequence est • 4.5k views

I am trying to use Seqtrim, but after a while I always get an out-of-memory message, after which the programme shuts down:

Out of memory!
Callback called exit at /software/shared/apps/x86_64/perl/5.8.9/lib/site_perl/5.8.9/Bio/SeqIO.pm line 676, [?] line 40615748.

Does anyone know about a decent tool that uses less memory?

This is, however, for Illumina genome sequence data. Whenever I try Seqtrim on a smaller file, it does work.

Anyone?
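
Since the tool reportedly works on smaller files, one workaround is to split the input into chunks and process them one at a time. The following is only a minimal sketch, assuming the input is FASTA and that trimming each chunk separately gives acceptable results; the function names, the chunk size and the output naming scheme are all hypothetical. For FASTQ input the records would have to be grouped differently (typically four lines per read).

```python
from itertools import islice


def iter_fasta_records(handle):
    """Yield one FASTA record at a time as a list of lines (header + sequence)."""
    record = []
    for line in handle:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():  # skip blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield record


def split_fasta(in_path, out_prefix, records_per_chunk=100000):
    """Stream in_path and write chunks named out_prefix.0001.fasta, ...

    Only one chunk is held in memory at a time. Returns the chunk paths.
    The default chunk size is an arbitrary, hypothetical value.
    """
    paths = []
    with open(in_path) as handle:
        records = iter_fasta_records(handle)
        while True:
            chunk = list(islice(records, records_per_chunk))
            if not chunk:
                break
            path = f"{out_prefix}.{len(paths) + 1:04d}.fasta"
            with open(path, "w") as out:
                for record in chunk:
                    out.writelines(record)
            paths.append(path)
    return paths
```

Each chunk file can then be passed to the trimming tool separately and the outputs concatenated afterwards.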

13.4 years ago

I have also used Lucy before for removing vectors and low-quality nucleotides from Sanger reads.

13.4 years ago
Bach ▴ 550

For Sanger sequences, pregap4 is actually a good and very configurable tool, so why not stick with it? You can even write small plugins to add new or your own programs/filters to its functionality. Seqtrim does not look bad, but I have never used it.

Apart from that, people I know use very different pipelines where pregap4, lucy, cross_match, blast, SSAHA2 and SMALT are among the most often encountered for Sanger. For 454, it's mostly the Roche pipeline (perhaps supported by SSAHA2/SMALT) while for Illumina I've seen SSAHA2, SMALT and the FASTX toolkit.

13.3 years ago
James Hane ▴ 20

Seqclean can be pretty good if you modify the psx file where it calls the blast executable so that it uses the same parameters as NCBI VecScreen (-q -5 -G 3 -E 3 -F "m D" -e 700 -Y 1.75e12). After this it should reproduce the same results as VecScreen, which is pretty sensitive.
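
To try these parameters outside SeqClean first, a standalone legacy `blastall` call can be assembled as below. This is a sketch only: it does not show how SeqClean itself invokes BLAST, and the query file name `ests.fasta` and database name `UniVec` are placeholder assumptions. The block just builds and prints the command line, so it can be inspected before running it.

```python
import shlex

# Parameters quoted in the post above (VecScreen settings for legacy blastall).
VECSCREEN_ARGS = ["-q", "-5", "-G", "3", "-E", "3",
                  "-F", "m D", "-e", "700", "-Y", "1.75e12"]


def vecscreen_like_command(query_fasta, vector_db, blastall="blastall"):
    """Return the argument list for a blastn search with VecScreen-like settings."""
    return [blastall, "-p", "blastn", "-d", vector_db, "-i", query_fasta] + VECSCREEN_ARGS


# Hypothetical input names; replace with your own query file and vector database.
cmd = vecscreen_like_command("ests.fasta", "UniVec")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the arguments as a list makes it easy to hand them to `subprocess.run(cmd)` without worrying about quoting the `-F "m D"` filter string.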

13.2 years ago
Vashar ▴ 20

Seqtrim results are the same as those of seqclean without the -v and -s options.

13.0 years ago
Ketil 4.1k

I hope you made it work out! I have tried a bunch of vector screening tools and have not found one I am happy with. Whatever you end up using, make sure you verify the results, for instance by BLASTing against vector, linker and adaptor sequences. Those short, synthetic sequences especially tend to show up unexpectedly, and since they are short, even perfect matches will give very high E-values.
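
Because of those high E-values, it can help to run the verification BLAST with a permissive E-value cutoff and then select hits by identity and alignment length instead. A minimal sketch, assuming 12-column tabular BLAST output (`-m 8` in legacy blastall, `-outfmt 6` in BLAST+); the function name and both thresholds are hypothetical and should be tuned to the adaptors in question:

```python
def short_vector_hits(tabular_lines, min_identity=90.0, min_length=12):
    """Return hits that pass identity/length thresholds, whatever their E-value.

    Expects the standard tabular columns:
    query, subject, %identity, length, mismatches, gap opens,
    qstart, qend, sstart, send, evalue, bitscore.
    """
    hits = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        qstart, qend = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        if identity >= min_identity and length >= min_length:
            hits.append((query, subject, identity, length, qstart, qend, evalue))
    return hits
```

The query coordinates in each returned tuple show where in the EST the residual vector or adaptor sits, which makes it easy to check whether the cleaning tool should have trimmed it.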
