Question: ESTs Cleanup: SeqClean Alternatives?
Darked89 (4.2k), Barcelona, Spain, wrote 8.7 years ago:

I am mapping a large number of ESTs from other species to a draft of a novel genome. These EST sequences are contaminated with vectors, adaptors and ribosomal RNA, and are plagued by stretches of low-complexity sequence artifacts, etc. I have already run SeqClean, but I am bewildered by its ability to leave, e.g., the first 100 nucleotides intact even though they match a bunch of known vectors at 98-99% identity. The same goes for things that are easy to spot (ribosomal sequences and low complexity). I assumed that such artifacts would not map easily, but somehow GMAP manages to map them anyway in --tolerant mode.

I have not benchmarked it yet, but in the fairly distant past, if memory serves me right, I was getting more reliable output using pregap4 from the Staden package. There is also a new tool called SeqTrim. Has anybody used it already? Can you recommend anything else?

EDIT: With the default sequence library, SeqTrim is stricter than SeqClean (5539 vs 6522 non-zero-length sequences out of 6693). Using the same EST set and the same GMAP settings, I get 10464 vs 12992 cDNA_matches (after an extra step of removing ribosomal RNA sequences). SeqTrim sometimes cuts a reasonable-looking EST (e.g. GT153378.1) to zero length. On the other hand, it kicks out rRNA quite well (just one EST missed vs 52 missed by SeqClean).
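
For anyone wanting to repeat this kind of comparison, the non-zero-length counts can be tallied straight from each tool's FASTA output. This is only a minimal sketch, assuming plain FASTA files; the function name `count_nonempty` is made up for illustration and is not part of SeqClean or SeqTrim.

```python
from typing import Iterable, Tuple


def count_nonempty(fasta_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (records with at least one residue, total records)."""
    total = 0
    nonempty = 0
    current_len = None  # None until the first header line is seen

    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current_len is not None:
                total += 1
                if current_len > 0:
                    nonempty += 1
            current_len = 0
        elif line and current_len is not None:
            current_len += len(line)

    # Close the final record.
    if current_len is not None:
        total += 1
        if current_len > 0:
            nonempty += 1
    return nonempty, total
```

Passing an open file handle for each cleaned output (for example `count_nonempty(open(path))`, with `path` being whatever file the tool wrote) gives the two numbers to compare; records trimmed to zero length still count towards the total.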

sequence est • 3.2k views
Haibao Tang (3.0k), Mountain View, CA, wrote 8.7 years ago:

I have also used Lucy in the past for removing vectors and low-quality nucleotides from Sanger reads.

Bach (550) wrote 8.7 years ago:

For Sanger sequences, pregap4 is actually a good and very configurable tool, so why not stick with it? You can even write small plugins to add new programs or your own filters to its functionality. SeqTrim does not look bad, but I have never used it.

Apart from that, people I know use very different pipelines where pregap4, lucy, cross_match, blast, SSAHA2 and SMALT are among the most often encountered for Sanger. For 454, it's mostly the Roche pipeline (perhaps supported by SSAHA2/SMALT) while for Illumina I've seen SSAHA2, SMALT and the FASTX toolkit.

James Hane (20) wrote 8.7 years ago:

SeqClean can be pretty good if you modify the psx file where it calls the blast executable so that it uses the same parameters as NCBI VecScreen (-q -5 -G 3 -E 3 -F "m D" -e 700 -Y 1.75e12). After this it should reproduce the same results as VecScreen, which is pretty sensitive.
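
To try those parameters outside SeqClean first, something like the following could build the command line. It is a rough sketch, assuming the legacy `blastall` binary run as `blastn`; the function name and the query and database arguments are placeholders, and this is not how SeqClean's own script is written.

```python
from typing import List

# VecScreen-style search parameters, as quoted above.
VECSCREEN_OPTS = [
    "-q", "-5",      # mismatch penalty
    "-G", "3",       # gap opening cost
    "-E", "3",       # gap extension cost
    "-F", "m D",     # DUST filtering applied as a lookup-table mask only
    "-e", "700",     # expect value cutoff
    "-Y", "1.75e12", # effective search space
]


def vecscreen_blast_cmd(query_fasta: str, vector_db: str) -> List[str]:
    """Build an argument list for legacy blastall with the VecScreen parameters.

    query_fasta and vector_db are placeholders for your own files.
    """
    base = ["blastall", "-p", "blastn", "-i", query_fasta, "-d", vector_db]
    return base + VECSCREEN_OPTS
```

The returned list can be handed to `subprocess.run` as is; keeping `"m D"` as a single list element avoids any shell quoting problems with the -F argument.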

Vashar (20) wrote 8.6 years ago:

SeqTrim's results are the same as SeqClean's without the -v and -s options.

Ketil (4.0k) wrote 8.3 years ago:

I hope you made it work out! I have tried a bunch of vector screening tools and have not found one I am happy with. Whatever you end up using, make sure you verify the results, for instance by BLASTing against vector, linker and adaptor sequences. Those short, synthetic sequences in particular tend to show up unexpectedly, and since they are small, their hits will have very high E-values.
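
As a quick sanity check before (not instead of) the BLAST verification, a plain exact-match scan for known adaptor or linker sequences on both strands can already catch leftovers. This is just a sketch under that assumption: the function names are made up, and exact matching will miss adaptors that contain mismatches or are truncated.

```python
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A/C/G/T only; other letters are kept)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_adaptor_hits(
    reads: Dict[str, str], adaptors: Dict[str, str]
) -> List[Tuple[str, str, int, str]]:
    """Return (read_id, adaptor_name, 0-based position, strand) for exact matches."""
    hits = []
    for read_id, read in reads.items():
        read_upper = read.upper()
        for name, adaptor in adaptors.items():
            forward = adaptor.upper()
            probes = (("+", forward), ("-", reverse_complement(forward)))
            for strand, probe in probes:
                # Report every occurrence, including overlapping ones.
                pos = read_upper.find(probe)
                while pos != -1:
                    hits.append((read_id, name, pos, strand))
                    pos = read_upper.find(probe, pos + 1)
    return hits
```

Any hit in supposedly cleaned reads is a reason to look closer. Note that a palindromic adaptor will be reported on both strands at the same position.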

Stephanie (0) wrote 8.2 years ago:

I am trying to use SeqTrim, but after a while I always get an out-of-memory message, after which the programme shuts down:

Out of memory! Callback called exit at /software/shared/apps/x86_64/perl/5.8.9/lib/site_perl/5.8.9/Bio/ line 676, [?] line 40615748.

Does anyone know of a decent tool that uses less memory? Note, however, that my data are Illumina genome sequence data. Whenever I try SeqTrim on a smaller file, it does work.

