Transcriptome Assembly Of Illumina Reads On Low-Mem Machines?
13.0 years ago

Are there any tools/options to assemble transcriptome Illumina datasets on low memory machines? I know there are tools for machines with RAM in the hundreds of gigs, but I would like to know if there are any for low-mem workstation-like machines.

For example: a dataset of 3 GA2 runs with 7 lanes per run, 75 bp PE, about 200 GB of FASTQ files, for an insect species whose closest available reference genome diverged more than 100 million years ago.

transcriptome assembly memory • 3.8k views
13.0 years ago

trans-ABySS is designed to work on a cluster with each node having about 2 GB of memory. It is not as memory-intensive as other de Bruijn graph based methods, whose memory use scales linearly with genome size. I've not tried it on a single machine (nor do I know if that is possible, sorry), but it is worth a look.


My understanding is that trans-ABySS takes pre-assembled contigs as input. Or is that only an optional input format alongside raw reads?


Yes, you need to run ABySS on the reads first; trans-ABySS then works on the resulting contigs.

13.0 years ago

I suppose you might want to look at CLC or Ray. That said, sometimes it's just easier to find someone with a big machine.

Most assembler authors will tell you that transcriptome assemblies suffer from:

  1. Splicing variants. With bigger k-mers, splicing variants don't introduce as many ambiguities, so use larger k values where you can.
  2. Non-uniform coverage, though most won't explain exactly why that is deleterious. My understanding is that it comes down to high-coverage regions being treated as repeats, and coverage becoming useless as a tiebreaker when choosing between ambiguous paths. At any rate, there are a couple of strategies available to normalize or flatten coverage. Obviously you'll want to remove exact duplicates, but an EST clustering approach (using Vmatch or other tools) might be handy as well.
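
The "remove exact duplicates" step can be sketched in Python roughly as below. This is only an illustration, not any assembler's built-in method: `dedupe_pairs` and the `(forward, reverse)` tuple layout are hypothetical names chosen here, and the sketch keeps a truncated MD5 digest per pair instead of the sequences themselves so the set stays smaller. A truncated digest can, very rarely, collide and drop a read pair that is not a true duplicate.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

# Hypothetical in-memory representation: (forward sequence, reverse sequence)
Pair = Tuple[str, str]


def dedupe_pairs(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Yield each distinct read pair once, in first-seen order."""
    seen = set()
    for fwd, rev in pairs:
        # Hash both mates together so a pair counts as a duplicate only
        # when forward AND reverse sequences are identical.
        key = (fwd + "\t" + rev).encode("ascii")
        digest = hashlib.md5(key).digest()[:8]  # keep 8 bytes per pair
        if digest in seen:
            continue
        seen.add(digest)
        yield fwd, rev
```

For very large inputs the set itself can outgrow RAM, so this may need to be applied per lane or replaced by a sort-based approach on disk.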

Has anybody tried Ray for transcriptome assembly?

13.0 years ago
Darked89 4.6k

Just random untested ideas:

Reduce the complexity of the input set. Sequencing errors add spurious k-mers that the assembler has to hold in RAM, so strict quality filtering may help.
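
A strict quality filter of this kind could look like the Python sketch below. It is untested on real data and makes several assumptions: four-line FASTQ records, a Phred offset of 33 (older Illumina pipelines used 64, so check your files), and hypothetical names and thresholds (`filter_fastq`, `min_phred=20`, `max_low=0`) picked only for illustration. It streams one record at a time, so memory use does not grow with file size.

```python
import io
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Stream four-line FASTQ records; stops at end of file or a blank header."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def passes_filter(seq: str, qual: str, min_phred: int = 20,
                  offset: int = 33, max_low: int = 0) -> bool:
    """Reject reads with any N or with more than max_low low-quality bases."""
    if "N" in seq.upper():
        return False
    low = sum(1 for ch in qual if ord(ch) - offset < min_phred)
    return low <= max_low


def filter_fastq(handle: TextIO, **kwargs) -> Iterator[Record]:
    """Yield only the records that pass the filter."""
    for header, seq, qual in read_fastq(handle):
        if passes_filter(seq, qual, **kwargs):
            yield header, seq, qual


# Tiny made-up example: r2 contains an N, r3 has one very low quality base.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGN\n+\nIIII\n@r3\nACGT\n+\nII#I\n")
print([header for header, _, _ in filter_fastq(demo)])  # ['@r1']
```

For paired-end data both mates would have to be filtered together so the two files stay in sync; that is left out here.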

Error correction (though this may require an amount of RAM comparable to the assembly itself). There are programs for k-mer based error correction which, according to their authors, improve the quality of genomic assemblies. This is likely to hold for transcript assembly as well.
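
To illustrate the idea behind k-mer based correction, here is a small Python sketch of the counting step only. It is not a real error corrector and not the method of any particular program; `count_kmers`, `weak_positions` and the `min_count` threshold are hypothetical. It also shows where the RAM goes: the counter holds every distinct k-mer in the data. One caveat for transcriptomes is that coverage is non-uniform, so k-mers from lowly expressed transcripts can look just as rare as errors.

```python
from collections import Counter
from typing import Iterable, List


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_positions(read: str, counts: Counter, k: int,
                   min_count: int = 2) -> List[int]:
    """Return start positions of k-mers seen fewer than min_count times.

    A run of such 'weak' k-mers usually brackets a likely sequencing error.
    """
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]
```

With three copies of `ACGTACGT` and one read `ACGTTCGT` (a single substitution), the k-mers overlapping the changed base are the ones flagged as weak.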

Instead of a reference genome, try a "reference transcriptome". Hopefully you will find some ESTs, be it Sanger or 454, which can be assembled with less pain.

Try to get at least the mitochondrial sequences of your species, or of anything close. A lot of RNA-Seq reads match them. The same goes for ribosomal sequences.

Get even a small chunk of genomic sequence (say cosmid-sized) with some repeats in it. Map the reads to it allowing a large number of mismatches, and filter out everything that maps to the repetitive parts.
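
The screening ideas above (mitochondrial, ribosomal or repeat sequences) would normally be done with a read mapper. As a rough stand-in, the Python sketch below flags reads by the fraction of their k-mers shared with a set of screen sequences. This swaps in plain k-mer matching for real alignment, and the names `build_screen` and `matches_screen` and the `min_fraction` cutoff are hypothetical; lowering `min_fraction` only loosely plays the role of allowing more mismatches.

```python
from typing import Iterable, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def build_screen(sequences: Iterable[str], k: int) -> Set[str]:
    """Collect all k-mers of the screen sequences on both strands."""
    kmers = set()
    for seq in sequences:
        for strand in (seq.upper(), reverse_complement(seq)):
            for i in range(len(strand) - k + 1):
                kmers.add(strand[i:i + k])
    return kmers


def matches_screen(read: str, screen: Set[str], k: int,
                   min_fraction: float = 0.5) -> bool:
    """True if at least min_fraction of the read's k-mers are in the screen."""
    read = read.upper()
    total = len(read) - k + 1
    if total <= 0:
        return False
    hits = sum(1 for i in range(total) if read[i:i + k] in screen)
    return hits / total >= min_fraction
```

Reads for which `matches_screen` returns `True` would be set aside before assembly. The screen set only has to hold the k-mers of the small reference sequences, so its memory footprint is modest.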

13.0 years ago
Geparada ★ 1.5k

We want to do the same and we don't have a very high memory machine, so we will probably use Cufflinks on the public Galaxy server or on Galaxy in the cloud.

http://main.g2.bx.psu.edu/root?tool_id=cufflinks

Has anybody tried Cufflinks standalone or through Galaxy?


Cufflinks is not a de-novo assembler. It first requires alignment to a reference genome, then combines these aligned reads into transcripts.


Today (I think) it isn't a problem, because there are so many genomes available.
