Question: Transcriptome Assembly From 454 Data: Should Repeated Regions In Reads Be Masked Prior To Assembly?
6
8.3 years ago by
Eric Fournier1.4k
Quebec, Canada
Eric Fournier1.4k wrote:

I have in hand the results of a 454 sequencing experiment, and I am trying to reassemble the transcriptome it represents. I would like to know what the general consensus is on whether repetitive elements within the reads should be masked prior to assembly.

My own tests seem to show that not masking the reads leads to erroneous constructs, but I am looking for better-informed opinions or literature that would help me wrap my head around the issue.

assembly repeats transcriptome • 2.6k views
written 8.3 years ago by Eric Fournier1.4k

I am very much interested in the answer for this question.

written 8.3 years ago by 2184687-1231-83-5.0k
3
8.3 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

I see where you are coming from, but I don't think this is a necessary step.

Repeats foil all assemblies, but most assemblers (even Overlap Layout Consensus ones like Newbler) know not to pursue overlaps greedily in the face of ambiguity; they just break the assembly.

let's say you had two transcripts

A-R-B-R-C

A-R-D

And no pair spanned a repeat, I think you would probably get the following contigs:

A-R

R-B-R

R-C

R-D

So now at least you have some, albeit ambiguous, overlaps that could assist scaffolding. If you mask the repeats then you lose those, too.
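The breaking behaviour described above can be illustrated with a toy sketch. It is not how Newbler or any real assembler works internally: it assumes each transcript is already given as an ordered list of labelled segments, treats any segment with more than one distinct neighbour on either side as ambiguous, and cuts contigs there while keeping the ambiguous segment on both sides of the cut. The function name `break_at_ambiguity` is made up for this illustration.

```python
from collections import defaultdict


def break_at_ambiguity(transcripts):
    """Split segment paths wherever a segment has ambiguous neighbours."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for path in transcripts:
        for left, right in zip(path, path[1:]):
            succ[left].add(right)
            pred[right].add(left)

    # A segment is ambiguous if more than one distinct segment can
    # follow it or precede it across all transcripts.
    ambiguous = {
        seg
        for path in transcripts
        for seg in path
        if len(succ[seg]) > 1 or len(pred[seg]) > 1
    }

    contigs = []
    for path in transcripts:
        current = [path[0]]
        for seg in path[1:]:
            current.append(seg)
            if seg in ambiguous:
                # Close the contig at the repeat and restart from it,
                # so the repeat stays on both contig ends.
                if current not in contigs:
                    contigs.append(current)
                current = [seg]
        if len(current) > 1 and current not in contigs:
            contigs.append(current)
    return ["-".join(c) for c in contigs]


if __name__ == "__main__":
    toy = [list("ARBRC"), list("ARD")]
    for contig in break_at_ambiguity(toy):
        print(contig)
```

With the two toy transcripts, only R is ambiguous, and the contigs still carry R at their ends, which is the overlap information that masking would discard.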

That having been said I have not tried masking repeats, so you might be onto something.

written 8.3 years ago by Jeremy Leipzig18k
3
8.3 years ago by
Leonor Palmeira3.7k
Liège, Belgium
Leonor Palmeira3.7k wrote:

What assemblers have you tried for this task? I would say that masking is only worth considering if you are working on extremely repetitive sequences and you know that the transcripts you are interested in are not in those regions.

But there are some very good assemblers out there which do a pretty good job at assembling repeats (like MIRA), especially if you are working with long 454 reads (average length?).

Could you explain in more detail the strange behaviour you observe in your transcript constructs? We might then be able to circumvent the problem.

written 8.3 years ago by Leonor Palmeira3.7k

I've been using CAP3 for assembly. I've considered trying alternate assemblers (MIRA, especially), but I've not yet had the time to do so. However, regardless of the assembler being used, I figured that "to mask or not to mask" will still be a valid question and that I might as well make up my mind about it right now.

The 454 reads aren't particularly long. After cleaning and removal of short reads, they average ~270bp.

written 8.3 years ago by Eric Fournier1.4k

I might have spoken too fast when talking about "Erroneous constructs", given that I haven't come up with any objective quality metrics. The assemblies certainly are different, though.

Some transcripts end up being (correctly) longer when using masked data. Others end up broken into pieces, even though they do not contain repeated elements. Some of the transcripts with the most reads in the non-masked assembly are also apparent misassemblies: repeated regions A-B-C form a single transcript, whereas they are only found as A-C in reference sequences.
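As one possible starting point for an objective comparison, the sketch below computes a few simple length statistics (contig count, total bases, longest contig, N50) from a list of contig lengths. The lengths in the usage example are invented, not taken from this experiment. N50 in particular is a weak indicator for transcriptomes, since a misassembly that fuses transcripts also raises it, so these numbers are best read alongside alignments to reference sequences.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def summarize(lengths):
    """Basic assembly statistics for a list of contig lengths."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Hypothetical contig lengths for a masked and a non-masked assembly.
    masked = [1200, 900, 450, 300, 300]
    unmasked = [2500, 400, 250]
    print("masked  ", summarize(masked))
    print("unmasked", summarize(unmasked))
```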

written 8.3 years ago by Eric Fournier1.4k

Regarding read length, I actually wanted to know the length of the repeats compared to the read length. Are we talking about repeats much smaller than the read length, or the other way around?
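One way to answer this question from the data is sketched below. It assumes the reads have already been soft-masked, i.e. repeat bases are written in lowercase and everything else in uppercase (RepeatMasker can produce this with its `-xsmall` option). The function names are made up for this illustration, and hard-masked reads (repeats replaced by N) would need a different pattern.

```python
import re


def masked_runs(seq):
    """Lengths of consecutive lowercase (soft-masked) stretches in a read."""
    return [m.end() - m.start() for m in re.finditer(r"[a-z]+", seq)]


def repeat_profile(seq):
    """Compare the masked stretches of one read with the read length."""
    runs = masked_runs(seq)
    longest = max(runs, default=0)
    return {
        "read_length": len(seq),
        "longest_repeat": longest,
        "masked_fraction": sum(runs) / len(seq) if seq else 0.0,
        # True when the whole read is a single masked stretch, i.e. the
        # repeat is at least as long as the read.
        "fully_repeat": len(seq) > 0 and longest == len(seq),
    }


if __name__ == "__main__":
    print(repeat_profile("ACGTacgtacGT"))
```

Reads flagged `fully_repeat` are the ones that cannot anchor a repeat to unique flanking sequence, so their share of the dataset gives a rough idea of how much ambiguity the assembler has to handle.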

On the "erroneous" part, have you looked at the methods in the papers for the reference sequences? By this I mean that if they have all masked their repeats and observe A-C, this doesn't mean that A-C is the correct assembly, only that it is the assembly you obtain when repeat masking...

Side question: which assembler are you using? Have you tested several?

written 8.2 years ago by Leonor Palmeira3.7k
1
8.2 years ago by
Cbouyio10
Paris, FR
Cbouyio10 wrote:

All repeats in transcriptome reads come from transcriptionally active retrotransposons, and you do not want them if your goal is to reconstruct the coding part of your organism of interest. I can see the merit of keeping them in genomic assemblies, but they are just noise in transcriptome assemblies. Again, if your goal is NOT to study repeats, I find no reason for including them in the reads that go to the assembler. Do a thorough search of your reads against an established repeat database for the organism of interest and remove the reads that match repeats. The examples the previous contributors have presented (e.g. that repeats might help to bridge contigs) are valid for genomic assemblies and not for transcriptomes.
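The clean-up step suggested here could look roughly like the sketch below. It assumes the search against the repeat database has already been run separately and that its results have been reduced to a set of read IDs. That set, the example reads, and the helper names `read_fasta` and `drop_repeat_reads` are all hypothetical.

```python
def read_fasta(lines):
    """Yield (read_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the read ID.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_repeat_reads(records, repeat_ids):
    """Keep only the reads whose ID did not hit the repeat database."""
    return [(name, seq) for name, seq in records if name not in repeat_ids]


if __name__ == "__main__":
    fasta = [">r1 sample read", "ACGT", "AC", ">r2", "GGGG"]
    repeat_ids = {"r2"}  # hypothetical: IDs of reads that matched a repeat
    for name, seq in drop_repeat_reads(read_fasta(fasta), repeat_ids):
        print(f">{name}\n{seq}")
```

Note that this removes whole reads rather than masking the matching bases, so a read that is mostly unique sequence with a short repeat hit is lost as well. A minimum hit-length or coverage threshold when building `repeat_ids` would be one way to soften that.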

written 8.2 years ago by Cbouyio10