Question: How To Deal With Un-Used Reads After De Novo Assembly?
9
9.0 years ago by
Lhl730
United States
Lhl730 wrote:

Hi All,

I have been trying to combine all genomic resources produced by different sequencing platforms in our lab and assemble them into contigs.

Since we do not have a reference genome for our species, we did a de novo assembly.

After the assembly finished, we still have a lot of unused reads (~30 GB).

It does not seem reasonable to simply discard them, but I do not know how to make use of them.

I am wondering whether any of you has had a similar experience and knows how to deal with this.

Thanks in advance for your valuable suggestions and discussions!

assembly denovo read • 4.0k views
modified 8.6 years ago by Michael Dondrup47k • written 9.0 years ago by Lhl730
5
9.0 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

In addition to contamination I would consider these possibilities:

  • Reads from highly repetitive regions. These can cause problems for the assembly, and the software might have removed them in preprocessing. Repetitive sequences can be checked with a low-complexity filter or a repeat finder (e.g. dust, RepeatMasker).
  • Low-quality reads. Check the base quality and get some statistics about the reads using e.g. the FastQC tool (maybe start with a subset).
  • Singleton reads, e.g. from low-coverage regions. If a read does not overlap with anything else, it cannot be assembled.
  • Reads that are contaminated with vector sequences. Check with BLAST against a vector database.
  • The sample is contaminated with a microorganism (see ALchEmiXt's answer). However, if the contaminant genome is small compared to the target, the contaminant reads might assemble better than the target genome. A BLAST search against NT might indeed help.
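
For a quick first look at the low-quality and low-complexity possibilities in the list above, a rough Python sketch could look like the following. It is not a replacement for FastQC, dust or RepeatMasker: the 3-mer entropy score is only a crude stand-in for a real low-complexity filter, the thresholds are arbitrary illustrative values, and Phred+33 quality encoding is assumed.

```python
import math
from collections import Counter


def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records or at the end
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean base quality; offset=33 is an assumption (older Illumina data used 64)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def trinucleotide_entropy(seq):
    """Shannon entropy (bits) of overlapping 3-mers; low values hint at low complexity."""
    kmers = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def summarize(records, min_quality=20, min_entropy=3.0):
    """Count reads flagged as low quality or low complexity (illustrative thresholds)."""
    summary = {"total": 0, "low_quality": 0, "low_complexity": 0}
    for _name, seq, qual in records:
        summary["total"] += 1
        if mean_phred(qual) < min_quality:
            summary["low_quality"] += 1
        if trinucleotide_entropy(seq) < min_entropy:
            summary["low_complexity"] += 1
    return summary
```

With a hypothetical file name, usage would be along the lines of `with open("unused_subset.fastq") as fh: print(summarize(read_fastq(fh)))`. Running it on a subset first keeps it fast.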

It is a bit hard to speculate further without knowing more details.

Edit: another idea would be to try a 'meta-genomics' approach on the leftover reads if you suspect contamination. For example, BLAST them against NT or NR and use MEGAN to classify the reads.
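
Before setting up MEGAN, a crude tally of best hits can already show whether one contaminant dominates. The sketch below assumes BLAST was run with tabular output (`-outfmt 6`, with the default 12 columns, where column 2 is the subject ID and column 12 the bit score). It only counts the top-scoring subject per read; it is not MEGAN's classification, just a quick look.

```python
from collections import Counter


def best_hits(blast_lines):
    """Return {read_id: (subject_id, bitscore)}, keeping the highest-bitscore hit per read."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        read_id, subject_id, bitscore = fields[0], fields[1], float(fields[11])
        if read_id not in best or bitscore > best[read_id][1]:
            best[read_id] = (subject_id, bitscore)
    return best


def tally_subjects(best):
    """Count how many reads have each subject as their best hit."""
    return Counter(subject for subject, _score in best.values())
```

Calling `tally_subjects(best_hits(open("unused_vs_nt.tsv"))).most_common(10)` (the file name is hypothetical) would list the ten subjects that attract the most reads.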

Edit2: I have been looking a bit more into validation of assemblies and came across the AMOS genome assembly validation tool. Their website has a very relevant quote that supports the contamination hypothesis but also raises an additional interesting point:

Unused read information - Not all reads provided as input to an assembler are used in the final assembly. The unused reads, also called singletons, are often contaminants or insufficiently trimmed reads from the genome. Mis-assemblies, however, also lead to the presence of unused reads, as they are inconsistent with the chosen reconstruction of the genome. As an example, the reads spanning the join point of two copies of a tandem repeat are listed as singletons when the assembler incorrectly collapses this repeat. By aligning the singletons to the contigs produced by the assembler we can identify such misassemblies.

I haven't tried AMOS yet but possibly will do so soon.
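
The idea in the quote (align the singletons back to the contigs and see where they pile up) can be approximated without AMOS. The sketch below is not AMOS's method: it assumes the unused reads were aligned to the contigs with any aligner that writes SAM, and it simply counts aligned reads per contig in fixed-size bins. The bin size is an arbitrary illustrative value. Bins with unusually many hits might be worth inspecting as candidate collapsed repeats.

```python
from collections import Counter


def hotspot_bins(sam_lines, bin_size=1000):
    """Count aligned reads per (contig, bin index) from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.split("\t")
        flag, contig, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4 or contig == "*":
            continue  # read is unmapped
        counts[(contig, (pos - 1) // bin_size)] += 1  # POS is 1-based
    return counts
```

`hotspot_bins(open("unused_vs_contigs.sam")).most_common(20)` (hypothetical file name) would then list the twenty busiest bins.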

modified 8.6 years ago • written 9.0 years ago by Michael Dondrup47k

Thanks so much for offering so many possibilities. In fact, I have already filtered out potential contamination, and I did a quality check and filtering before assembly. So I think, as you pointed out, low coverage and repeats could be the reason. I will check that. Many thanks, Michael.

written 9.0 years ago by Lhl730
2
9.0 years ago by
Philadelphia, PA
Jeremy Leipzig19k wrote:

I have found that crucial reads can be held hostage in spurious contigs that go nowhere when you use an insufficient coverage cutoff. Raising the cutoff can break up these bad contigs, allowing them to join good ones and in turn recruit more reads.

[Figure: plot from the original post; image not available]

By the way, reads from repetitive regions such as retrotransposons generally do make it into assemblies (into small contigs of incredibly high depth), but they destroy any possibility of determining gene order. That is why people (especially on the plant side) still sometimes rely on Sanger sequencing to "sequence through repeats", i.e. to connect the regions flanking repeats.
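
To get a feel for what a given coverage cutoff would remove, one can look at the coverage values already recorded in the contig headers. The sketch below assumes Velvet-style FASTA headers of the form `>NODE_<id>_length_<n>_cov_<x>` (in Velvet both numbers are expressed in k-mers). The candidate cutoffs are up to you; this only summarizes the existing contigs and does not rerun the assembly.

```python
import re

# Velvet-style contig header, e.g. ">NODE_12_length_2048_cov_15.730000"
HEADER = re.compile(r"^>NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def parse_velvet_headers(fasta_lines):
    """Return a list of (node_id, length, coverage) from Velvet-style headers."""
    contigs = []
    for line in fasta_lines:
        match = HEADER.match(line)
        if match:
            contigs.append(
                (int(match.group(1)), int(match.group(2)), float(match.group(3)))
            )
    return contigs


def below_cutoff(contigs, cutoff):
    """Return (number of contigs, summed length) with coverage below the cutoff."""
    low = [c for c in contigs if c[2] < cutoff]
    return len(low), sum(length for _node, length, _cov in low)
```

Looping `below_cutoff(contigs, c)` over a few values of `c` shows how many contigs, and how much sequence, each cutoff would affect.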

written 9.0 years ago by Jeremy Leipzig19k

This is informative. Many Thanks, Jeremy.

written 9.0 years ago by Lhl730

How did you get the different bubbles in ggplot? Would you mind sharing the command? Thanks!

written 9.0 years ago by Curiosity120

http://code.google.com/p/standardized-velvet-assembly-report/source/browse/trunk/refReport.Rnw

modified 12 months ago by RamRS30k • written 9.0 years ago by Jeremy Leipzig19k
1
9.0 years ago by
Herefordguy10
United States
Herefordguy10 wrote:

In addition to the other valuable comments, have you evaluated a different assembler? Some assemblers (ALLPATHS-LG, MSR-CA, etc.) do a better job of error correction than others, and thus use more of the data.

written 9.0 years ago by Herefordguy10

To date, I have only tried Ray and Velvet. Both yield lots of unused reads. I will be happy to try other assemblers and see if they work better. Thanks for your suggestion!

written 9.0 years ago by Lhl730
1
9.0 years ago by
ALchEmiXt1.9k
The Netherlands
ALchEmiXt1.9k wrote:

In addition to Michael's answer, you might want to check for badly clipped sequences, for instance adapters at the wrong locations. Illumina mate-pair PE libraries are famous for those artefacts.

We routinely check against a set of bowtie genome indices including vector (as suggested) but also a database of commonly used adapters. You'll be surprised... in a bad way.

Have a look at, for instance, the fastq_screen tool.
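
As a quick sanity check before setting up a proper screen, you can count how many unused reads contain a known adapter and where it sits in the read. This sketch is not what fastq_screen does (that tool aligns reads against indexed databases); it only looks for exact forward matches. The single adapter listed is the commonly cited start of the Illumina adapter and is an assumption here; replace it with the sequences for your own library kit.

```python
# Adapter name -> sequence. The entry below is an assumption for illustration;
# use the adapter sequences that belong to your library prep.
ADAPTERS = {"illumina_adapter_prefix": "AGATCGGAAGAGC"}


def screen_reads(sequences, adapters=ADAPTERS):
    """Return {adapter_name: [position of first exact match in each matching read]}."""
    positions = {name: [] for name in adapters}
    for seq in sequences:
        upper = seq.upper()
        for name, adapter in adapters.items():
            index = upper.find(adapter)
            if index != -1:
                positions[name].append(index)
    return positions
```

Many hits at positions far from the 3' end of the read would point to the kind of misplaced adapters mentioned above.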

written 9.0 years ago by ALchEmiXt1.9k
1

@lhl: Have a look at the link for fastq_screen; it details how to handle this. We keep bowtie indices for many genomes on the server anyway and let users select the databases to screen against in their particular case. Separately, we have UniVec, PhiX (including an extra region spanning the origin) and an adapter DB. The adapter sequences can be retrieved from that link as well (have a look in the readme). Otherwise PM me and I can send it. It's quite small.

written 9.0 years ago by ALchEmiXt1.9k

Hi ALchEmiXt,

This is helpful.

A quick question about the sequence databases I should search my raw reads against: should it be a database containing all bacterial nucleotide sequences plus adapter sequences? And how do I get all the adapter sequences?

Thanks a lot!

written 9.0 years ago by Lhl730

Thanks very much, ALchEmiXt. Cheers -- lhl

written 9.0 years ago by Lhl730
0
9.0 years ago by
ALchEmiXt1.9k
The Netherlands
ALchEmiXt1.9k wrote:

Did you check what a subset of the non-assembled reads is (e.g. by BLAST)? Even though no reference is available, you would not be the first to be surprised by a contaminating yeast or bacterium. If that is the case, the answer is simple: trash it.
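
With ~30 GB of unused reads, BLASTing everything is not practical, so a random subset is the usual starting point. A minimal sketch, assuming the unused reads are in a FASTQ file with four lines per record: reservoir sampling picks a fixed number of reads in a single pass without loading the file into memory, and the sample is written out as FASTA for BLAST. The sample size and seed are arbitrary choices.

```python
import random


def fastq_records(lines):
    """Yield (name, sequence) from an iterable of FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines
        seq = next(it).strip()
        next(it)  # '+' separator line
        next(it)  # quality line, not needed for BLAST
        yield header.strip()[1:], seq


def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random in one pass (all of them if fewer than k)."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

For example, `to_fasta(reservoir_sample(fastq_records(open("unused.fastq")), 10000))` (hypothetical file name) gives a FASTA string of 10,000 reads that can be written to disk and used as the BLAST query.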

written 9.0 years ago by ALchEmiXt1.9k

Hi ALchEmiXt,
Yes, I already ran BLAST to remove potential contamination! But thanks for your response! Cheers

written 9.0 years ago by Lhl730

If it is contamination, the reads should assemble into contigs anyway. Am I wrong?

written 9.0 years ago by Frédéric Bigey290

If contaminants introduce ambiguity into the graph they can disturb an assembly.

written 9.0 years ago by Jeremy Leipzig19k

I did an assembly without removing the contamination, and I did get some bacterial contigs. In fact, they are quite long!

written 9.0 years ago by Lhl730