Question: What are the best approaches to evaluate a genome assembly using the 'intrinsic' data?
3
4.6 years ago by
fhsantanna440
Brazil
fhsantanna440 wrote:

I have assembled four bacterial genomes from MiSeq paired-end sequencing data using the following steps:

1 - Assembly using CLC Workbench;

2 - Assembly using SPAdes;

3 - Assembly using A5 pipeline;

4 - Merging of the three assemblies using CISA;

5 - Quality check of the assemblies using QUAST.

To detect misassemblies, QUAST relies on a reference genome. However, for most of my draft genomes I do not have a suitable reference genome (they differ too much from the genomes deposited in GenBank).

So my question is: how could I validate the genome assembly using intrinsic data? For example, when using read mapping, what criteria should be used to correct a region? What is the best software for this purpose?

Thanks.

 

validation assembly genome • 4.4k views
modified 4.6 years ago by lexnederbragt1.2k • written 4.6 years ago by fhsantanna440
3
4.6 years ago by
Leszek4.0k
IIMCB, Poland
Leszek4.0k wrote:

I don't know of any ready-made solution, but you can try looking at:

  • fraction of reads that aligned - if many reads didn't align, you are probably missing some regions in your assembly
  • fraction of reads with concordant pairing (e.g. from samtools flagstat) - if this is low, you likely have rearrangements or high genome fragmentation
  • pairwise genome alignments (e.g. with nucmer or lastal) of your assemblies to check for large inconsistencies between them

It's always good to compare against the chromosomes of a related species to check whether your assembly makes sense.
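As a rough illustration of the first two checks in the list above, here is a minimal Python sketch that reads the text printed by `samtools flagstat` and computes the mapped and properly paired fractions. It assumes the reads were already mapped back to the assembly and that flagstat was run on the resulting BAM. The function names and the numbers in `SAMPLE` are made up for the example, and the exact set of flagstat lines varies between samtools versions, so treat the parser as an assumption rather than a specification.

```python
import re

# Matches flagstat lines such as "950 + 0 mapped (95.00% : N/A)".
_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text):
    """Return {category label: QC-passed read count} from flagstat text."""
    counts = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if not match:
            continue
        # Drop any trailing parenthetical, e.g. " (95.00% : N/A)".
        label = match.group(3).split(" (")[0].strip()
        # Keep the first occurrence if a label repeats.
        counts.setdefault(label, int(match.group(1)))
    return counts


def mapping_summary(text):
    """Compute the two fractions discussed above (hypothetical helper)."""
    counts = parse_flagstat(text)
    total = counts.get("in total", 0)
    paired = counts.get("paired in sequencing", 0)
    return {
        "mapped_fraction": counts.get("mapped", 0) / total if total else 0.0,
        "properly_paired_fraction": (
            counts.get("properly paired", 0) / paired if paired else 0.0
        ),
    }


# Illustrative flagstat-style text; the counts are invented, not real data.
SAMPLE = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
950 + 0 mapped (95.00% : N/A)
1000 + 0 paired in sequencing
500 + 0 read1
500 + 0 read2
900 + 0 properly paired (90.00% : N/A)
"""

if __name__ == "__main__":
    print(mapping_summary(SAMPLE))
```

Running the same summary on each of the assemblies (CLC, SPAdes, A5, CISA) with the same read set gives directly comparable numbers; what counts as "low" is a judgment call and is not fixed by the code.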

written 4.6 years ago by Leszek4.0k

Should I use corrected reads or raw ones? I mapped the raw ones to the contigs and most of them did not map...

written 4.6 years ago by fhsantanna440

I use raw reads, as modern aligners are quite good at aligning even poor-quality reads. If a lot of your reads fail to align, it doesn't necessarily mean your assembly is wrong. You can check the quality of your reads, e.g. with FastQC.
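For a quick look before (not instead of) running FastQC, a sketch like the following computes the mean Phred score of each read in a FASTQ file. It assumes plain four-line FASTQ records and Phred+33 encoding; the function names, the example reads, and the cutoff of 20 are illustrative assumptions, not a recommended threshold.

```python
import io


def mean_read_qualities(handle, offset=33):
    """Yield the mean Phred score of each read in a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        handle.readline()  # sequence line (not needed here)
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        if qual:
            yield sum(ord(ch) - offset for ch in qual) / len(qual)


def fraction_below(handle, threshold=20.0):
    """Fraction of reads whose mean quality is below `threshold` (assumed cutoff)."""
    scores = list(mean_read_qualities(handle))
    if not scores:
        return 0.0
    return sum(score < threshold for score in scores) / len(scores)


# Two invented reads: 'I' encodes Phred 40 and '#' encodes Phred 2 in Phred+33.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n####\n"

if __name__ == "__main__":
    print(list(mean_read_qualities(io.StringIO(EXAMPLE))))
    print(fraction_below(io.StringIO(EXAMPLE)))
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper (for gzipped input, `gzip.open(path, "rt")` from the standard library works the same way).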

written 4.6 years ago by Leszek4.0k
6
4.6 years ago by
Mikael Huss4.6k
Stockholm
Mikael Huss4.6k wrote:

You could try FRCbam, ALE or REAPR. All of these are supposed to evaluate assemblies without the need for a reference genome. However, my experience with them is quite limited.

written 4.6 years ago by Mikael Huss4.6k

I don't (yet) know about the other two, but FRCbam actually needs _two_ libraries: a paired-end (PE) library and a mate-pair (MP) library. It seems that the original poster only has a PE library. I don't know whether there are any "hacks" to get FRCbam to work correctly on such data.

written 4.6 years ago by cedric.laczny50
2
4.6 years ago by
lexnederbragt1.2k
Oslo, Norway
lexnederbragt1.2k wrote:

First, the best assembly depends on your research question. Do you need just presence/absence of genes, or is this going to be the reference genome for a larger study?

Second, in addition to the other answers, you could annotate each assembly and check which one seems to be more complete.

written 4.6 years ago by lexnederbragt1.2k