What are the best approaches to evaluate a genome assembly using the 'intrinsic' data?
7.4 years ago
fhsantanna ▴ 570

I have assembled four bacterial genomes from MiSeq paired-end sequencing data using the following steps:

1. Assembly using CLC Workbench;
3. Assembly using A5 pipeline;
4. Merging of the three assemblies using CISA;
5. Quality check of the assemblies using QUAST.

To check for misassemblies, QUAST relies on a reference genome. However, for most of my draft genomes I do not have a suitable reference (they differ too much from the genomes deposited in GenBank).

So my question is: how can I validate a genome assembly using only intrinsic data? For example, when using read mapping, what criteria should I use to decide which regions need correcting? What is the best software for this purpose?

Thanks

assembly genome validation • 5.7k views
3
7.4 years ago
Leszek 4.1k

I don't know of any dedicated solution, but you can try looking at:

• fraction of reads that aligned - if many reads didn't align, your assembly is probably missing some regions
• fraction of reads with concordant pairing (e.g. from samtools flagstat) - if this is low, you likely have rearrangements or high genome fragmentation
• pairwise genome alignments (e.g. with nucmer or lastal) of your assemblies to check for large inconsistencies between them

It's always good to compare against the chromosomes of a related species to check whether your assembly makes sense.
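The first two checks in the list above can be sketched in a few lines of Python. This is only an illustration, not a replacement for samtools flagstat: it reads SAM-formatted text lines, uses the FLAG bits defined in the SAM specification, and counts each read once by skipping secondary and supplementary alignments. The function name `mapping_summary` and the six example records are made up for this sketch.

```python
# FLAG bits as defined in the SAM specification
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapping_summary(sam_lines):
    """Return the fraction of mapped reads and of properly paired reads.

    Hypothetical helper: expects an iterable of SAM text lines
    (header lines starting with '@' are ignored).
    """
    total = mapped = paired = proper = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count every read only once
        total += 1
        is_mapped = not (flag & UNMAPPED)
        if is_mapped:
            mapped += 1
        if flag & PAIRED:
            paired += 1
            if is_mapped and (flag & PROPER_PAIR):
                proper += 1
    return {
        "reads": total,
        "mapped_fraction": mapped / total if total else 0.0,
        "proper_pair_fraction": proper / paired if paired else 0.0,
    }


# Made-up records (first four SAM columns only) to show the behaviour
example = [
    "@SQ\tSN:contig1\tLN:5000",
    "r1\t99\tcontig1\t100",    # mapped, properly paired
    "r1\t147\tcontig1\t350",   # mapped, properly paired
    "r2\t77\t*\t0",            # unmapped, mate unmapped
    "r2\t141\t*\t0",           # unmapped, mate unmapped
    "r3\t65\tcontig1\t900",    # mapped, but not a proper pair
    "r3\t129\tcontig1\t4000",  # mapped, but not a proper pair
    "r1\t2147\tcontig1\t120",  # supplementary alignment, skipped
]
summary = mapping_summary(example)
print(summary)
```

On real data the input could be the text output of `samtools view` for your alignment file. What counts as a "low" fraction is a judgment call, so it is most useful to compare the numbers between your candidate assemblies rather than against a fixed threshold.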

0

Should I use corrected reads or raw ones? I mapped the raw reads to the contigs and most of them did not map...

0

I use raw reads, as modern aligners are quite good at aligning even poor-quality reads. If a lot of your reads fail to align, it doesn't necessarily mean your assembly is wrong. You can check the quality of your reads with, for example, FastQC.

6
7.4 years ago

You could try FRCbam, ALE or REAPR. All of these are supposed to evaluate assemblies without the need for a reference genome. However, my experience with them is quite limited.

0

I don't (yet) know about the other two, but FRCbam actually needs _two_ libraries: a paired-end (PE) library and a mate-pair (MP) library. It seems that the original poster only has a PE library. I don't know whether there are any "hacks" to get FRCbam to work correctly on such data.

2
7.4 years ago
lexnederbragt ★ 1.3k

First, the best assembly depends on your research question. Do you need just presence/absence of genes, or is this going to be the reference genome for a larger study?

Second, in addition to the other answers, you could do an annotation, and check which assembly seems to be more complete.