Question

reference-guided genome assembly: stretches of Ns

7.1 years ago

User 4014 ▴ 40

Hi folks,

I have sequences of fungal isolates from the same species, produced on a HiSeq 2500 (2x150 bp with a 350 bp insert). These isolates show different levels of pathogenicity, so basically I want to compare variation in their genomes. So far I have tried reference-guided genome assembly: mapping each isolate's reads back to an available reference genome and calling a consensus sequence for each isolate with SAMtools. However, I get long stretches of Ns in some contigs. How can I solve this problem? I have already tried a couple of different aligners (BWA and Bowtie2) without success.

I was also wondering whether it would help to do a de novo assembly with SPAdes first and then map the contigs back to the reference genome. I appreciate all opinions and suggestions.

Many thanks in advance and have a great day!

Assembly genome alignment next-gen • 2.5k views

updated 7.1 years ago by JC 13k • written 7.1 years ago by User 4014 ▴ 40

score 2 · Answer 1 · 2017-04-05

Every time you use an assembler with short reads, you get many separate contigs rather than a single linear, continuous DNA sequence.

In other words, you get gaps between the contigs: places in your genome that have been neither sequenced nor assembled, and whose DNA sequence remains unknown.

However, since you are assembling with a reference genome as a guide, the assembly pipeline can recognize where your contigs and your gaps are located.

Once it has placed the contigs in order using the information from your reference genome, it knows the distance between contigs and therefore the length of the gaps. This is reference genome-assisted scaffolding.

The assembler then fills the gaps with Ns, because it does not know the bases in those gaps.

So the Ns you see "in a contig" are not actually within a contig. They are within scaffolds.
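To make the contig/scaffold distinction concrete, here is a minimal Python sketch (my own illustration, not what any particular tool does internally). It assumes you already have contigs with known, non-overlapping 0-based start positions on the reference; `build_scaffold` and `n_runs` are hypothetical helper names. The first joins the contigs with N gaps sized from the reference coordinates, the second reports where the N stretches sit in the result.

```python
import re


def build_scaffold(placements, gap_char="N"):
    """Join contigs into one scaffold, padding gaps with gap_char.

    placements: list of (start, sequence) pairs, where start is the
    0-based reference position of the contig. Assumes contigs do not
    overlap (an assumption of this sketch, not a general guarantee).
    """
    pieces = []
    cursor = None  # reference position just past the previous contig
    for start, seq in sorted(placements):
        if cursor is not None and start > cursor:
            # The gap length comes only from the reference coordinates;
            # the bases themselves are unknown, hence N.
            pieces.append(gap_char * (start - cursor))
        pieces.append(seq)
        cursor = start + len(seq)
    return "".join(pieces)


def n_runs(sequence, min_len=1):
    """Return 0-based, half-open (start, end) intervals of N stretches."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]


if __name__ == "__main__":
    placements = [(0, "ACGTAC"), (10, "GGCTA"), (18, "TTAG")]
    scaffold = build_scaffold(placements)
    print(scaffold)          # ACGTACNNNNGGCTANNNTTAG
    print(n_runs(scaffold))  # [(6, 10), (15, 18)]
```

Running `n_runs` on each sequence of your consensus FASTA gives you the coordinates of the N stretches, which you can then compare against coverage on the reference.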

score 0 · Answer 2 · 2017-04-05

In general, those uncalled regions arise when you have paired-end reads and the regions between the pairs are empty (no reads means no coverage), so the assembler just reports Ns. As you said, you can try a de novo assembly and check whether those regions are deletions (you can also check for deletions using the read mapping information). If they are not, you may need to increase your coverage by doing more sequencing.
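One way to check the "no reads means no coverage" explanation is to list the zero-coverage intervals on the reference and see whether they line up with the N stretches. The sketch below is only an illustration: it assumes tab-separated per-base depth lines (chromosome, 1-based position, depth) with every position present, which is the kind of output `samtools depth -a` writes. `zero_coverage_intervals` and its `min_depth` parameter are hypothetical names of mine.

```python
def zero_coverage_intervals(depth_lines, min_depth=1):
    """Collapse per-base depth into intervals with depth below min_depth.

    depth_lines: iterable of tab-separated 'chrom, pos, depth' lines,
    1-based positions, sorted, with all positions reported (assumed).
    Returns a list of (chrom, start, end) tuples, 1-based inclusive.
    """
    intervals = []
    current = None  # [chrom, start, end] of the open low-coverage run
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        chrom, pos, depth = line.split("\t")[:3]
        pos, depth = int(pos), int(depth)
        low = depth < min_depth
        if low and current and current[0] == chrom and current[2] + 1 == pos:
            current[2] = pos  # extend the open run by one base
        else:
            if current:
                intervals.append(tuple(current))  # close the previous run
            current = [chrom, pos, pos] if low else None
    if current:
        intervals.append(tuple(current))
    return intervals


if __name__ == "__main__":
    example = [
        "chr1\t1\t12", "chr1\t2\t0", "chr1\t3\t0",
        "chr1\t4\t9", "chr1\t5\t0", "chr2\t1\t0",
    ]
    print(zero_coverage_intervals(example))
    # [('chr1', 2, 3), ('chr1', 5, 5), ('chr2', 1, 1)]
```

This only tells you where coverage is missing, not why. Whether an interval is a real deletion or just unsequenced still has to be judged from the read pairs around it or from a de novo assembly, as described above.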