Question: reference-guided genome assembly: stretches of Ns
User 4014 wrote, 3.5 years ago:

Hi folks,

I have sequenced fungal isolates of the same species on a HiSeq 2500 (2x150 bp, 350 bp insert). These isolates show different levels of pathogenicity, so basically I want to compare the variation across their genomes. So far I have tried reference-guided genome assembly: mapping each isolate's reads to an available reference genome and calling a consensus sequence for each isolate with SAMtools. However, I get long stretches of Ns in some contigs. How can I solve this problem? I have already tried switching aligners (BWA and Bowtie2) without success.

I was also wondering whether it would help to do a de novo assembly with SPAdes first and then map the contigs back to the reference genome. I appreciate all opinions and suggestions.

Many thanks in advance and have a great day!

Antonio R. Franco (Spain, Universidad de Córdoba) wrote, 3.5 years ago:

Every time you use an assembler with short reads, you get many separate contigs rather than a single, continuous linear DNA sequence.

In other words, you get gaps between the contigs: regions of your genome that have not been sequenced or could not be assembled, so their DNA sequence remains unknown.

However, since you are assembling with a reference genome as a guide, SPAdes can recognize where your contigs and your gaps are located.

After it places the contigs in order using the information from your reference genome, it knows the distance between contigs and therefore the length of the gaps. This is reference-genome-assisted scaffolding.

The assembler then fills the gaps with Ns, because it does not know the bases in those gaps.

The Ns you see in a contig are not actually within a contig; they are within scaffolds.
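
To see how large these gaps are and where they sit, it can help to list every run of Ns in a consensus or scaffold sequence. Below is a minimal Python sketch, not part of any tool mentioned here; the function name `find_n_runs` and the `min_length` parameter are made up for illustration, and the sequence is assumed to be already loaded as a plain string.

```python
import re


def find_n_runs(sequence, min_length=1):
    """Return (start, end, length) for every run of N/n in `sequence`.

    Coordinates are 0-based and half-open, so sequence[start:end] is the run.
    `min_length` is a hypothetical filter to ignore very short runs.
    """
    runs = []
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        if length >= min_length:
            runs.append((match.start(), match.end(), length))
    return runs


if __name__ == "__main__":
    # Toy sequence, purely illustrative.
    example = "ACGTNNNNNACGTnnACGT"
    for start, end, length in find_n_runs(example, min_length=2):
        print(start, end, length)
```

The resulting coordinates can then be compared against the read coverage at the same positions in the reference.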

JC wrote, 3.5 years ago:

In general, those uncalled regions appear when you have paired-end reads and the regions between pairs are empty (no reads means no coverage), so the assembler simply reports Ns. As you suggested, you can try a de novo assembly and check whether those regions are deletions (you can also detect deletions from the read-mapping information). If they are not deletions, you may need to increase your coverage by doing more sequencing.
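
One way to check the "no reads means no coverage" idea is to collapse per-base depth into intervals with zero coverage and compare them with the N stretches. The sketch below is only an illustration: it assumes three-column, tab-separated input (reference name, 1-based position, depth) such as `samtools depth -a` produces for a single BAM file, and the function name `zero_coverage_intervals` and the `max_depth` parameter are hypothetical.

```python
def zero_coverage_intervals(depth_lines, max_depth=0):
    """Merge consecutive positions with depth <= max_depth into intervals.

    `depth_lines` is an iterable of tab-separated lines: name, position, depth
    (assumed format, 1-based positions). Returns a list of
    (name, first_pos, last_pos) tuples, both ends inclusive.
    """
    intervals = []
    current = None  # [name, first_pos, last_pos] of the interval being built
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        name, pos, depth = line.split("\t")[:3]
        pos, depth = int(pos), int(depth)
        if depth <= max_depth:
            if current and current[0] == name and pos == current[2] + 1:
                current[2] = pos  # extend the open interval
            else:
                if current:
                    intervals.append(tuple(current))
                current = [name, pos, pos]  # start a new interval
        elif current:
            intervals.append(tuple(current))
            current = None
    if current:
        intervals.append(tuple(current))
    return intervals


if __name__ == "__main__":
    # Toy depth records, purely illustrative.
    toy = ["chr1\t1\t12", "chr1\t2\t0", "chr1\t3\t0", "chr1\t4\t9"]
    print(zero_coverage_intervals(toy))
```

If the uncovered intervals line up with the N stretches and a de novo assembly spans them without the missing sequence, a deletion relative to the reference is a plausible explanation; otherwise low coverage is the more likely cause.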
