Question: Assembly vs scaffolding
sardagno0 wrote:

I can't find a clear definition that differentiates sequence assembly vs scaffolding.

From my understanding,

Assembly = joining reads into contigs

Scaffolding = joining contigs into scaffolds (using eg paired-end reads)

Does that sound right? It seems that assembly must be followed by scaffolding, but definitions of assembly don't even mention scaffolding. Can you assemble a whole genome with just "assembly"?

sequencing assembly • 2.3k views
ADD COMMENTlink modified 2.6 years ago by novice830 • written 2.6 years ago by sardagno0
3

Assembly is not exactly "joining reads into contigs", but "creating contigs from reads", which is more general. Joining implies the reads stay intact (which is sometimes true), but most modern assemblers break them into kmers first and don't actually join any reads.

ADD REPLYlink modified 8 days ago by Ram17k • written 2.6 years ago by Brian Bushnell15k
7
novice830 wrote:

Contigs are sequences of overlapping (contiguous) reads. Paired-end (or mate-pair) reads can be used to estimate the gap between two contigs. When you know the gap, you can make a scaffold, which is just the two contigs with Ns representing the gap in between.
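To make "two contigs with Ns in between" concrete, here is a minimal Python sketch. The function name `make_scaffold` and the `unknown_gap` placeholder length are assumptions for illustration, not part of any particular tool.

```python
def make_scaffold(contigs, gaps, unknown_gap=100):
    """Join ordered contigs, padding each junction with a run of N.

    contigs: contig sequences, already in scaffold order and orientation.
    gaps: one estimated gap length per junction (None if unknown).
    unknown_gap: placeholder run length used when no positive estimate
        exists (an arbitrary value chosen for this sketch).
    """
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap estimate per junction")
    parts = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        # Fall back to the placeholder for missing or non-positive estimates.
        n_count = gap if gap is not None and gap > 0 else unknown_gap
        parts.append("N" * n_count)
        parts.append(contig)
    return "".join(parts)


print(make_scaffold(["ACGTACGT", "TTGACCA"], [5]))
# ACGTACGTNNNNNTTGACCA
```

Note that no new bases appear in the output: the scaffold only records order and approximate spacing.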

EDIT: Can you assemble a whole genome with just "assembly"?

Yes. Scaffolding won't give you more information about the actual bases anyways; it just tries to tell you how your contigs are ordered.

ADD COMMENTlink modified 8 days ago by Ram17k • written 2.6 years ago by novice830

How do we know the gap between contigs?

ADD REPLYlink written 2.6 years ago by midox190
1

You estimate the gaps from the insert size distribution of your reads.
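As a rough illustration of that estimate (a sketch, not taken from any particular scaffolder): assume read 1 maps forward on contig A starting at 0-based position `start_a`, and read 2 maps in reverse on contig B with its rightmost base ending at position `end_b`. The insert then covers the tail of A, the gap, and the head of B, so the gap is whatever is left over. The function names and the use of a median across pairs are assumptions.

```python
from statistics import median


def gap_from_pair(mean_insert, len_a, start_a, end_b):
    """Gap implied by one read pair spanning contig A -> contig B.

    insert = (len_a - start_a) + gap + end_b, solved for gap.
    A negative result suggests the contigs overlap.
    """
    return mean_insert - (len_a - start_a) - end_b


def estimate_gap(mean_insert, len_a, links):
    """Median of the per-pair estimates; links is a list of (start_a, end_b)."""
    return median([gap_from_pair(mean_insert, len_a, s, e) for s, e in links])


# Three pairs linking a 1000 bp contig A to contig B, library mean insert 500 bp.
links = [(800, 150), (850, 210), (900, 240)]
print(estimate_gap(500, 1000, links))
# 150
```

The median is used here only because it is less sensitive to a few mis-mapped pairs than the mean.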

ADD REPLYlink written 2.6 years ago by Brian Bushnell15k

How do we know the insert size distribution of the reads? Thanks.

ADD REPLYlink written 2.6 years ago by midox190
2

That's not an easy question to answer. Sometimes, people assume it is the length according to the kit. For example, if you have some site-specific enzyme that's supposed to cut every 10kbp on average... then maybe you have a 10kbp library! Or, maybe not.

When possible, it's best to use mapping. If you generate contigs, then keep only the nice long contigs (>20kbp or so) and map to them, and you will get a good insert size distribution. The longer the contigs are with respect to your expected insert size, the less bias you will get, so the ">20kbp" thing actually varies. If you are scaffolding with short-insert reads of 200-400bp insert, then retaining all contigs over 1kbp would be fine.
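The mapping approach can be sketched in plain Python. This assumes you have already mapped the pairs and extracted, for each properly paired alignment, the contig name and the signed template length (the TLEN column of a SAM record); the function name, the input layout, and the choice of median plus median absolute deviation as summary statistics are all assumptions for illustration.

```python
from statistics import median


def insert_size_stats(pairs, contig_lengths, min_contig_len=20_000):
    """Summarise insert sizes using only pairs mapped to long contigs.

    pairs: iterable of (contig_name, template_length); the length may be
        negative for the reverse mate, so the absolute value is taken.
    contig_lengths: dict mapping contig name to its length in bp.
    Returns (median, median absolute deviation).
    """
    sizes = [
        abs(tlen)
        for contig, tlen in pairs
        if contig_lengths[contig] >= min_contig_len and tlen != 0
    ]
    if not sizes:
        raise ValueError("no pairs mapped to contigs above the length cutoff")
    med = median(sizes)
    mad = median([abs(s - med) for s in sizes])
    return med, mad


contig_lengths = {"c1": 50_000, "c2": 3_000}
pairs = [("c1", 310), ("c1", -290), ("c1", 300), ("c2", 120),
         ("c1", 330), ("c1", 305)]
print(insert_size_stats(pairs, contig_lengths))
# (305, 5)
```

The pair on the short contig `c2` is ignored, which is the bias-avoidance step described above; for a 200-400bp library you would lower `min_contig_len` to around 1000.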

But, what if all your contigs are shorter than your expected insert size? Then... who knows. Try mapping to a related species with a reference, perhaps. BBMerge has a kmer-based mode for merging nonoverlapping read pairs via assembly, which can be used for inferring insert sizes. It's more forgiving than assembly because it ignores some classes of branches. But, I've never tried it with really long inserts (>4kbp) and would not expect it to work all that well.

ADD REPLYlink written 2.6 years ago by Brian Bushnell15k

How do you do scaffolding in this case? From what you say, it is difficult to know the insert size? Thanks.

ADD REPLYlink written 2.6 years ago by midox190
1

It's not difficult, but it is data-dependent. What kind of library are you trying to use for scaffolding, and what is the length distribution of your contigs?

ADD REPLYlink written 2.6 years ago by Brian Bushnell15k

In my case, I have contigs and I do not know their exact lengths, but their average length is maybe about 300kbp (or more). I want to scaffold these contigs. How do I do that in this case? Thanks.

ADD REPLYlink written 2.6 years ago by midox190

...and what kind of read libraries do you have?

ADD REPLYlink written 2.6 years ago by Brian Bushnell15k

What does "read libraries" mean? Sorry.

ADD REPLYlink written 2.6 years ago by midox190
1

A read library is a set of reads processed together (in the laboratory). To describe a library, you need to state:

What kind of input genetic material was used, what platform was your data sequenced on, how long are the reads, what is the expected insert size, how were they fragmented, what chemistry was used, etc. If you are not sure, then ask whoever sequenced the DNA; you have to know this before processing the data.

ADD REPLYlink written 2.6 years ago by Brian Bushnell15k

From what I know, I have short reads sequenced on Illumina with 300bp length, and I have long reads.

ADD REPLYlink written 2.5 years ago by midox190
1

Ok... so, pick a program that does scaffolding, like SSPACE. Map your reads to your contigs to get the insert size distribution, or whatever it requires as an input. Then run the program according to its instructions (I've never used it, personally).

ADD REPLYlink written 2.5 years ago by Brian Bushnell15k

Yes, thanks. But I want to write a scaffolder myself. That is why I want to know how scaffolding is done, because the papers are not clear about it. That's why I need help. Thank you.

ADD REPLYlink written 2.5 years ago by midox190
1

Basically... a scaffolding program constructs a graph in which contigs are the nodes and read pairs are the edges; two contigs A and B are joined by an edge if one read maps to A and the other read maps to B. The processing determines which edges are real, and which are spurious. Once that is known, it is simple to condense the nodes into linear scaffolds.

This discussion ignores issues like sequencing errors and repeated sequences which make scaffolding difficult.
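The graph idea above can be sketched in a few lines of Python. This is a deliberately simplified toy, not how any specific scaffolder works: "real" edges are approximated by a minimum read-pair count (`min_support`, an assumed threshold), contigs with more than two surviving neighbours are simply cut loose instead of being resolved, and orientation and gap sizes are ignored.

```python
from collections import Counter


def build_scaffolds(contigs, pair_links, min_support=2):
    """Order contigs into linear scaffolds from read-pair links.

    contigs: list of contig names (the graph nodes).
    pair_links: one (contig_of_read1, contig_of_read2) tuple per read pair.
    """
    # Count the pairs supporting each undirected contig-contig edge.
    support = Counter(
        tuple(sorted(link)) for link in pair_links if link[0] != link[1]
    )
    neighbors = {c: set() for c in contigs}
    for (a, b), count in support.items():
        if count >= min_support:  # weakly supported edges are treated as spurious
            neighbors[a].add(b)
            neighbors[b].add(a)
    # A linear scaffold allows at most two neighbours per contig;
    # branching contigs (for example repeats) are disconnected here.
    for c in contigs:
        if len(neighbors[c]) > 2:
            for n in neighbors[c]:
                neighbors[n].discard(c)
            neighbors[c] = set()
    # Walk each path from an end node; leftover nodes lie on cycles.
    scaffolds, seen = [], set()
    starts = [c for c in contigs if len(neighbors[c]) <= 1] + list(contigs)
    for start in starts:
        if start in seen:
            continue
        path, prev, cur = [], None, start
        while cur is not None and cur not in seen:
            seen.add(cur)
            path.append(cur)
            nxt = sorted(n for n in neighbors[cur] if n != prev and n not in seen)
            prev, cur = cur, (nxt[0] if nxt else None)
        scaffolds.append(path)
    return scaffolds


links = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("A", "D")]
print(build_scaffolds(["A", "B", "C", "D"], links))
# [['A', 'B', 'C'], ['D']]
```

The single pair linking A and D falls below the threshold, so D stays an unplaced singleton.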

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Brian Bushnell15k

"two contigs A and B are joined by an edge if one read maps to A and the other maps to B." Here, what is "the other"? The other read of the pair?

And do we not use the mate-pair reads?

Thanks

ADD REPLYlink modified 8 days ago by Ram17k • written 2.5 years ago by midox190
1

The other read in the pair (I've clarified my response above). I'm not sure what your second question means, but this is how the mate pairs are used.

ADD REPLYlink written 2.5 years ago by Brian Bushnell15k

Yes, thank you for your clarification.

When you say "a scaffolding program constructs a graph in which contigs are the nodes and read pairs are the edges; two contigs A and B are joined by an edge if one read maps to A and the other read maps to B. The processing determines which edges are real, and which are spurious":

Here we use just paired-end reads?

Sorry, I'm fuzzy on the scaffolding process.

ADD REPLYlink modified 8 days ago by Ram17k • written 2.5 years ago by midox190

"Here we use just paired-end reads?"

Yes, that's correct.

ADD REPLYlink written 2.5 years ago by Brian Bushnell15k