Question

Assembly vs scaffolding

2

8.2 years ago

sardagno ▴ 20

I can't find a clear definition that differentiates sequence assembly vs scaffolding.

From my understanding,

Assembly = joining reads into contigs

Scaffolding = joining contigs into scaffolds (using, e.g., paired-end reads)

Does that sound right? It seems that assembly must be followed by scaffolding, but definitions of assembly don't even mention scaffolding. Can you assemble a whole genome with just "assembly"?

sequencing Assembly • 6.8k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by sardagno ▴ 20

3

Assembly is not exactly "joining reads into contigs", but "creating contigs from reads", which is more general. Joining implies the reads are kept intact (which is sometimes true), but most modern assemblers break them into kmers first and don't actually join any reads.
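To make the "break them into kmers" step concrete, here is a minimal Python sketch. The function name `kmers`, the toy reads, and the choice of k=4 are illustrative assumptions only; real assemblers use much larger k and far more compact data structures.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Two toy reads (made up for illustration) that overlap by "GTAC".
reads = ["ACGTAC", "GTACGG"]
k = 4

# The assembler works on the pooled kmers, not on the intact reads.
kmer_set = {km for read in reads for km in kmers(read, k)}
```

Note that the shared kmer "GTAC" appears once in the pooled set, which is how overlap between reads shows up without any two reads ever being joined directly.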

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 8.2 years ago by Brian Bushnell 20k

Ram · Accepted Answer · 2016-02-25

7

8.2 years ago

novice ★ 1.1k

Contigs are sequences of overlapping (contiguous) reads. Paired-end (or mate-pair) reads can be used to estimate the gap between two contigs. When you know the gap, you can make a scaffold, which is just the two contigs with Ns representing the gap in between.
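As a rough illustration of "two contigs with Ns in between", here is a small Python sketch. The function name `make_scaffold` and the toy sequences are assumptions for the example; it also assumes both contigs are already in the same orientation and that the gap estimate is not negative.

```python
def make_scaffold(contig_a: str, contig_b: str, gap: int) -> str:
    """Join two contigs with a run of Ns standing in for the unknown bases."""
    if gap < 0:
        # A negative estimate would mean the contigs overlap; not handled here.
        raise ValueError("gap must be non-negative in this sketch")
    return contig_a + "N" * gap + contig_b


scaffold = make_scaffold("ACGT", "TTGA", 5)
```

The Ns only record the estimated distance, not the sequence itself.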

EDIT: Can you assemble a whole genome with just "assembly"?

Yes. Scaffolding won't give you more information about the actual bases anyway; it just tries to tell you how your contigs are ordered.

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 8.2 years ago by novice ★ 1.1k

0

How do we know the gap between contigs?

ADD REPLY • link 8.2 years ago by midox ▴ 290

1

You estimate them based on the insert size distribution of your reads.
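A simplified Python sketch of that estimate, under stated assumptions: contigs A and B are in the same orientation with A first, one read of each pair maps forward on A starting at `start_a`, its mate maps on B with its far end at `end_b` (both 0-based offsets from the start of their contig), and the library has a single typical insert size. All names and numbers here are hypothetical, and real scaffolders also correct for biases that this ignores.

```python
from statistics import median


def gap_from_pair(insert_size: int, len_a: int, start_a: int, end_b: int) -> int:
    """Gap implied by one pair: the insert minus the parts lying on A and on B."""
    bases_on_a = len_a - start_a  # from the read's start to the end of contig A
    bases_on_b = end_b            # from the start of contig B to the mate's far end
    return insert_size - bases_on_a - bases_on_b


def estimate_gap(pairs, insert_size: int, len_a: int):
    """Median of the per-pair estimates, to damp the effect of outlier pairs."""
    return median(gap_from_pair(insert_size, len_a, s, e) for s, e in pairs)


# Toy data: 500 bp inserts, contig A is 1000 bp long, three linking pairs.
linking_pairs = [(800, 150), (850, 180), (900, 260)]
gap = estimate_gap(linking_pairs, insert_size=500, len_a=1000)
```

Each pair gives a slightly different answer because real inserts vary in length, which is why the insert size distribution matters and not just its mean.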

ADD REPLY • link 8.2 years ago by Brian Bushnell 20k

0

How do we know the insert size distribution of the reads? Thanks

ADD REPLY • link 8.2 years ago by midox ▴ 290

2

That's not an easy question to answer. Sometimes, people assume it is the length according to the kit. For example, if you have some site-specific enzyme that's supposed to cut every 10kbp on average... then maybe you have a 10kbp library! Or, maybe not.

When possible, it's best to use mapping. If you generate contigs, then keep only the nice long contigs (>20kbp or so) and map to them, and you will get a good insert size distribution. The longer the contigs are with respect to your expected insert size, the less bias you will get, so the ">20kbp" thing actually varies. If you are scaffolding with short-insert reads of 200-400bp insert, then retaining all contigs over 1kbp would be fine.
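The mapping approach could be sketched in Python roughly as follows. This assumes the mapping has already been done and that each properly mapped pair has been reduced to a hypothetical tuple `(contig, fragment_start, fragment_end)`; the function name, the tuple layout, and the toy numbers are all assumptions for illustration.

```python
from statistics import median, pstdev


def insert_size_stats(pairs, contig_lengths, min_contig_len=20_000):
    """Median and standard deviation of insert sizes, using long contigs only."""
    sizes = [
        end - start
        for contig, start, end in pairs
        if contig_lengths[contig] >= min_contig_len  # short contigs bias the estimate
    ]
    if not sizes:
        raise ValueError("no pairs on sufficiently long contigs")
    return median(sizes), pstdev(sizes)


# Toy data: one long contig and one short contig that gets filtered out.
contig_lengths = {"c1": 50_000, "c2": 800}
pairs = [("c1", 100, 400), ("c1", 5000, 5320), ("c1", 9000, 9310), ("c2", 10, 700)]
med, sd = insert_size_stats(pairs, contig_lengths)
```

The `min_contig_len` threshold is the ">20kbp or so" cutoff from above; for a short-insert library it could be lowered to around 1kbp.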

But, what if all your contigs are shorter than your expected insert size? Then... who knows. Try mapping to a related species with a reference, perhaps. BBMerge has a kmer-based mode for merging nonoverlapping read pairs via assembly, which can be used for inferring insert sizes. It's more forgiving than assembly because it ignores some classes of branches. But, I've never tried it with really long inserts (>4kbp) and would not expect it to work all that well.

ADD REPLY • link 8.2 years ago by Brian Bushnell 20k

0

How do you do scaffolding in this case? From what you say, is it difficult to know the insert size? Thanks

ADD REPLY • link 8.2 years ago by midox ▴ 290

1

It's not difficult, but it is data-dependent. What kind of library are you trying to use for scaffolding, and what is the length distribution of your contigs?

ADD REPLY • link 8.2 years ago by Brian Bushnell 20k

0

In my case, I have contigs. I don't know their exact lengths, but their average length is maybe about 300kbp (or more). I want to scaffold these contigs. How do I do that in this case? Thanks

ADD REPLY • link 8.2 years ago by midox ▴ 290

0

...and what kind of read libraries do you have?

ADD REPLY • link 8.2 years ago by Brian Bushnell 20k

0

What does "read libraries" mean? Sorry

ADD REPLY • link 8.2 years ago by midox ▴ 290

1

A read library is a set of reads processed together (in the laboratory). To describe a library, you need to state:

What kind of input genetic material was used, what platform was your data sequenced on, how long are the reads, what is the expected insert size, how were they fragmented, what chemistry was used, etc. If you are not sure, then ask whoever sequenced the DNA; you have to know this before processing the data.

ADD REPLY • link 8.2 years ago by Brian Bushnell 20k

0

From what I know, I have short reads sequenced on Illumina, 300bp in length, and I have long reads.

ADD REPLY • link 8.2 years ago by midox ▴ 290

1

Ok... so, pick a program that does scaffolding, like SSPACE. Map your reads to your contigs to get the insert size distribution, or whatever it requires as an input. Then run the program according to its instructions (I've never used it, personally).

ADD REPLY • link 8.1 years ago by Brian Bushnell 20k

0

Yes, thanks. But I want to write a scaffolder myself. That is why I want to know how scaffolding is done; the papers are not clear on it. That's why I need help. Thank you

ADD REPLY • link 8.1 years ago by midox ▴ 290

1

Basically... a scaffolding program constructs a graph in which contigs are the nodes and read pairs are the edges; two contigs A and B are joined by an edge if one read maps to A and the other read maps to B. The processing determines which edges are real, and which are spurious. Once that is known, it is simple to condense the nodes into linear scaffolds.
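A toy Python sketch of that graph idea, with heavy simplifications: it ignores contig orientation and gap sizes, treats an edge as "real" only if at least `min_support` read pairs back it (a stand-in for whatever a real scaffolder does), and only follows unbranched paths. The function name, the input format (one `(contig_of_read1, contig_of_read2)` tuple per mapped pair), and the threshold are assumptions, not any particular tool's method.

```python
from collections import Counter, defaultdict


def build_scaffolds(contigs, pair_hits, min_support=2):
    """Order contigs into linear scaffolds from read-pair links (toy version)."""
    # Edges: count read pairs whose two reads map to different contigs.
    support = Counter()
    for a, b in pair_hits:
        if a != b:
            support[frozenset((a, b))] += 1

    # Keep well-supported edges; weakly supported ones are treated as spurious.
    neighbours = defaultdict(set)
    for edge, count in support.items():
        if count >= min_support:
            a, b = tuple(edge)
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Condense: walk from each path end (0 or 1 neighbours) while unambiguous.
    scaffolds, seen = [], set()
    for start in contigs:
        if start in seen or len(neighbours[start]) > 1:
            continue
        path, cur = [], start
        while cur is not None:
            path.append(cur)
            seen.add(cur)
            unvisited = [n for n in neighbours[cur] if n not in seen]
            cur = unvisited[0] if len(unvisited) == 1 else None
        scaffolds.append(path)

    # Contigs stuck in cycles or tangles were never reached; report them alone.
    scaffolds.extend([c] for c in contigs if c not in seen)
    return scaffolds


contigs = ["A", "B", "C", "D"]
pair_hits = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("C", "B"), ("A", "D")]
result = build_scaffolds(contigs, pair_hits)
```

Here the single pair linking A and D falls below the support threshold, so D stays on its own while A, B and C are chained together.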

This discussion ignores issues like sequencing errors and repeated sequences which make scaffolding difficult.

ADD REPLY • link 8.1 years ago by Brian Bushnell 20k

0

"two contigs A and B are joined by an edge if one read maps to A and the other maps to B." What is "the other" here? The other read of the pair?

And don't we use the mate-pair reads?

Thanks

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 8.1 years ago by midox ▴ 290

1

The other read in the pair (I've clarified my response above). I'm not sure what your second question means, but this is how the mate pairs are used.

ADD REPLY • link 8.1 years ago by Brian Bushnell 20k

0

Yes, thank you for your clarification.

When you say "a scaffolding program constructs a graph in which contigs are the nodes and read pairs are the edges; two contigs A and B are joined by an edge if one read maps to A and the other read maps to B. The processing determines which edges are real, and which are spurious":

Here we use just paired-end reads?

Sorry, I'm fuzzy on the scaffolding process.

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 8.1 years ago by midox ▴ 290

0

"Here we use just paired-end reads?"

Yes, that's correct.

ADD REPLY • link 8.1 years ago by Brian Bushnell 20k