Question

genome assembly - size of gaps

2.9 years ago

MH85 ▴ 10

Hi everyone,

INTRO: I have a genome assembly obtained from linked reads (10X Genomics) plus ONT long reads. The initial assembly was done in Supernova; gap filling and scaffolding were done in PBJelly. The work was outsourced, and the company did not provide any information about the gaps.

PROBLEM: The gaps (runs of Ns) in the assembly range in size from 10 to 100,000 Ns. For the submission I need to state how the gap sizes were estimated, and which length stands for a gap of unknown size.

QUESTIONS:

Is there a general procedure or rule for estimating gap sizes in this kind of assembly?
I am especially wondering about gaps with round lengths such as 10, 100, 5000, 100000, etc. What do these stand for? Do they represent known sizes, or are they placeholders for gaps of unknown size?

NOTE: Asking the company is not the best way, as this analysis was outsourced three years ago and communication with the company is not very smooth these days.

Thanks a lot in advance, Milos

PBJelly assembly_genome linked_reads gaps • 1.6k views

updated 2.9 years ago by lieven.sterck 15k • written 2.9 years ago by MH85 ▴ 10

Where are you submitting to? ENA? NCBI?

Most commonly, the standard is to use a stretch of 100 Ns for gaps of unknown size, and the estimated size for gaps whose size can be estimated.

Gap size estimates are usually derived from paired-end/mate-pair read data (or read length) ...
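
To make the paired-read idea concrete, here is a minimal sketch of the arithmetic. It assumes a library with a known mean insert (fragment) size and a set of read pairs in which one read maps to the contig left of the gap and its mate to the contig right of it. The function name and the input layout are hypothetical, and this is not a description of what Supernova or PBJelly do internally.

```python
from statistics import median


def estimate_gap_size(spanning_pairs, mean_insert_size):
    """Estimate a gap size from read pairs that span the gap.

    spanning_pairs: list of (left_part, right_part) tuples, where
        left_part  = bases of the fragment lying on the left contig
                     (fragment start to the end of the left contig)
        right_part = bases of the fragment lying on the right contig
                     (start of the right contig to the fragment end)
    mean_insert_size: assumed mean fragment length of the library.

    Each fragment satisfies: insert = left_part + gap + right_part,
    so each pair gives one estimate: gap = insert - left_part - right_part.
    """
    estimates = [mean_insert_size - left - right for left, right in spanning_pairs]
    if not estimates:
        return None  # no spanning pairs: the gap size is unknown
    # The median is less sensitive to chimeric or mis-mapped pairs than the mean.
    return max(0, round(median(estimates)))


# Hypothetical numbers for illustration only.
pairs = [(1200, 800), (1000, 1100), (900, 1000)]
print(estimate_gap_size(pairs, mean_insert_size=3000))
```

A gap with no spanning pairs returns `None` here, which corresponds to the case where a fixed placeholder length would be used instead of an estimate.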

2.9 years ago by lieven.sterck 15k

Thank you.

I am submitting it to DDBJ. Generally, I understand that gaps of unknown size should be 100 Ns, and I guess the company providing the NGS services should be aware of that as well. But does that mean that all other gaps are of known size? Difficult to say, right? For example, there are 15,366 gaps of 10 Ns and 131 gaps of 5,000 Ns in the assembly. Isn't there a general rule for how gap sizes are handled in PBJelly?
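
For reference, counts like these can be reproduced with a short script. The following is a minimal sketch that tallies how often each gap length occurs, assuming the assembly is a plain FASTA file and that every run of N (or n) is a gap; the function names are illustrative only.

```python
import io
import re
from collections import Counter


def iter_fasta(handle):
    """Yield (name, sequence) for each record, joining wrapped sequence lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gap_length_counts(handle):
    """Return a Counter mapping gap length (run of N/n) to number of gaps."""
    counts = Counter()
    for _, seq in iter_fasta(handle):
        for run in re.finditer(r"[Nn]+", seq):
            counts[len(run.group())] += 1
    return counts


# Tiny in-memory example; for a real file, pass open("assembly.fasta") instead.
demo = io.StringIO(">scaffold1\nACGT" + "N" * 10 + "ACGT\n>scaffold2\nGGNN\nNNCC\n")
for length, n_gaps in sorted(gap_length_counts(demo).items()):
    print(length, n_gaps, sep="\t")
```

Sequence lines are joined before searching, so a gap that is wrapped across two FASTA lines is counted once with its full length. A table like this makes it easy to see which lengths are suspiciously frequent and round, and are therefore worth asking about.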

2.9 years ago by MH85 ▴ 10

I don't know the specifics of PBJelly, but what is key to report are the gaps introduced by scaffolding your sequences. There, the gap sizes are usually actual estimates. Those gaps of 10 Ns are likely within contigs and are less of an issue (the gap estimate is more accurate and has only a small effect on the overall genome structure/size).

2.9 years ago by lieven.sterck 15k

score 2 · Accepted Answer · 2021-06-08

2.9 years ago

Mensur Dlakic ★ 27k

NOTE: Asking the company is not the best way, as this analysis was outsourced three years ago and communication with the company is not very smooth these days.

Asking the company is the best way. Even if you have to wait a week, a month, that information is still likely to be the most reliable. It is likely that they know what procedure was used three years ago. In fact, there is a good chance that they are still using the same procedure.

The company answered my emails and clarified the issue. Simply put, gap locations and sizes were estimated during the assembly and scaffolding steps in the Supernova assembler. Thus, all gaps should be of known size.

Thank you both for your answers.

2.9 years ago by MH85 ▴ 10

That is not 100% correct, but if it works then it's OK.

2.9 years ago by lieven.sterck 15k