genome assembly - size of gaps
2.9 years ago
MH85 ▴ 10

Hi everyone,

INTRO: I have a genome assembly obtained from linked reads (10X Genomics) + ONT long reads. The initial assembly was done in Supernova; gap filling and scaffolding were done in PBJelly. The work was outsourced, and the company did not provide any information about the gaps.

PROBLEM: The size of the gaps (runs of Ns) in the assembly ranges from 10 to 100,000. For the submission, I need to specify how the gap sizes were estimated and which length stands for an unknown gap size.

QUESTIONS:

  1. Is there a general procedure or rule for estimating gap sizes in this kind of assembly?
  2. I am especially wondering about gaps with round sizes such as 10, 100, 5000, 100000, etc. What do these stand for? Do they represent known sizes, or do they stand in for unknown gap sizes?

NOTE: Asking the company is not the best way, as this analysis was outsourced three years ago and the company does not communicate very smoothly these days.

Thanks a lot in advance, Milos

PBJelly assembly_genome linked_reads gaps • 1.6k views

where are you submitting to? ENA? NCBI?

the most common convention is to use a stretch of 100 Ns for gaps of unknown size, and the actual estimated size for the gaps that can be estimated.

gap size estimates are usually done with the use of paired-end/mate pair read data (or read length) ...
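As a rough illustration of that convention, here is a minimal Python sketch that lists every run of Ns in a sequence and gives it a provisional label. The names `find_gaps`, `Gap` and `UNKNOWN_GAP_LENGTH` are made up for this example, and the rule "exactly 100 Ns means unknown size" is only an assumption about how the assembly was built; it has to be confirmed against what the assembler or scaffolder actually did.

```python
import re
from typing import Iterator, NamedTuple

# Assumption: a run of exactly 100 Ns marks a gap of unknown size.
UNKNOWN_GAP_LENGTH = 100


class Gap(NamedTuple):
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    length: int
    size_status: str  # "unknown" or "estimated" (provisional label only)


def find_gaps(sequence: str) -> Iterator[Gap]:
    """Yield every run of N/n in a sequence with a provisional size label."""
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        status = "unknown" if length == UNKNOWN_GAP_LENGTH else "estimated"
        yield Gap(match.start() + 1, match.end(), length, status)


if __name__ == "__main__":
    # Toy scaffold: one 100 N gap and one 10 N gap.
    toy = "ACGT" + "N" * 100 + "ACGT" + "N" * 10 + "AC"
    for gap in find_gaps(toy):
        print(gap)
```

The coordinates are 1-based and inclusive, which is the form gap features are usually reported in; the label is a starting point for checking, not evidence of how a gap was sized.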


Thank you.

I am submitting it to DDBJ. Generally, I understand that gaps of unknown size should be 100 Ns. I guess that a company providing NGS services should be aware of that as well. But does that mean that all other gaps are of known size? Difficult to say, right?! For example, there are 15,366 gaps of 10 Ns and 131 gaps of 5,000 Ns in the assembly. Isn't there any general rule for how gap size is treated in PBJelly?
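Counts like these can be reproduced with a short script. The following is a minimal sketch, assuming a plain multi-FASTA input; `read_fasta` and `gap_length_histogram` are hypothetical helper names, not part of Supernova or PBJelly. Each record is joined into one string first, so runs of Ns that wrap across line breaks are counted once.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gap_length_histogram(lines: Iterable[str]) -> Counter:
    """Count runs of N/n by length across all records."""
    counts: Counter = Counter()
    for _name, seq in read_fasta(lines):
        counts.update(len(m.group()) for m in re.finditer(r"[Nn]+", seq))
    return counts


if __name__ == "__main__":
    # Toy input; for a real assembly, pass an open file handle instead.
    toy = [">scaffold_1", "ACGT" + "N" * 5, "N" * 5 + "ACGT", ">scaffold_2", "AC" + "N" * 100 + "GT"]
    for length, count in sorted(gap_length_histogram(toy).items()):
        print(f"{length}\t{count}")
```

A histogram with sharp peaks at round lengths (10, 100, 5000, ...) next to a spread of irregular lengths would suggest that the round values are placeholders, but the script alone cannot tell which are which.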


I don't know the specifics of PBJelly, but what is key to report are the gaps introduced by scaffolding your sequences. There, the gap sizes are usually actual estimates. The gaps of 10 Ns are likely within contigs, and those are less of an issue (the gap estimate is more accurate and has only a small effect on the overall genome structure/size).
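To look at the two kinds of gaps separately, one option is to break each scaffold only at long runs of Ns and leave short runs inside the resulting pieces. This is a minimal sketch; `split_at_gaps` and the `min_gap` threshold are assumptions chosen for illustration, and the right cutoff depends on how the assembler wrote its gaps.

```python
import re
from typing import List


def split_at_gaps(sequence: str, min_gap: int = 100) -> List[str]:
    """Split a scaffold at runs of N/n that are at least min_gap long.

    Shorter runs of N are kept inside the returned pieces.
    """
    if min_gap < 1:
        raise ValueError("min_gap must be at least 1")
    pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    # Drop empty pieces that arise when a scaffold starts or ends with a gap.
    return [piece for piece in pattern.split(sequence) if piece]


if __name__ == "__main__":
    toy = "AAAA" + "N" * 10 + "CCCC" + "N" * 5000 + "GGGG"
    for piece in split_at_gaps(toy, min_gap=100):
        print(len(piece), piece[:20])
```

With `min_gap=100` the toy scaffold is cut only at the 5,000 N gap, so the 10 N run stays inside the first piece; lowering `min_gap` to 10 cuts at both.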

2.9 years ago
Mensur Dlakic ★ 27k

NOTE: Asking the company is not the best way, as this analysis was outsourced three years ago and the company does not communicate very smoothly these days.

Asking the company is the best way. Even if you have to wait a week or a month, that information is still likely to be the most reliable. They probably know what procedure was used three years ago. In fact, there is a good chance that they are still using the same procedure.


The company answered my emails and clarified the issue. Simply put, gap locations and sizes were estimated during the assembly and scaffolding steps in the Supernova assembler. Thus, all gaps should be of known size.

Thank you both for your answers.


that is not 100% correct, but if it works then it's OK
