Is there a way to estimate how large a gap might be after 454 pyrosequencing followed by assembly with Newbler? I have several closed reference genomes, and I know that my read length is about 400 bp with about 40x coverage. Given that coverage, I am fairly confident that these gaps are due to repeat regions.
Edit: I guess this might be answerable by knowing the repeat regions in a genome. How would I identify repeat regions from the sequence alone? If I knew the length L of a repeat region, I would estimate the gap as gapLength = L - (2 * 400), since a read anchored in unique sequence on either side can reach at most about 400 bp into the repeat.
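To make that estimate concrete, here is a minimal sketch of the formula. The function name `estimate_gap_length` and the clamp to zero for repeats shorter than two read lengths are my own assumptions, and the repeat lengths are made-up inputs, not measurements.

```python
def estimate_gap_length(repeat_length, read_length=400):
    """Rough gap estimate for a repeat of the given length (in bp).

    Assumes reads anchored in the unique flanks reach at most one read
    length into the repeat from each side, so the part no anchored read
    can cover is repeat_length - 2 * read_length. Repeats shorter than
    that are assumed to be spanned, so the estimate is clamped at 0.
    """
    return max(0, repeat_length - 2 * read_length)


# Hypothetical repeat lengths, for illustration only.
long_repeat_gap = estimate_gap_length(5000)   # 5000 - 800
short_repeat_gap = estimate_gap_length(600)   # shorter than 2 reads
```

This is an upper-end estimate: it ignores the overlap an assembler needs to place a read in the flank, which would make the real gap somewhat larger than the reach of a full read suggests.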
I think that even with 40x coverage, you're not guaranteed to have reads covering every region, and the theoretical models don't work so well in practice. I don't know any exact numbers for this (and it probably varies from run to run and lab to lab), but coverage tends to be uneven, and there could be features of the sequence that make some parts rare or unsequenceable. It's well known that you get duplicated clones (the same clone on multiple beads), which is one form of unevenness.
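For reference, this is roughly what the idealized model says. The sketch below uses the simplified Lander-Waterman result (uniformly random reads, no minimum overlap required), and the 5 Mb genome size is an assumption for illustration, not a number from the question.

```python
import math


def lander_waterman(genome_size, read_length, coverage):
    """Idealized shotgun statistics, assuming uniformly random reads.

    Returns (probability a given base is uncovered, expected number of gaps).
    The minimum-overlap term of the full model is ignored here.
    """
    n_reads = coverage * genome_size / read_length
    p_uncovered = math.exp(-coverage)
    expected_gaps = n_reads * math.exp(-coverage)
    return p_uncovered, expected_gaps


# Assumed 5 Mb genome; 400 bp reads and 40x coverage as in the question.
p_uncovered, expected_gaps = lander_waterman(5_000_000, 400, 40)
print(f"P(base uncovered) ~ {p_uncovered:.2e}")
print(f"Expected gaps     ~ {expected_gaps:.2e}")
```

Under these assumptions the model predicts essentially zero gaps, so any gaps you actually see come from things the model leaves out: repeats and uneven coverage.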
I assume you have shotgun reads only? For Newbler assemblies, you can actually find the repeats among the contigs by looking at the per-contig read depth. With apologies for the self-promotion, here is a paper describing just that: http://www.hindawi.com/journals/seq/2010/782465.html. Contigs with higher-than-normal read depth are collapsed repeats, and the depth is proportional to the copy number.
This will at least tell you which contigs are the repeats. Looking at the 454ContigGraph file could tell you which contigs are the 'neighbours' of the repeats.
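A minimal sketch of the read-depth idea, assuming you have already pulled contig names, lengths, and average depths out of the Newbler output into a list. The contig values below are invented, and the median baseline and the 1.5x cutoff are my own choices, not something taken from the paper.

```python
from statistics import median


def flag_collapsed_repeats(contigs, min_ratio=1.5):
    """Flag contigs whose depth is well above the typical depth.

    contigs: list of (name, length_bp, mean_read_depth) tuples.
    Returns (name, length_bp, estimated_copy_number) for each contig whose
    depth is at least min_ratio times the median depth. The copy number is
    the depth ratio rounded to the nearest integer.
    """
    baseline = median(depth for _, _, depth in contigs)
    flagged = []
    for name, length, depth in contigs:
        ratio = depth / baseline
        if ratio >= min_ratio:
            flagged.append((name, length, round(ratio)))
    return flagged


# Hypothetical contigs, for illustration only.
example_contigs = [
    ("contig00001", 50000, 40.0),
    ("contig00002", 30000, 38.0),
    ("contig00003", 1500, 121.0),
    ("contig00004", 20000, 42.0),
    ("contig00005", 5400, 79.0),
]
repeats = flag_collapsed_repeats(example_contigs)
```

The unweighted median is only a reasonable baseline when most contigs are single-copy; in an assembly with many short repeat contigs, a length-weighted baseline would probably be safer.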
Is it a eukaryotic or bacterial genome?