Question

How does an assembler handle a 100 kb duplication?

7.9 years ago

biotech ▴ 570

I think I have a 100kb repeat in a genome. I want to prove it, but I don't know whether an assembly of Illumina data will merge the duplicated region into a single copy.

Assembly • 2.1k views

ADD COMMENT • link 7.9 years ago by biotech ▴ 570

What do you have right now? Only reads?

See the papers:

HGA: de novo genome assembly method for bacterial genomes using high coverage short sequencing reads

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2515-7

Evaluation and validation of de novo and hybrid assembly techniques to derive high-quality genome sequences

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4173024/

ADD REPLY • link 7.9 years ago by natasha.sernova ★ 4.0k

I detected it by mapping reads back to the assembled draft genome.

ADD REPLY • link 7.9 years ago by biotech ▴ 570

score 0 · Answer 1 · 2016-06-08

7.9 years ago

harold.smith.tarheel ★ 4.9k

In theory, a Moleculo-type library with Illumina sequencing could distinguish the two repeats. In practice, many users have had limited success with this approach.

Personally, I would recommend PacBio sequencing for an unambiguous answer. There are also a host of new (e.g., optical mapping) and old (Southern blotting) techniques better suited to your task than short-read sequencing.

ADD COMMENT • link 7.9 years ago by harold.smith.tarheel ★ 4.9k

If it's a 100kb repeat, it's still longer than PacBio reads, so it will still be ambiguous if there are more than 2 copies. You need reads longer than the repeated region.
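To make the read-length argument concrete, here is a minimal sketch of the "does a read span the repeat" test. The function names and the 500-base anchor are illustrative assumptions, not values from any particular aligner or assembler; the point is only that a read must cover the whole copy plus some unique flank on both sides.

```python
def min_spanning_read_length(repeat_len, anchor=500):
    """Shortest read that covers one repeat copy plus `anchor` unique
    bases on each side (the anchor size is an arbitrary assumption)."""
    return repeat_len + 2 * anchor


def read_spans_repeat(read_start, read_end, repeat_start, repeat_end, anchor=500):
    """True if the read covers the entire repeat copy and at least
    `anchor` flanking bases on both sides. Coordinates are 0-based,
    end-exclusive, on the same reference."""
    return (read_start <= repeat_start - anchor
            and read_end >= repeat_end + anchor)


# A 100 kb copy needs a read of at least 101 kb under these assumptions.
print(min_spanning_read_length(100_000))                  # 101000
# A 20 kb read starting upstream of the copy cannot span it.
print(read_spans_repeat(0, 20_000, 5_000, 105_000))       # False
```

Reads shorter than this can still be informative at the boundaries between unique sequence and the repeat, but they cannot tie both flanks of one copy together in a single read.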

But yes, there are a lot of alternatives that are more reliable than Illumina sequencing for this application.

ADD REPLY • link 7.9 years ago by igor 13k

True, but the OP posited a single repeat and asked about assembly-based detection. For the case he cited, PacBio would work.

ADD REPLY • link 7.9 years ago by harold.smith.tarheel ★ 4.9k

The question didn't say single, just that it was a repeat. I assumed "a repeat" meant an unknown number of copies. If there were just one extra copy, I would call it a duplication. Regardless, it was unclear to me.

ADD REPLY • link 7.9 years ago by igor 13k

I understood "a 100kb repeat" as singular: a (meaning one) repeat of 100kb. The sentence about "duplicated region" also (to me) implied one repeat. But I can see how it might be interpreted differently.

BTW, I would also call it a duplication to minimize confusion.

ADD REPLY • link 7.9 years ago by harold.smith.tarheel ★ 4.9k

It's a single repeat. Duplication.

ADD REPLY • link 7.9 years ago by biotech ▴ 570

That may be so, but the copies may still carry sequence variation or internal rearrangements that will only become evident once you locate and investigate them.

In one of the comments above you say that this is a draft genome, so how can you be sure that there are indeed 2 copies? Is the draft reasonably "finished" (a single contig or a small number of contigs)?

ADD REPLY • link 7.9 years ago by GenoMax 141k

score 0 · Answer 2 · 2016-06-08

7.9 years ago

igor 13k

The assembly will merge it into one. Then, when you align the reads back to the assembled genome, that region will have twice as much coverage. You can use one of many CNV analysis tools to detect that.
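As a rough illustration of the coverage signal (not a substitute for a real CNV caller), the sketch below bins per-base depth into windows, uses the median window as the single-copy baseline, and reports runs of windows whose depth is at least 1.5x that baseline. The input is assumed to be a plain list of per-base depths for one contig (for example, the depth column of `samtools depth -a` output); the window size, the 1.5 threshold, and the function names are arbitrary choices for the example.

```python
from statistics import median


def window_means(depths, window):
    """Mean depth of each full, non-overlapping window."""
    return [
        sum(depths[i:i + window]) / window
        for i in range(0, len(depths) - window + 1, window)
    ]


def elevated_regions(depths, window=1000, min_ratio=1.5):
    """Return (start, end, mean_ratio) for each run of consecutive windows
    whose mean depth is >= min_ratio times the median window depth.
    Coordinates are 0-based, end-exclusive."""
    means = window_means(depths, window)
    if not means:
        return []
    baseline = median(means)  # assumes most of the contig is single-copy
    if baseline == 0:
        return []
    regions = []
    run_start, run_ratios = None, []
    for idx, mean in enumerate(means + [0.0]):  # sentinel closes a trailing run
        ratio = mean / baseline
        if ratio >= min_ratio:
            if run_start is None:
                run_start = idx
            run_ratios.append(ratio)
        elif run_start is not None:
            regions.append((run_start * window, idx * window,
                            sum(run_ratios) / len(run_ratios)))
            run_start, run_ratios = None, []
    return regions


# Toy contig: 30x everywhere except a 2 kb stretch at 60x.
toy = [30] * 5000 + [60] * 2000 + [30] * 5000
print(elevated_regions(toy))  # [(5000, 7000, 2.0)]
```

A ratio near 2 over roughly 100 kb is what a collapsed two-copy region would look like in a haploid genome; real data will be noisier (GC bias, mapping artefacts), which is why dedicated CNV tools are the better choice for an actual call.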

See this paper for a CNV analysis overview: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4394692/

Some previous discussion: Running 1.5M potentially different generalized linear models depending on distribution of read depth information to study CNV

ADD COMMENT • link 7.9 years ago by igor 13k

"The assembly will merge it into one."

Only if the repeat is perfect, i.e. the two copies are identical. For a region this large, that seems unlikely. Looking for twice as much coverage across the stretches the copies have in common may still work.

I wonder whether some old-fashioned combination restriction digests would work, if the OP can find enzymes that cut sparingly in the region, to first prove that a repeat is indeed present.
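To show the idea behind the digest suggestion, here is a toy in-silico digest. It assumes a linear molecule, places each cut at the first base of the recognition site (a simplification; real enzymes cut at a defined position within or near the site), and uses made-up sequences. With an enzyme that cuts in the flanks but not inside the region, a tandem duplication shows up as a spanning fragment that grows by the length of the extra copy.

```python
def digest(seq, site):
    """Fragment lengths from cutting a linear sequence at every occurrence
    of `site`. Simplification: the cut is placed at the first base of the site."""
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        if pos > 0:  # a site at position 0 would give an empty fragment
            cuts.append(pos)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


SITE = "GAATTC"                      # EcoRI recognition sequence
left = "A" * 30 + SITE + "A" * 20    # hypothetical flank with one site
right = "A" * 20 + SITE + "A" * 30   # hypothetical flank with one site
region = "T" * 40                    # stand-in for the repeat unit, no site inside

single = left + region + right
tandem = left + region + region + right

print(digest(single, SITE))  # [30, 86, 36]
print(digest(tandem, SITE))  # [30, 126, 36]
```

The middle fragment grows from 86 to 126 bases, exactly the 40-base toy repeat unit. A dispersed (non-tandem) second copy would instead add new fragments at its insertion site, so the pattern also hints at where the copy sits.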

ADD REPLY • link 7.9 years ago by GenoMax 141k

Good point. It's hard to say how perfect the repeat is based on the information given.

ADD REPLY • link 7.9 years ago by igor 13k

Your response is spot on, @igor. I detected it by doing the coverage analysis. What I want to find out now is where the repeat is located in the genome and how it arose.

ADD REPLY • link 7.9 years ago by biotech ▴ 570

If you are able to do some additional sequencing, then running a SMRT Cell (or two) of PacBio long-insert libraries would help (unless the two copies are adjacent). Spanning reads will hopefully allow you to discriminate between the two copies.

ADD REPLY • link 7.9 years ago by GenoMax 141k