Question: How does an assembler handle a 100 kb duplication?
4.5 years ago by
biotech540
United States
biotech540 wrote:

I think I have a 100 kb repeat in a genome. I want to prove it, but I don't know whether an assembly of Illumina data will merge the duplicated region into a single copy.

assembly • 1.2k views
ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by biotech540

What do you have right now? Only reads?

See the papers:

HGA: de novo genome assembly method for bacterial genomes using high coverage short sequencing reads

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2515-7

Evaluation and validation of de novo and hybrid assembly techniques to derive high-quality genome sequences

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4173024/

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by natasha.sernova3.8k

I detected it by mapping the reads back to the assembled draft genome.

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by biotech540
4.5 years ago by
United States
harold.smith.tarheel4.6k wrote:

In theory, a Moleculo-type library with Illumina sequencing could distinguish the two repeats. In practice, many users have had limited success with this approach.

Personally, I would recommend PacBio sequencing for an unambiguous answer. There are also a host of new (e.g., optical mapping) and old (Southern blotting) techniques better suited to your task than short-read sequencing.

ADD COMMENTlink written 4.5 years ago by harold.smith.tarheel4.6k

If it's a 100 kb repeat, it's still longer than PacBio reads, so it will still be ambiguous if there are more than 2 copies. You need reads longer than the region.

But yes, there are a lot of alternatives that are more reliable than Illumina sequencing for this application.

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by igor11k

True, but the OP posited a single repeat and asked about assembly-based detection. For the case he cited, PacBio would work.

ADD REPLYlink written 4.5 years ago by harold.smith.tarheel4.6k

The question didn't say single, just that it was a repeat. I assumed a repeat meant an unknown number of copies. If it were just one, I would call it a duplication. Regardless, it was unclear to me.

ADD REPLYlink written 4.5 years ago by igor11k

I understood "a 100kb repeat" as singular: a (meaning one) repeat of 100kb. The sentence about "duplicated region" also (to me) implied one repeat. But I can see how it might be interpreted differently.

BTW, I would also call it a duplication to minimize confusion.

ADD REPLYlink written 4.5 years ago by harold.smith.tarheel4.6k

It's a single repeat. Duplication.

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by biotech540

That may be so, but the copies may still have sequence variation or internal rearrangements that will only become evident after you locate and investigate them.

In one of the comments above you say that this is a draft genome, so how can you be sure that there are indeed 2 copies? Is the draft reasonably "finished" (a single contig or a small number of contigs)?

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by GenoMax92k
4.5 years ago by
igor11k
United States
igor11k wrote:

The assembly will merge it into one. Then, when you align the reads back to the assembled genome, that region will have twice as much coverage. You can use one of many CNV analysis tools to detect that.

See this paper for a CNV analysis overview: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4394692/

Some previous discussion: Running 1.5M potentially different generalized linear models depending on distribution of read depth information to study CNV
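
As a rough illustration of the coverage check described above (not a replacement for a dedicated CNV tool), here is a minimal Python sketch. It assumes per-base depth in a three-column, tab-separated layout (contig, 1-based position, depth), such as the output of `samtools depth -a`. The function names, the 1 kb window, and the 1.5x cutoff are hypothetical choices for this example. It also assumes the duplicated region covers well under half of the genome, so that the median window depth reflects single-copy sequence.

```python
from collections import defaultdict
from statistics import median


def window_depths(lines, window=1000):
    """Mean depth per (contig, window index) from 'contig pos depth' lines."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        contig, pos, depth = line.split()[:3]
        key = (contig, (int(pos) - 1) // window)  # positions are 1-based
        sums[key] += int(depth)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def elevated_regions(lines, window=1000, min_ratio=1.5):
    """Merge adjacent windows whose mean depth is >= min_ratio x the median window depth."""
    means = window_depths(lines, window)
    if not means:
        return []
    baseline = median(means.values())  # assumed to represent single-copy depth
    if baseline == 0:
        return []
    regions = []
    for contig, idx in sorted(means):
        if means[(contig, idx)] / baseline < min_ratio:
            continue
        start, end = idx * window + 1, (idx + 1) * window
        # Extend the previous region if this window directly follows it.
        if regions and regions[-1][0] == contig and regions[-1][2] == start - 1:
            regions[-1] = (contig, regions[-1][1], end)
        else:
            regions.append((contig, start, end))
    return regions
```

For real data you would pass an open file handle, e.g. `elevated_regions(open("depth.txt"))`, where `depth.txt` is a hypothetical file name. A collapsed duplication should show up as a long run of windows at roughly twice the baseline depth.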

ADD COMMENTlink written 4.5 years ago by igor11k

The assembly will merge it into one.

Only if the repeat is perfect. For a region as large as this, that seems unlikely. Looking for doubled coverage over the stretches that the two copies share may still work.

I wonder if some old-fashioned combination restriction digests might work, if the OP can find enzymes that cut sparingly in the region, to first prove that a repeat is indeed present.

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by GenoMax92k

Good point. It's hard to say how perfect the repeat is based on the information given.

ADD REPLYlink written 4.5 years ago by igor11k

Your answer is right on target, @igor. I detected it by doing the coverage analysis. What I want to find out now is where the repeat is located in the genome and how it arose.

ADD REPLYlink written 4.5 years ago by biotech540

If you are able to do some additional sequencing, then running a SMRT Cell (or two) of PacBio long-insert libraries would help (unless the two copies are adjacent). Spanning reads should hopefully allow you to discriminate between the two copies.
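
To make "spanning reads" concrete, here is a small Python sketch that sorts long-read alignments by how they relate to the high-coverage region. It assumes you already have 1-based, inclusive alignment coordinates on the draft contig. The function name, the category labels, and the 500 bp minimum flank are hypothetical choices for this example, not part of any tool.

```python
from collections import Counter


def classify_read(read_start, read_end, region_start, region_end, min_flank=500):
    """Classify one alignment relative to a candidate repeat region."""
    # Read starts at least min_flank bp before the region (unique left flank).
    left_anchored = read_start <= region_start - min_flank
    # Read ends at least min_flank bp after the region (unique right flank).
    right_anchored = read_end >= region_end + min_flank
    if left_anchored and right_anchored:
        return "spanning"  # covers the whole region plus both flanks
    if left_anchored and read_end >= region_start + min_flank:
        return "left_junction"  # crosses only the left boundary
    if right_anchored and read_start <= region_end - min_flank:
        return "right_junction"  # crosses only the right boundary
    return "other"


def tally(alignments, region_start, region_end, min_flank=500):
    """Count alignments per category; alignments is an iterable of (start, end)."""
    return Counter(
        classify_read(start, end, region_start, region_end, min_flank)
        for start, end in alignments
    )
```

With a 100 kb region, fully spanning reads need to be longer than 100 kb, which is the limitation raised earlier in the thread. Junction reads that run from unique flanking sequence into the repeat are much easier to obtain and may still help place the second copy.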

ADD REPLYlink written 4.5 years ago by GenoMax92k