How does an assembler handle a 100kb duplication?
5.2 years ago
biotech ▴ 540

I think I have a 100kb repeat in a genome. I want to prove it, but I don't know whether an assembly of Illumina data will merge the duplicated region into a single copy.

Assembly • 1.3k views

What do you have right now? Only reads?

See the papers:

HGA: de novo genome assembly method for bacterial genomes using high coverage short sequencing reads

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2515-7

Evaluation and validation of de novo and hybrid assembly techniques to derive high-quality genome sequences

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4173024/

I detected it by mapping reads back to the assembled draft genome.

5.2 years ago

In theory, a Moleculo-type library with Illumina sequencing could distinguish the two repeats. In practice, many users have had limited success with this approach.

Personally, I would recommend PacBio sequencing for an unambiguous answer. There are also a host of new (e.g., optical mapping) and old (Southern blotting) techniques better suited to your task than short-read sequencing.

If it's a 100kb repeat, it's still longer than typical PacBio reads, so it will remain ambiguous if there are more than two copies. To resolve it, you need reads longer than the repeated region.

But yes, there are a lot of alternatives that are more reliable than Illumina sequencing for this application.
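To make the "reads longer than the region" condition concrete, here is a minimal sketch. It assumes 0-based coordinates on a reference, and the 500-base flanking `anchor` is an arbitrary illustrative value, not a recommendation. Both function names are hypothetical.

```python
def min_spanning_read_length(repeat_len, anchor=500):
    """Shortest read that covers one repeat copy plus `anchor` unique bases on each side.

    `anchor` is an assumed value for illustration only.
    """
    return repeat_len + 2 * anchor


def read_spans_repeat(read_start, read_end, repeat_start, repeat_end, anchor=500):
    """Return True if the read covers the whole repeat copy and `anchor`
    bases of flanking sequence on both sides."""
    return read_start <= repeat_start - anchor and read_end >= repeat_end + anchor


if __name__ == "__main__":
    # A 100kb repeat copy located at 5,000..105,000 on a contig.
    print(min_spanning_read_length(100_000))
    print(read_spans_repeat(0, 20_000, 5_000, 105_000))       # read ends inside the repeat
    print(read_spans_repeat(4_000, 106_000, 5_000, 105_000))  # read covers repeat plus flanks
```

A read that starts or ends inside the repeat can be placed in either copy, which is why only fully spanning reads settle the question.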

True, but the OP posited a single repeat and asked about assembly-based detection. For the case he cited, PacBio would work.

The question didn't say single, just that it was a repeat. I assumed "a repeat" meant an unknown number of copies. If it were just one, I would call it a duplication. Regardless, it was unclear to me.

I understood "a 100kb repeat" as singular: a (meaning one) repeat of 100kb. The sentence about "duplicated region" also (to me) implied one repeat. But I can see how it might be interpreted differently.

BTW, I would also call it a duplication to minimize confusion.

It's a single repeat. Duplication.

That may be so, but the two copies may still differ in sequence or contain internal rearrangements, which will only become evident once you locate and investigate them.

In one of the comments above you say that this is a draft genome, so how can you be sure that there are indeed 2 copies? Is the draft reasonably "finished" (a single contig or a small number of contigs)?

5.2 years ago
igor 12k

The assembly will merge it into one. Then, when you align the reads back to the assembled genome, that region will have roughly twice the coverage of the rest of the genome. You can use one of the many CNV analysis tools to detect that.

See this paper for a CNV analysis overview: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4394692/

Some previous discussion: Running 1.5M potentially different generalized linear models depending on distribution of read depth information to study CNV
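As a rough illustration of the coverage idea in this answer (not a replacement for a dedicated CNV tool), the sketch below flags fixed-size windows whose mean depth is about twice the median window depth. It assumes you already have a per-base depth list for one contig, for instance the third column of `samtools depth -a` output. The function name, window size, and ratio bounds are all assumptions chosen for illustration.

```python
from statistics import median


def find_elevated_windows(depth, window=1000, ratio_min=1.5, ratio_max=2.5):
    """Return (start, end, ratio) for non-overlapping windows whose mean depth
    is roughly 2x the median of all window means.

    depth     : list of per-base read depths for one contig
    window    : window size in bases (assumed value)
    ratio_min : lower bound on depth ratio to report (assumed value)
    ratio_max : upper bound on depth ratio to report (assumed value)
    """
    means = []
    for start in range(0, len(depth) - window + 1, window):
        chunk = depth[start:start + window]
        means.append((start, sum(chunk) / window))
    if not means:
        return []
    baseline = median(m for _, m in means)
    if baseline == 0:
        return []
    return [
        (start, start + window, m / baseline)
        for start, m in means
        if ratio_min <= m / baseline <= ratio_max
    ]


if __name__ == "__main__":
    # Toy contig: 30x everywhere except a 2kb stretch at 60x.
    toy_depth = [30] * 4000 + [60] * 2000 + [30] * 4000
    for start, end, ratio in find_elevated_windows(toy_depth):
        print(start, end, round(ratio, 2))
```

On real data the depth is noisy and biased by GC content and mappability, which is part of what the CNV tools mentioned above correct for.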

"The assembly will merge it into one."

Only if the repeat is perfect. For a region this large, that seems unlikely. Looking for twice the coverage over the stretches the two copies share may still work.

I wonder whether some old-fashioned combined restriction digests might work, if the OP can find enzymes that cut sparingly in the region, to first prove that a repeat is indeed present.
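The "only if the repeat is perfect" point can be illustrated with k-mers, since de Bruijn graph assemblers work on k-mers rather than whole reads. The toy sketch below (hypothetical function names, made-up 12-base sequences, k=4) shows that two identical copies share every k-mer, while a single substitution gives each copy its own k-mers that an assembler could in principle use to keep the copies apart.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def copy_specific_kmers(copy_a, copy_b, k):
    """Return (k-mers only in copy_a, k-mers only in copy_b)."""
    a, b = kmers(copy_a, k), kmers(copy_b, k)
    return a - b, b - a


if __name__ == "__main__":
    copy_a = "AAACCCGGGTTT"
    copy_b = "AAACCCTGGTTT"  # same as copy_a with one G->T substitution

    # Perfect repeat: no k-mer distinguishes the copies, so they collapse.
    print(copy_specific_kmers(copy_a, copy_a, 4))

    # Imperfect repeat: every k-mer overlapping the substitution is copy-specific.
    only_a, only_b = copy_specific_kmers(copy_a, copy_b, 4)
    print(sorted(only_a))
    print(sorted(only_b))
```

Whether a real assembler keeps such copies separate or still merges them depends on how it treats small differences, so this is only a sketch of the principle.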

Good point. It's hard to say how perfect the repeat is based on the information given.

Your answer matches my case, @igor: I detected it through coverage analysis. What I want to find out now is where the repeat is located in the genome and how it arose.

If you are able to do some additional sequencing, then running a SMRT Cell (or two) of a PacBio long-insert library would help (unless the two copies are adjacent). Reads that span from unique flanking sequence into the repeat will hopefully allow you to discriminate between the two copies.
