Question: How do long reads help in repeated regions of the genome?
Ashi (United States) wrote, 7 days ago:

Hi,

I am working with long reads and have read in many review papers about the advantages and disadvantages of long reads compared with short reads. One of the most important advantages of long reads I have come across is that they help in assembling highly repetitive regions, whereas short reads fail to do that.

Is there an example or an article that would help me understand (basically visualize) how long reads help in these repeated regions?

Thank you so much for all the help.

Mensur Dlakic (USA) wrote, 7 days ago:

When individual reads are shorter than the repeats, and especially when the repeats are highly similar, it is not possible to map the reads onto the genome unambiguously. That means it is not possible to determine the number of repeats unambiguously.

You can do this as an exercise: make a schematic representation of 5 repeats, and place 50 random reads, each shorter than a repeat, across that region. Let's say that this results in 8x coverage. If you did that exercise for a non-repetitive region, you'd be able to reconstruct the whole sequence without much problem. The problem with repeats is that read #1 may come from repeat #1, but you'll be able to map it to any of the 5 repeats if they are similar enough. Even reads that span two repeats will map to more than one part of the genome, again assuming that the repeats and their intervening sequences are similar enough. This will usually lead to shrinkage of the repetitive region, and we will end up with fewer than 5 repeats.
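The exercise above can also be tried in code. This is only a minimal sketch under simplifying assumptions that are mine, not part of the answer: the repeats are perfectly identical and in tandem, reads are error-free, and "mapping" means exact string matching. The function names, sequence lengths, and seed are all made up for illustration.

```python
import random


def random_dna(rng, length):
    """Return a random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def find_all(genome, read):
    """Return every start position where the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


def simulate(read_len, n_reads=50, n_repeats=5, repeat_len=100,
             flank_len=200, seed=0):
    """Sample reads from a toy genome and count those with more than one placement.

    The toy genome is: unique left flank + n_repeats identical repeats + unique
    right flank (all sizes are arbitrary illustrative choices).
    """
    rng = random.Random(seed)
    repeat = random_dna(rng, repeat_len)
    genome = (random_dna(rng, flank_len)
              + repeat * n_repeats
              + random_dna(rng, flank_len))
    ambiguous = 0
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = genome[start:start + read_len]
        if len(find_all(genome, read)) > 1:
            ambiguous += 1
    return ambiguous, n_reads


# Reads shorter than one repeat versus reads longer than the whole repeat block.
short_ambiguous, total = simulate(read_len=50)
long_ambiguous, _ = simulate(read_len=700)
print(f"50 bp reads with several placements:  {short_ambiguous}/{total}")
print(f"700 bp reads with several placements: {long_ambiguous}/{total}")
```

Every 50 bp read that falls entirely inside the repeat block matches at least two positions, while a 700 bp read always reaches into unique flanking sequence in this toy genome and so has a single placement.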

The problem is not limited to individual genomic regions. If there are repeats in two physically separate parts of the genome, the assembler will entangle them when resolving the graph nodes. That will lead to the collapse of both contigs, and the repeats may be assigned to either of them or even end up in a separate contig. That's illustrated below: red and cyan are different contigs that share a repeated part in the middle.

[Image: two contigs, red and cyan, sharing a repeated part in the middle]
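For readers who prefer code to a picture, here is a toy sketch of how a shared repeat entangles two contigs in an assembly graph. It assumes a bare-bones de Bruijn graph built directly from two made-up sequences (`red` and `cyan`, named after the colors in the illustration) with a tiny k; real assemblers use much larger k and also handle sequencing errors and reverse complements, none of which is modeled here.

```python
from collections import defaultdict


def de_bruijn(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in any sequence."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def forks_and_merges(graph):
    """Return nodes with more than one successor (forks) or predecessor (merges)."""
    predecessors = defaultdict(set)
    for node, successors in graph.items():
        for nxt in successors:
            predecessors[nxt].add(node)
    forks = sorted(n for n, succ in graph.items() if len(succ) > 1)
    merges = sorted(n for n, pred in predecessors.items() if len(pred) > 1)
    return forks, merges


# Hypothetical sequences: two different contigs that share the same repeat.
repeat = "GGATCCTA"
red = "ACGT" + repeat + "CAAG"
cyan = "TTCA" + repeat + "GCGA"

print(forks_and_merges(de_bruijn([red], k=5)))        # ([], [])
print(forks_and_merges(de_bruijn([red, cyan], k=5)))  # (['CCTA'], ['GGAT'])
```

Each contig on its own is a simple path. Together, the two paths merge where the repeat begins and fork where it ends, and the graph alone no longer says which entry belongs with which exit.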

Google "contig collapse repeat" and there should be plenty of material that will hopefully explain it to your liking.

Long reads overcome this problem if they are long enough to span the whole repeated region. If a long read starts in a unique part of the genome, goes through the repeat, and ends in another unique part, there is nothing ambiguous about how that read overlaps with others, because the whole repeated region is contained within it. Even when long reads can't span the whole repeated region, they are usually better at resolving repeats as long as they cover several repeat copies that carry one or two unique mutations. That greatly reduces the number of possible overlaps compared to short reads.
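The spanning-read idea can be sketched as follows. This is an illustration under my own assumptions: error-free reads, perfectly identical repeat copies, and unique flanking "anchor" sequences that are known in advance. The function name and all sequences are hypothetical.

```python
def count_repeats_in_spanning_read(read, left_anchor, right_anchor, repeat_unit):
    """Return the repeat copy number if the read contains both unique anchors.

    Returns None when the read does not span the repeat region, or when the
    sequence between the anchors is not a whole number of repeat units.
    """
    i = read.find(left_anchor)
    j = read.find(right_anchor)
    if i == -1 or j == -1 or j < i:
        return None
    between = read[i + len(left_anchor):j]
    if len(between) % len(repeat_unit) != 0:
        return None
    copies = len(between) // len(repeat_unit)
    return copies if between == repeat_unit * copies else None


# Hypothetical toy locus: unique anchor, 5 repeat copies, unique anchor.
left_anchor, repeat_unit, right_anchor = "ACGTTGCA", "GATTACA", "TTGGCCAA"
locus = "CC" + left_anchor + repeat_unit * 5 + right_anchor + "GG"

long_read = locus             # spans the whole repeated region
short_read = repeat_unit * 2  # lies entirely inside the repeats

print(count_repeats_in_spanning_read(long_read, left_anchor, right_anchor, repeat_unit))   # 5
print(count_repeats_in_spanning_read(short_read, left_anchor, right_anchor, repeat_unit))  # None
```

The long read carries the copy number with it because both unique ends are present. The short read could have come from anywhere in the repeat block, so it says nothing about how many copies there are.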


Thank you for the explanation. It is really helpful.

Reply written 6 days ago by Ashi
Jorge Amigo (Santiago de Compostela, Spain) wrote, 6 days ago:

In short, long reads are more likely to contain the entire repeated region, so it can be read as a whole instead of being reconstructed from short reads that lack enough anchoring points on both sides of the problematic region.

Short reads coming from a repeated region will map with equal probability to several places in that region, so their mapping quality will be 0 and most downstream analyses will ignore them.
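To see why equally good placements push mapping quality toward 0, recall that mapping quality is defined as -10 log10 of the probability that the reported position is wrong, rounded to the nearest integer. The sketch below is a toy model of my own, assuming a read has n exactly equivalent placements and the aligner picks one at random. Real aligners use their own heuristics, and many simply report 0 for such reads. The function name and the cap of 60 are illustrative choices.

```python
import math


def toy_mapq(n_equal_hits, cap=60):
    """Phred-scaled mapping quality for a read with n equally good placements.

    Toy model: the chosen placement is wrong with probability 1 - 1/n.
    A unique hit has probability 0 of being wrong, so it gets the cap.
    """
    if n_equal_hits < 1:
        raise ValueError("a mapped read needs at least one placement")
    p_wrong = 1.0 - 1.0 / n_equal_hits
    if p_wrong == 0.0:
        return cap
    return min(cap, round(-10.0 * math.log10(p_wrong)))


for n in (1, 2, 5, 10):
    print(n, toy_mapq(n))
# 1 60
# 2 3
# 5 1
# 10 0
```

Even in this generous model, a read that fits 5 repeat copies equally well scores only 1, far below the thresholds commonly used to filter alignments.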


Thank you for this simple explanation. I get the full picture now.

Reply written 6 days ago by Ashi