Question: How do long reads help in repeated regions of the genome?
0
7 days ago by
Ashi10
United States
Ashi10 wrote:

Hi,

I am doing some work with long reads, and I have read in many review papers about the advantages and disadvantages of long reads compared with short reads. One of the most important advantages of long reads that I have come across is that they help in the assembly of highly repeated regions, whereas short reads fail to do that.

Is there an example or an article that would help me understand (basically visualize) how long reads help in these repeated regions?

Thank you so much for all the help.

modified 6 days ago by Jorge Amigo12k • written 7 days ago by Ashi10
4
7 days ago by
Mensur Dlakic6.5k
USA
Mensur Dlakic6.5k wrote:

When individual reads are shorter than the repeats, and especially when the repeats are highly similar, it is not possible to unambiguously map the reads onto the genome. That means it is not possible to unambiguously determine the number of repeat copies.

You can do this as an exercise: make a schematic representation of 5 repeats, and place 50 random reads shorter than the repeats across that region. Let's say that this results in 8x coverage. If you did that exercise for a non-repetitive region, you'd be able to reconstruct the whole sequence without much problem. The problem with repeats is that read #1 may be from repeat #1, but you'll be able to map it to any of the 5 repeats if they are similar enough. Even reads that span two repeats will map to more than one part of the genome, again assuming that the repeats and their intervening sequences are similar enough. This will usually lead to shrinkage of the repetitive region, and we will end up with fewer than 5 repeats.
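
A rough way to try that exercise on a computer is sketched below. Everything in it is a toy assumption: the 20 bp repeat unit, the 50 bp flanks, the 15 bp read length and the helper names `find_all` and `random_dna` are made up for illustration, the repeats are perfectly identical, and "mapping" is just exact string matching rather than a real aligner.

```python
import random

def find_all(genome, read):
    """Return every 0-based position where `read` matches `genome` exactly."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)
    return hits

rng = random.Random(0)  # fixed seed so the toy sequences are reproducible

def random_dna(n):
    return "".join(rng.choices("ACGT", k=n))

unit = random_dna(20)                 # one repeat unit (toy size)
left, right = random_dna(50), random_dna(50)   # unique flanking sequence
genome = left + unit * 5 + right      # 5 identical tandem repeats

read_len = 15                         # shorter than one repeat unit
read_in_repeat = unit[:read_len]                       # lies inside a single copy
read_over_junction = (unit + unit)[10:10 + read_len]   # spans two adjacent copies
read_in_flank = left[:read_len]                        # comes from unique sequence

hits_repeat = find_all(genome, read_in_repeat)
hits_junction = find_all(genome, read_over_junction)
hits_flank = find_all(genome, read_in_flank)

print("inside a repeat:", hits_repeat)
print("across a repeat junction:", hits_junction)
print("unique flank:", hits_flank)
```

The read from the flank should have a single placement, while the two reads from the repeat array should each match at several positions, one per copy they fit into. Nothing in the read itself says which copy it came from.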

The problem is not limited to individual genomic regions. If there are repeats in two physically separate parts of the genome, the assembler will entangle them when resolving the graph nodes. That will lead to the collapse of both contigs, and the repeats may be assigned to either of them or even end up in a separate contig. That's illustrated below: red and cyan are different contigs that share a repeated part in the middle.
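
This is not the original figure, just a minimal sketch of the same idea under toy assumptions: two made-up contigs ("red" and "cyan") that share a 30 bp repeat, and a bare-bones de Bruijn graph built by the hypothetical helper `forks_and_joins`. Real assemblers are far more elaborate, but the branching they have to resolve looks like this.

```python
import random
from collections import defaultdict

def forks_and_joins(seqs, k):
    """Build a de Bruijn graph on (k-1)-mers and return its branching nodes."""
    succ, pred = defaultdict(set), defaultdict(set)
    for s in seqs:
        for i in range(len(s) - k + 1):
            a, b = s[i:i + k - 1], s[i + 1:i + k]   # prefix and suffix of one k-mer
            succ[a].add(b)
            pred[b].add(a)
    forks = [n for n, nxt in succ.items() if len(nxt) > 1]   # more than one way out
    joins = [n for n, prv in pred.items() if len(prv) > 1]   # more than one way in
    return forks, joins

rng = random.Random(1)  # fixed seed so the toy sequences are reproducible

def random_dna(n):
    return "".join(rng.choices("ACGT", k=n))

repeat = random_dna(30)                                   # shared repeat (toy size)
contig_red = random_dna(40) + repeat + random_dna(40)
contig_cyan = random_dna(40) + repeat + random_dna(40)

# k shorter than the repeat: both contigs run through the same graph nodes
short_forks, short_joins = forks_and_joins([contig_red, contig_cyan], k=21)
# k longer than the repeat: every k-mer carries some unique flanking sequence
long_forks, long_joins = forks_and_joins([contig_red, contig_cyan], k=45)

print("k=21:", len(short_joins), "join(s),", len(short_forks), "fork(s)")
print("k=45:", len(long_joins), "join(s),", len(long_forks), "fork(s)")
```

With the short k, the two contigs should merge into one path at the repeat and split again after it, so the graph alone cannot say whether the red start continues into the red or the cyan end. With a k longer than the repeat, the branching should disappear.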

Google `contig collapse repeat` and there should be plenty of material that will hopefully explain it to your liking.

Long reads overcome this problem if they are long enough to go through the whole repeated region. If a long read starts in a unique part of the genome, goes through the repeat and ends in another unique part, there is nothing ambiguous in how that read can be overlapped with others, because the whole repeated region is contained within that long read. Even when long reads can't go through the whole repeated region, they are usually better at resolving repeats as long as they catch several repeat copies that may have one or two unique mutations. That greatly reduces the number of possibilities during long-read overlap compared to short reads.
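
To make the difference concrete, here is a small sketch under toy assumptions (error-free reads, perfectly identical 20 bp repeats, made-up read lengths of 15 and 120 bp, hypothetical helper `all_reads`). It compares a region with 2 repeat copies against the same region with 5 copies and asks whether the set of possible reads can tell them apart.

```python
import random

def all_reads(genome, read_len):
    """Every distinct error-free read of length `read_len` from `genome`."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}

rng = random.Random(2)  # fixed seed so the toy sequences are reproducible

def random_dna(n):
    return "".join(rng.choices("ACGT", k=n))

unit = random_dna(20)                          # one repeat unit (toy size)
left, right = random_dna(50), random_dna(50)   # unique flanks
two_copies = left + unit * 2 + right
five_copies = left + unit * 5 + right

# Short reads: the same set of reads comes out of both versions of the region.
short_same = all_reads(two_copies, 15) == all_reads(five_copies, 15)
# Long reads: some reads span the whole repeat array plus both flanks.
long_same = all_reads(two_copies, 120) == all_reads(five_copies, 120)

# One long read anchored by 10 bp of unique sequence on each side
spanning_read = five_copies[40:160]
copies_in_read = spanning_read.count(unit)

print("short reads identical for 2 vs 5 copies:", short_same)
print("long reads identical for 2 vs 5 copies:", long_same)
print("repeat copies counted inside one spanning read:", copies_in_read)
```

With 15 bp reads, the only thing that distinguishes 2 copies from 5 is coverage depth, not read content. A single read that is anchored in unique sequence on both sides carries the copy number directly.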

Thank you for the explanation. It is really helpful.

2
6 days ago by
Jorge Amigo12k
Santiago de Compostela, Spain
Jorge Amigo12k wrote:

In short, long reads are more likely to contain the entire repeated region, so it can be read as a whole instead of being reconstructed by mapping short reads that lack enough anchoring points on both sides of the problematic region.

Short reads coming from a repeated region will map with equal probability to several places in that region. The mapping quality of those reads will therefore be 0 (or close to it), and they will typically be filtered out of downstream analysis.
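
As a back-of-the-envelope check, mapping quality in the SAM format is the Phred-scaled probability that the reported position is wrong. The sketch below assumes an idealized mapper that picks one of `n_equal_hits` equally good placements at random; the function name and the cap of 60 are assumptions for illustration. Real aligners use their own heuristics, and many simply report 0 for such reads.

```python
import math

def ideal_mapq(n_equal_hits, cap=60):
    """Phred-scaled mapping quality for a read with n equally good placements.

    Assumes one placement is chosen at random, so the chance of a wrong
    position is 1 - 1/n. `cap` is an arbitrary ceiling for unique hits.
    """
    if n_equal_hits < 1:
        raise ValueError("a mapped read needs at least one placement")
    p_wrong = 1 - 1 / n_equal_hits
    if p_wrong == 0:
        return cap                      # unique placement: no ambiguity
    return round(-10 * math.log10(p_wrong))

for n in (1, 2, 5, 10):
    print(n, "placement(s) -> MAPQ", ideal_mapq(n))
```

Even two equally good placements push the idealized value down to about 3, and it reaches 0 by ten placements, which is why reads from repeats rarely survive a typical mapping-quality filter.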