Question: Analysis Of Short Tandem Repeats/Trinucleotides In NGS Data -- Revisited
5.4 years ago by Alex Paciorkowski (Rochester, NY USA):

This post is a follow-up to the question asked by gerrybio2010, "How To Find Repeat Expansion Using Exome/Genome Sequencing Data?", on detecting short tandem repeats (STRs), trinucleotide repeats, and similar features in NGS data. Not everyone may find this interesting, but in the world of gene discovery in human neurologic disease, this is really important.

I've been talking about this with people much smarter than me for a while now, and I'm still unsure what the core of the issue is when it comes to detecting STRs by next-gen techniques, so I wanted to poll people's understanding here as well. Some say it's a sequence capture issue, tied to the amplification bias inherent in whole exome capture protocols, which whole genome sequencing ought to alleviate. Others say it's mainly an analysis issue, for the obvious reasons of read length and repetitive sequence. Interestingly, people in the other labs I've talked to don't agree on this.

So is the difficulty detecting short tandem repeats due to:

1) Issues with these regions during library prep (for whole exome) -- i.e. poor amplification of GC-rich repeats

2) Issues with mapping - can't align repeats uniquely, and the repeats may be longer than the read length

3) A swampy combination of both of the above

4) No one really knows, this is not a low-hanging-fruit kind of a problem, hopefully someone else will really start working on it.
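To make option 2 concrete, here is a minimal sketch of the read-length problem: a read can only size a repeat if it carries unique flanking sequence on both sides of it. The function names and the `min_flank` cutoff are hypothetical, chosen for illustration only, and this is not how any particular STR caller works.

```python
def longest_unit_run(seq, unit):
    """Return (start, copies) of the longest back-to-back run of `unit` in `seq`."""
    best_start, best_copies = -1, 0
    k = len(unit)
    for i in range(len(seq) - k + 1):
        copies = 0
        j = i
        # Count consecutive copies of the unit starting at position i.
        while seq[j:j + k] == unit:
            copies += 1
            j += k
        if copies > best_copies:
            best_start, best_copies = i, copies
    return best_start, best_copies


def read_spans_repeat(read, unit, min_flank=5):
    """Return (spanned, copies): whether the read has at least `min_flank`
    non-repeat bases on both sides of its longest run of `unit`.
    `min_flank` is an illustrative threshold, not a recommended value."""
    start, copies = longest_unit_run(read, unit)
    if copies == 0:
        return False, 0
    end = start + copies * len(unit)
    spanned = start >= min_flank and len(read) - end >= min_flank
    return spanned, copies


# A read with flanks on both sides can size the repeat ...
anchored = "ACGTACGT" + "CAG" * 5 + "TTGACCAT"
# ... while a read that starts inside the repeat only gives a lower bound.
unanchored = "CAG" * 10 + "TT"
print(read_spans_repeat(anchored, "CAG"))
print(read_spans_repeat(unanchored, "CAG"))
```

Once an expansion is longer than the read length minus both flanks, no single read can span it, so every read looks like the second case regardless of how well the region was captured.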

My impression reading the literature is that short tandem repeat regions are by-and-large captured during library prep, but the issue is one of analysis of the sequence data and identifying them. For example, see Kozlowski et al.

These blog posts are also interesting in this discussion:

So second question: Has anyone else used lobSTR on control data known to harbor expanded STRs? Were those regions detected?

And a final question: I like the idea of calling something by its absence in sequence data -- perhaps to flag regions for follow-up studies in the wet lab. Given good overall read depth, it should be possible to detect regions where reads are missing, which may indicate an STR or an STR expansion. Does anyone have any experience with this sort of analysis of their data?
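A minimal sketch of what I mean by calling by absence, assuming per-base depth is already available as a plain list of integers (for example, parsed from the output of a tool such as `samtools depth`). The function name, window size, and 20% cutoff are hypothetical choices for illustration, not validated values.

```python
from statistics import median


def find_depth_dropouts(depth, window=10, max_fraction=0.2):
    """Return half-open (start, end) intervals where the mean depth of a
    sliding window is at most `max_fraction` of the median depth."""
    if len(depth) < window:
        return []
    baseline = median(depth)
    if baseline == 0:
        # No usable coverage at all, so "missing reads" is meaningless here.
        return []
    threshold = baseline * max_fraction
    intervals = []
    window_sum = sum(depth[:window])
    for start in range(len(depth) - window + 1):
        if start > 0:
            # Slide the window one base to the right.
            window_sum += depth[start + window - 1] - depth[start - 1]
        if window_sum / window <= threshold:
            end = start + window
            if intervals and start <= intervals[-1][1]:
                # Overlaps or touches the previous flagged window: merge.
                intervals[-1] = (intervals[-1][0], end)
            else:
                intervals.append((start, end))
    return intervals


# Toy example: well-covered sequence with a 30-base dropout in the middle.
toy_depth = [40] * 50 + [2] * 30 + [40] * 50
print(find_depth_dropouts(toy_depth))
```

A dropout flagged this way is only a candidate: poor capture, low mappability, or a deletion would produce the same signal, so each region would still need follow-up in the wet lab.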


In my experience of analysing large viral genomes of 1.8Kb ;-) it has mainly been an issue with assembly. The di-nucleotide and tri-nucleotide repeat regions are usually longer than the read length, so the assembly programs, whether de novo or reference-based, cannot bridge the repeat region. We usually resort to designing primers on either side of the repeat and Sanger sequencing across it.

written 4.8 years ago by Joseph Hughes

It's an interesting question. My sense is that, in the case of exome-based detection, the capture itself can be an issue because the probes are designed from the reference AND, in the case of HD, the repeats are very GC-rich. If the capture is poor (as the blogs you cite suggest), the detection signals (similar to those used for SV detection) will be poor as well. Are you aware of any published exome or genome datasets from patients with any of these disorders?

written 5.4 years ago by Aaronquinlan