This post is a follow-up to gerrybio2010's question "How To Find Repeat Expansion Using Exome/Genome Sequencing Data?", on detecting short tandem repeats (STRs), trinucleotide repeats, and similar expansions in NGS data. Not everyone may find this interesting, but in the world of gene discovery in human neurologic disease, this is really important.
I've been talking about this with people much smarter than me for a while now, and I'm still unsure what the core of the issue is when it comes to detecting STRs and similar repeats with next-gen techniques, so I wanted to poll people's understanding here as well. Some say it's a sequence capture issue (tied to the amplification bias inherent in whole exome capture protocols; whole genome sequencing ought to alleviate this), and others say it's mainly an analysis issue, for the obvious read-length and repeat reasons. Interestingly, people in other labs I've talked to don't agree on this.
So is the difficulty detecting short tandem repeats due to:
1) Issues with these regions during library prep (for whole exome) -- i.e. poor amplification of GC-rich repeats
2) Issues with mapping - can't align repeats uniquely, and the repeats may be longer than the read length
3) A swampy combination of both of the above
4) No one really knows; this is not a low-hanging-fruit kind of problem, and hopefully someone else will really start working on it.
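To put rough numbers on the read-length point in option 2, here is a minimal sketch of when a single read can still span a repeat tract. The function names and the 10 bp unique-flank requirement are my own assumptions for illustration, not taken from any particular aligner or STR caller.

```python
def max_spannable_repeat(read_length: int, min_flank: int = 10) -> int:
    """Longest repeat tract (bp) one read can cover while keeping
    min_flank bp of unique sequence on each side (assumed requirement)."""
    return max(0, read_length - 2 * min_flank)


def can_span(read_length: int, unit_length: int, copies: int,
             min_flank: int = 10) -> bool:
    """True if a tract of `copies` repeat units fits inside one anchored read."""
    tract_length = unit_length * copies
    return tract_length <= max_spannable_repeat(read_length, min_flank)


# A 100 bp read with 10 bp flanks leaves 80 bp for the repeat itself.
print(can_span(100, 3, 20))  # 60 bp tract of a trinucleotide repeat
print(can_span(100, 3, 40))  # 120 bp tract, longer than the read
```

Under these assumptions, a trinucleotide tract beyond about 26 copies can no longer be sized from a single 100 bp read, which is where the mapping argument starts to bite.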
My impression from reading the literature is that short tandem repeat regions are, by and large, captured during library prep, and that the real issue is analyzing the sequence data and identifying the repeats. For example, see Kozlowski et al.
These blog posts are also interesting in this discussion:
So second question: Has anyone else used lobSTR on control data known to harbor expanded STRs? Were those regions detected?
And a final question: I like the idea of calling something by its absence in sequence data, perhaps to flag regions for follow-up studies in the wet lab. It should be possible to detect regions where, given good overall read depth, reads are missing, which may indicate an STR or an STR expansion. Does anyone have any experience with this sort of analysis of their data?
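The idea of calling by absence could be sketched roughly as follows. This is only an illustration under my own assumptions: it takes a plain list of per-base depths (for instance, parsed from the third column of `samtools depth` output), and the function name, the 50 bp window, and the 20%-of-median cutoff are all hypothetical choices, not an established method.

```python
from statistics import median


def flag_low_depth_windows(depths, window=50, min_fraction=0.2):
    """Return (start, end, mean_depth) for fixed-size windows whose mean
    depth is below min_fraction times the median depth of the region."""
    if not depths:
        return []
    threshold = median(depths) * min_fraction
    flagged = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        mean_depth = sum(chunk) / len(chunk)
        if mean_depth < threshold:
            flagged.append((start, start + len(chunk), mean_depth))
    return flagged


# Toy example: 40x coverage with a 50 bp dropout in the middle.
depths = [40] * 100 + [2] * 50 + [40] * 100
print(flag_low_depth_windows(depths))  # [(100, 150, 2.0)]
```

Low depth has plenty of other causes (capture design gaps, GC content, deletions), so windows flagged this way would only be candidates to cross-check against known STR locations before any wet-lab follow-up.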