What Does The Community Think Of "Sequence-Specific Error Profile Of Illumina Sequencers"
12.9 years ago
Nick Loman ▴ 610

I just finished reading "Sequence-specific error profile of Illumina sequencers" and I found it extremely interesting. It suggests that Illumina data suffers from quite serious systematic errors which may even affect SNP calls. I believe this will come as a surprise to the sequencing community. We are comfortable with 454 data having systematic homopolymeric tract problems, and with Illumina having high GC% issues, but I don't believe this particular issue has been described before.

I need to read it a few more times before I summarise my reaction; I will reply to my own question when I have.

http://nar.oxfordjournals.org/content/early/2011/05/14/nar.gkr344.full

What do you think about this paper? Do you believe there truly are such systematic issues with Illumina data? Is there any other explanation for the observed results?

illumina error next-gen sequencing • 8.5k views

Good topic. Community wiki?


Yes. See also related post.

12.9 years ago

I was recently at the Cold Spring Harbor Biology of Genomes conference. This topic was presented by Meromit Singer (UC Berkeley). She talked about how this seemed to be unique to Illumina reads. Coincidentally, an Illumina representative was in the audience and responded that the company knows about this, that reads from other systems could show something similar, and, importantly, that Illumina will have a fix in the chemistry behind this problem in a short time.

So, yes, after hearing the talk, I do believe that there are such systematic errors in Illumina data. I feel they are small, based on the CSHL talk I heard. No concrete alternative explanation was given.

(I don't have such data to analyze first-hand and so my opinion comes from what I observed at CSHL last week.)

12.9 years ago
Nick Loman ▴ 610

Bastien Chevreux just pointed me to the following resource!

http://chevreux.org/GGCxG_problem.html

So it seems at least this part of it is not new ...

12.9 years ago

Illumina is pushing an update of their chemistry that could improve these issues. Elliott Margulies (NHGRI) had the chance to analyse results from human samples sequenced with this update, and the results he presented at the Genomics of Rare Diseases meeting in Hinxton looked much better. The update basically gets rid of the GC-bias problems in mapping coverage, which I am guessing is at least partially because there are fewer GGCxG issues.

12.9 years ago
Bach ▴ 550

Nice paper and it was about time that this appeared in a respected journal.

However, I wonder whether the use of Google or other search engines has fallen into disgrace with both authors and reviewers. Try searching any of the following terms on Google, Bing, Yahoo, ...

"solexa ggc" or "illumina ggc" or "illumina ggc motif" (all of these even without quotes)

If it's not the top hit itself, then it's in the top 5: a link to either a discussion on GGC or GGCxG motif on the SeqAnswers board in 2009 or, even better, a direct link to the chapter on Illumina sequence assembly in the MIRA documentation on SourceForge (see here) which talks about exactly these issues (complete with screenshots on assemblies affected by this).

And that's been documented since 2009/2010 and MIRA has parameters turning on routines which minimise the impact of these things on SNP calling.
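For anyone who wants to check their own reads or reference for the motif discussed here, a minimal sketch follows. It only locates GGCxG occurrences (with x being any base) on either strand; it does not reproduce what MIRA does internally, and the function names are made up for illustration.

```python
import re

# Lookahead so that overlapping occurrences are all reported.
GGCXG = re.compile(r"(?=(GGC[ACGT]G))")

def reverse_complement(seq):
    """Reverse-complement an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_ggcxg(seq):
    """Return 0-based start positions of GGCxG in seq (given strand only)."""
    return [m.start() for m in GGCXG.finditer(seq.upper())]

if __name__ == "__main__":
    read = "AAGGCTGAA"  # toy sequence, not real data
    print("forward strand:", find_ggcxg(read))
    print("reverse strand:", find_ggcxg(reverse_complement(read)))
```

Positions from the reverse strand are relative to the reverse-complemented sequence, so they need converting back before being compared with forward-strand coordinates.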

Now, if someone publishes a paper on how the data from Illumina between Q3 2009 and the advent of TruSeq kits showed a strong bias in coverage which is dependent on GC content, again without even acknowledging the MIRA documentation, I'll start to weep in the corner.

12.9 years ago
Mary 11k

Actually, I'm mildly suspicious of all of them. I remember a talk at ASHG a couple of years back that included an analysis of several platforms, and the concordance among them was much less than we would like to see. Can't find my notes on that right now. Will keep looking.

So when I'm using any of the data I like to see multiple occurrences of a SNP via different projects, for example. I'm not suggesting I'd dismiss novel SNPs out of hand. But I wouldn't bet the rest of my career on one without double-checking.

The technology will continue to get better and more reliable--and certainly has since I heard that talk. But some of the early stuff I'm particularly wary of.

8.2 years ago
Sunguk • 0

In my study, the results were slightly different.

You may be interested in my study.

Shin, Sunguk, and Joonhong Park. "Characterization of sequence-specific errors in various next-generation sequencing systems." Molecular BioSystems (2016).

The major finding was that high GC content and G or C homopolymers induce substitution errors on Illumina platforms.

As simple supporting evidence, I often observe a higher GC content in reads after Q20 filtering than after Q30 filtering (~0.3-0.5% higher).
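A rough sketch of how such a comparison could be made is below. It assumes the filter keeps reads whose mean Phred score meets the threshold and that qualities are Phred+33 encoded; the post does not say which filtering rule was actually used, and the records are toy data, so treat all of this as illustrative.

```python
def mean_phred(qual, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def gc_fraction(seqs):
    """Fraction of G and C bases across all sequences."""
    total = sum(len(s) for s in seqs)
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in seqs)
    return gc / total if total else 0.0

def gc_after_filter(records, min_mean_q):
    """GC fraction of the reads whose mean quality is >= min_mean_q."""
    kept = [seq for seq, qual in records if qual and mean_phred(qual) >= min_mean_q]
    return gc_fraction(kept)

if __name__ == "__main__":
    # Toy (sequence, quality) pairs, not real data.
    records = [("GGCC", "5555"), ("ATAT", "IIII")]
    print("GC after Q20:", gc_after_filter(records, 20))
    print("GC after Q30:", gc_after_filter(records, 30))
```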

In an environmental sample, removing reads containing G/C homopolymers (>9 bp) increased the average contig length and the quality of the SNPs.
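The filtering step can be sketched roughly as follows. This assumes ">9 bp" means a run of 10 or more identical G bases or 10 or more identical C bases; the helper names and the `max_run` parameter are hypothetical and are not taken from the paper.

```python
import re

def has_long_gc_homopolymer(seq, max_run=9):
    """True if seq contains a run of G or a run of C longer than max_run."""
    n = max_run + 1
    pattern = "G{%d,}|C{%d,}" % (n, n)
    return re.search(pattern, seq.upper()) is not None

def drop_homopolymer_reads(reads, max_run=9):
    """Keep only the reads without a long G or C homopolymer."""
    return [r for r in reads if not has_long_gc_homopolymer(r, max_run)]

if __name__ == "__main__":
    # Toy reads, not real data.
    reads = ["ACGT" * 5, "A" + "G" * 10 + "T", "TT" + "C" * 12]
    print(drop_homopolymer_reads(reads))
```

Mixed runs such as GGGGGCCCCC are kept under this reading, since neither base alone exceeds 9 bp.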

If you want to read the pdf file, send an E-mail to bestfa@naver.com
