In quantitative transcriptome analysis, we often add synthetic RNA to the reaction for quality control and normalisation, and the External RNA Controls Consortium (ERCC) spikes are a popular choice, available commercially from Invitrogen (now Thermo Fisher).  The spike sequences are available from the vendor, or from the U.S. National Institute of Standards and Technology (NIST), which hosted the work of the ERCC.

I just realised that the spike sequences are often used inaccurately: what is provided is the sequence of the inserts, which were cloned into the plasmids used as templates for in vitro transcription of the RNA spikes.  Thus, some linker sequences are present at the 5′ ends of the spikes, but are not in the reference sequences provided on-line.  In RNA-seq experiments, this is usually not a visible problem, since it is rare that a read would start or end within the first ~10 bases of an RNA.  However, for methods focusing on 5′ ends, such as CAGE (Cap Analysis Gene Expression), it is essential to correct the reference sequences.

I am providing a patch on GitHub Gist, to add the missing linker sequences.  (Since multiple reference files are provided, for instance with or without polyA tails, I think that providing a single patch is simpler.)  All spikes are missing GG, from the T7 RNA polymerase promoter, and GAATTC, from an EcoRI cloning site.  Some spikes also contain a SacI site (GAGCTC), and some contain a KpnI site (GGTACC) on top of this.  This information is available in the material certificate on the NIST website.
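To illustrate what the correction amounts to, here is a minimal Python sketch that prepends the linker to each record of a FASTA file.  It is not the patch from the Gist.  Two things in it are assumptions of mine and should be checked against the NIST material certificate: the 5′ to 3′ order of the fragments (GG, then EcoRI, then SacI, then KpnI), and the table saying which spikes carry the extra sites, which the caller has to supply.  The function names and the spike IDs in the usage note are hypothetical.

```python
# Linker fragments named in the text above.
T7_GG = "GG"      # from the T7 RNA polymerase promoter
ECORI = "GAATTC"  # EcoRI cloning site (all spikes)
SACI = "GAGCTC"   # SacI site (some spikes)
KPNI = "GGTACC"   # KpnI site (some spikes, on top of SacI)


def linker_prefix(sac_i=False, kpn_i=False):
    """Build the 5' linker. The fragment order is an ASSUMPTION."""
    prefix = T7_GG + ECORI
    if sac_i:
        prefix += SACI
    if kpn_i:
        prefix += KPNI
    return prefix


def patch_fasta(lines, extra_sites):
    """Yield FASTA lines with the linker prepended to each record.

    extra_sites maps a spike ID to a (sac_i, kpn_i) tuple of booleans.
    IDs that are not listed get only GG + GAATTC.
    The first sequence line of each record becomes longer than the
    others; re-wrap afterwards if fixed-width lines are needed.
    """
    pending = None  # linker waiting for the next sequence line
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            spike_id = fields[0] if fields else ""
            flags = extra_sites.get(spike_id, (False, False))
            pending = linker_prefix(*flags)
            yield line
        elif pending is not None and line:
            yield pending + line
            pending = None
        else:
            yield line
```

For instance, `list(patch_fasta([">SPIKE_A", "ACGT"], {}))` gives the header unchanged followed by the sequence line with GG and GAATTC added in front, and passing `{"SPIKE_A": (True, True)}` as the second argument would also add the SacI and KpnI sites.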

I hope that this patch can be useful to others.  If nobody reports issues with it, I will send an email to Thermo Fisher asking them if they would consider distributing patched spike sequences instead of the insert sequences as they do currently.

Edited: s/Sal/Sac/ (only the name was wrong; the sequence is correct).

Edited: The RNA spike sequences are now distributed by NIST!

Following my enquiry, the maker declined to modify the reference sequences that are distributed on its website, since for RNA-seq the missing linker sequences are not a problem, but suggested that the decision might be reconsidered if more customers use the spikes in applications where the missing linker sequences matter.

I contacted NIST directly and I am very happy to announce that they are now distributing the sequences of the RNA spikes (putative T7 transcription products) on their website. I added the URL in the main text above. The file they distribute contains the 96 ERCC sequences; note that only 92 spikes are used in the commercial version.

