Tutorial: Transforming And Manipulating Color Space Reads
5.6 years ago by
Istvan Albert ♦♦ 75k
University Park, USA

This document is part of the lecture series on Next Generation Sequencing offered by the Bioinformatics Consulting Center at Penn State

Sequencing instruments that operate in color space produce their data in the so-called 2-Base Encoding (see the official documentation and the Wikipedia entry). In essence this means that a sequence:

AACTA

will be represented as transitions between nucleotides, labeled 0, 1, 2 and 3 (also referred to as colors: blue=0, green=1, yellow=2 and red=3). The decoding table is the following:

AA, CC, GG, TT : 0
AC, CA, GT, TG : 1
AG, CT, GA, TC : 2
AT, CG, GC, TA : 3

Therefore the five-base sequence AACTA will be represented as 0123: the transitions are AA=0, AC=1, CT=2 and TA=3.
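The encoding step can be sketched in a few lines of Python. This is only an illustration built from the table above; the names COLOR and encode are made up for this example and are not taken from cslib.py.

```python
# Color for each ordered pair of adjacent bases, copied from the table above.
COLOR = {
    "AA": "0", "CC": "0", "GG": "0", "TT": "0",
    "AC": "1", "CA": "1", "GT": "1", "TG": "1",
    "AG": "2", "CT": "2", "GA": "2", "TC": "2",
    "AT": "3", "CG": "3", "GC": "3", "TA": "3",
}


def encode(bases):
    """Return the color string of a letter space sequence (hypothetical helper)."""
    # Every color describes the transition between two neighboring bases,
    # so n bases always produce n - 1 colors.
    return "".join(COLOR[a + b] for a, b in zip(bases, bases[1:]))


print(encode("AACTA"))  # 0123
```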

Properties

Identity transitions (the same base twice) are labeled with 0, complementary transitions are labeled with 3, and non-complementary transitions are labeled with 1 and 2.

  • the complement sequence TTGAT encodes to the same colors 0123
  • the reverse sequence ATCAA encodes to the colors in reverse order 3210
  • the reverse complement TAGTT also encodes to the colors in reverse order 3210

From this it is obvious that a colorspace representation cannot be uniquely decoded on its own: every color string has four alternative letter space decodings, one for each possible first base.
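One way to check these properties is to write the table as an XOR of two-bit base codes (A=0, C=1, G=2, T=3), which reproduces every entry of the decoding table. Complementing a base flips both of its bits, so the XOR of two neighbors does not change. The sketch below uses made-up helper names and is not the author's reference implementation.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode(bases):
    # XOR of the two-bit codes gives the same colors as the transition table.
    return "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(bases, bases[1:]))


def complement(bases):
    return "".join(COMPLEMENT[base] for base in bases)


seq = "AACTA"
print(encode(seq))                    # 0123
print(encode(complement(seq)))        # 0123 (complement TTGAT)
print(encode(seq[::-1]))              # 3210 (reverse ATCAA)
print(encode(complement(seq)[::-1]))  # 3210 (reverse complement TAGTT)
```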

To decode a sequence we need to know the identity of the first base; with that we can decode the rest. In our example, if we know that the first base is A, then A0123 decodes as follows: A followed by 0 gives AA, then AA followed by 1 gives AAC, and so on until we obtain AACTA.
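A minimal decoding sketch, assuming the first base is known and belongs to the sequence as in the A0123 example. The lookup NEXT is simply the table above turned around, and the names are illustrative only.

```python
# NEXT[base][color] is the base that follows `base` for a given color.
NEXT = {"A": "ACGT", "C": "CATG", "G": "GTAC", "T": "TGCA"}


def decode(first_base, colors):
    """Decode a color string, keeping the known first base in the output."""
    bases = [first_base]
    for color in colors:
        bases.append(NEXT[bases[-1]][int(color)])
    return "".join(bases)


print(decode("A", "0123"))  # AACTA

# The same colors decode to four different sequences, one per first base.
for base in "ACGT":
    print(base, decode(base, "0123"))
```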

At the time of writing most color space instruments (ABI 5500 Series Genetic Analysis Systems) produced two files per sample. One is a so-called color space Fasta file (.csfasta); the other is a quality file (.qual) that contains the Phred quality scores. The color space Fasta format looks like this:

>853_7_463_F3
T3231110.122321002.0011.0012.2213..2
>853_17_1660_F3
T20201030313112312100020202032020120

Transitions that the instrument was unable to detect are replaced with . (dot). The first base above is the so-called primer; it was added during the library preparation step and is therefore not part of the original sample. We may use this primer to transform our colorspace data into letterspace format. The corresponding color space quality format:

>853_7_463_F3
33 32 29 33 33 31 33 -1 33 33 33 31 26 31 32 33 30 -1 27 31 33 33 -1 29 31 31 30 -1 28 29 18 16 -1 -1 28 
>853_17_1660_F3
24 26 31 29 22 27 31 32 31 22 26 30 24 25 33 30 26 30 29 22 29 33 30 26 10 17 24 27 27 24 26 15 7 15 32

Although this is not explicitly specified, in our experience we can safely assume that each sequence and its corresponding qualities are stored on a single line. This makes processing the files a lot easier. Please also note that the number of items in the quality line is one fewer than the number of characters in the sequence line: the quality measures reflect transitions, so the primer base has no quality value.
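Under the single-line assumption a reader for the two files can stay very small. The sketch below works on any iterable of lines (for example open file handles), pairs each read with its scores and checks the one-fewer-score rule. The function names are hypothetical, and skipping blank lines and lines that start with # is an assumption about file headers.

```python
def read_records(lines):
    """Yield (name, data) pairs, assuming one header line then one data line."""
    name = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank and '#' comment lines carry no reads
        if line.startswith(">"):
            name = line[1:]
        else:
            yield name, line


def paired_reads(csfasta_lines, qual_lines):
    """Yield (name, colors, scores) after checking that the two files agree."""
    pairs = zip(read_records(csfasta_lines), read_records(qual_lines))
    for (name, colors), (qname, qual) in pairs:
        scores = [int(value) for value in qual.split()]
        # One score per color; the leading primer base has no score.
        if name != qname or len(scores) != len(colors) - 1:
            raise ValueError(f"sequence and quality do not match for {name}")
        yield name, colors, scores


csfasta = [">853_7_463_F3", "T3231110.122321002.0011.0012.2213..2"]
qual = [">853_7_463_F3",
        "33 32 29 33 33 31 33 -1 33 33 33 31 26 31 32 33 30 -1 27 31 33 33 "
        "-1 29 31 31 30 -1 28 29 18 16 -1 -1 28"]
for name, colors, scores in paired_reads(csfasta, qual):
    print(name, len(colors), len(scores))  # 853_7_463_F3 36 35
```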

Important

Note that representations such as T0123 are a mixture of letterspace and colorspace formats. The handy properties of the color space described above only apply to the color space part of the sequence! Moreover these properties are only valid when using tools and techniques that operate fully in color space and ignore both the first base and the first color of the sequence!

Once you make use of the properties to complement or reverse in color space the meaning of the first base is lost and the sequence may not directly be decoded into letter space anymore!

If you absolutely must access the full reverse complement of a sequence in mixed representation, you will need to first decode it into letter space, then reverse complement the letter space sequence and reconvert it into colorspace.

Example: if you have a color space aware tool, the colorspace reverse complement of T0123 will be T321. But if you have a tool that is not colorspace aware, you will first need to decode T0123 into TGAT, reverse complement it into ATCA and then encode it with a primer (say T) as T3321.
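The two routes of this example can be traced in Python. This is a sketch with made-up helper names; it assumes the read contains no unknown colors and that the primer base is not emitted when decoding.

```python
NEXT = {"A": "ACGT", "C": "CATG", "G": "GTAC", "T": "TGCA"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def decode(read):
    """Decode a primer-prefixed color read; the primer itself is not emitted."""
    base, bases = read[0], []
    for color in read[1:]:
        base = NEXT[base][int(color)]
        bases.append(base)
    return "".join(bases)


def encode(primer, bases):
    """Encode letter space bases as colors, starting from the given primer."""
    colors, previous = [], primer
    for base in bases:
        colors.append(str(NEXT[previous].index(base)))
        previous = base
    return primer + "".join(colors)


def revcomp(bases):
    return "".join(COMPLEMENT[base] for base in reversed(bases))


read = "T0123"
# Color space only: drop the primer and the first color, reverse the rest.
# The leading T is now just a placeholder and can no longer seed a decoding.
print("T" + read[2:][::-1])                # T321
# Letter space round trip: decode, reverse complement, encode again.
print(decode(read))                        # TGAT
print(revcomp(decode(read)))               # ATCA
print(encode("T", revcomp(decode(read))))  # T3321
```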

Transformations

Some tools can operate on color space files directly; others require certain transformations. There is some confusion in this area because different transformations are often referred to by identical terms. We'll try to clarify these below. There are numerous high performance tools that perform these transformations; we hope other people will point them out in answers and comments. Some tools provide their own converters; make sure to use those.

Note: for the sake of completeness we also provide reference implementations in Python for each of the transformations below. Please see the cslib.py file, which contains functions that perform each of the transformations. For example use cases see the test_cslib.py file, where each transformation below is implemented as a function call. A later post points to the NGS plumbing Python library, which offers more color space data manipulation.

We'll be starting with a file that contains:

>853_7_463_F3
T3231110.122321002.0011.0012.2213..2
>853_17_1660_F3
T20201030313112312100020202032020120

Color space Fasta to letter space Fasta

This transformation simply decodes the colors in the sequence into base (letter) space using the transition table. Caveats: once an unknown color . (dot) is seen the remainder of the sequence becomes undetermined. More importantly, a single miscalled color will cause incorrect values for the rest of the sequence:

>853_7_463_F3
AGCACAANNNNNNNNNNNNNNNNNNNNNNNNNNNN
>853_17_1660_F3
CCTTGGCCGTACAGCAGTTTTCCTTCCGAAGGTCC
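A sketch of this decoding that also handles unknown colors: once a dot is seen, every later base is written as N. The function name is made up for this example; the outputs match the two records shown above.

```python
# NEXT[base][color] is the base that follows `base` for a given color.
NEXT = {"A": "ACGT", "C": "CATG", "G": "GTAC", "T": "TGCA"}


def to_letter_space(read):
    """Decode a primer-prefixed color read; the primer base is not emitted."""
    base, bases = read[0], []
    for color in read[1:]:
        # An unknown color, or an already unknown previous base, gives N.
        known = base != "N" and color != "."
        base = NEXT[base][int(color)] if known else "N"
        bases.append(base)
    return "".join(bases)


print(to_letter_space("T3231110.122321002.0011.0012.2213..2"))
print(to_letter_space("T20201030313112312100020202032020120"))
```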

Color space Fasta to double-encoded Fasta

The Velvet assembler requires the data to be in this format. The conversion is usually accomplished via solid_denovo_preprocessor.pl.

This format trims the primer base and the first color and then replaces the rest of the colors via the simple transformation 0123 --> ACGT (unknown colors become N):

>853_7_463_F3
GTCCCANCGGTGCAAGNAACCNAACGNGGCTNNG
>853_17_1660_F3
AGACATATCTCCGTCGCAAAGAGAGATGAGACGA
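Double encoding is a plain character substitution, so a translation table is enough. This is a sketch with an illustrative function name and is not the code of solid_denovo_preprocessor.pl.

```python
# Relabel colors as letters; the letters are NOT real bases.
DOUBLE = str.maketrans("0123.", "ACGTN")


def double_encode(read):
    """Drop the primer base and the first color, then relabel the colors."""
    return read[2:].translate(DOUBLE)


print(double_encode("T3231110.122321002.0011.0012.2213..2"))
# GTCCCANCGGTGCAAGNAACCNAACGNGGCTNNG
```

The output is one character shorter than the color string because the first color depends on the primer rather than on two bases of the sample.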

Reverse complement color space Fasta

This operation is usually necessary if the orientation of the reads is not in the more commonly observed forward + reverse format. Some tools only recognize certain orientations, thus we may need to change the orientation of the second read of each pair in forward + forward pairings:

>853_7_463_F3
T2..3122.2100.1100.200123221.0111323
>853_17_1660_F3
T02102023020202000121321131303010202

Remember that the data above cannot be decoded into a valid letterspace representation! If the latter is necessary then you will need a full color --> letter --> color roundtrip conversion:

>853_7_463_F3
T.............................011132
>853_17_1660_F3
T10210202302020200012132113130301020

Note how this operation leads to substantial information loss.
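The roundtrip can be sketched as follows, assuming that any transition touching an unknown base is written back as a dot and that the same primer letter T is reused. All names are hypothetical. The sketch also shows where the information loss comes from: a single dot near the start of a read turns the rest of the decoded sequence into N.

```python
NEXT = {"A": "ACGT", "C": "CATG", "G": "GTAC", "T": "TGCA"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def decode(read):
    """Colors to letters; everything after an unknown color becomes N."""
    base, bases = read[0], []
    for color in read[1:]:
        known = base != "N" and color != "."
        base = NEXT[base][int(color)] if known else "N"
        bases.append(base)
    return "".join(bases)


def encode(primer, bases):
    """Letters to colors; a transition that touches an N becomes a dot."""
    colors, previous = [], primer
    for base in bases:
        if "N" in (previous, base):
            colors.append(".")
        else:
            colors.append(str(NEXT[previous].index(base)))
        previous = base
    return primer + "".join(colors)


def roundtrip_revcomp(read, primer="T"):
    """Color --> letter --> reverse complement --> color."""
    flipped = "".join(COMPLEMENT[base] for base in reversed(decode(read)))
    return encode(primer, flipped)


print(roundtrip_revcomp("T20201030313112312100020202032020120"))
# T10210202302020200012132113130301020
```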

Color space Fasta to color space Fastq

Note: the solid2fastq program in the bfast aligner source directory produces this conversion.

This format merges the two color space data files into a FastQ-like format. The sequence is kept in color space but the quality measures are reformatted in FastQ encoding. Note that since the primer is included, the sequence is one character longer than the quality string:

@853_7_463_F3
T3231110.122321002.0011.0012.2213..2
+
BA>BB@B!BBB@;@AB?!<@BB!>@@?!=>31!!=
@853_17_1660_F3
T20201030313112312100020202032020120
+
9;@>7<@A@7;?9:B?;?>7>B?;+29<<9;0(0A
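In the example above the quality characters are the scores plus an offset of 33, with the -1 of undetected transitions written as "!". The sketch below follows that reading; it is not the code of solid2fastq, and the function names are made up.

```python
def phred33(scores):
    """Encode integer quality scores as Phred+33 characters."""
    # The .qual file marks undetected transitions with -1; clamp them to 0 ("!").
    return "".join(chr(max(score, 0) + 33) for score in scores)


def to_cs_fastq(name, colors, scores):
    """Format one color space FastQ record; the sequence keeps its primer."""
    return f"@{name}\n{colors}\n+\n{phred33(scores)}"


# First eight scores of read 853_7_463_F3.
print(phred33([33, 32, 29, 33, 33, 31, 33, -1]))  # BA>BB@B!
```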

Color space Fasta to double encoded Fastq

Note: the solid2fastq.pl program in the bwa aligner source repository performs this transformation.

This format combines the double encoding with the FastQ quality formatting:

@853_7_463_F3
GTCCCANCGGTGCAAGNAACCNAACGNGGCTNNG
+
A>BB@B!BBB@;@AB?!<@BB!>@@?!=>31!!=
@853_17_1660_F3
AGACATATCTCCGTCGCAAAGAGAGATGAGACGA
+
;@>7<@A@7;?9:B?;?>7>B?;+29<<9;0(0A
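Since double encoding drops the first color, the first quality score has to be dropped as well so that sequence and quality keep the same length. A sketch under that reading, with a hypothetical function name, assuming standard FastQ headers that start with @:

```python
DOUBLE = str.maketrans("0123.", "ACGTN")


def to_double_encoded_fastq(name, read, scores):
    """Format one double-encoded FastQ record (not the bwa script itself)."""
    # Trim the primer base and the first color, then relabel the colors.
    sequence = read[2:].translate(DOUBLE)
    # Drop the score of the trimmed first color; -1 is clamped to 0 ("!").
    quality = "".join(chr(max(score, 0) + 33) for score in scores[1:])
    return f"@{name}\n{sequence}\n+\n{quality}"


# The first nine colors and scores of read 853_7_463_F3.
record = to_double_encoded_fastq(
    "853_7_463_F3", "T3231110.1", [33, 32, 29, 33, 33, 31, 33, -1, 33])
print(record)
```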

Color space Fasta to Fastq

This format applies both the letter space and the quality conversion:

@853_7_463_F3
AGCACAANNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
BA>BB@B!BBB@;@AB?!<@BB!>@@?!=>31!!=
@853_17_1660_F3
CCTTGGCCGTACAGCAGTTTTCCTTCCGAAGGTCC
+
9;@>7<@A@7;?9:B?;?>7>B?;+29<<9;0(0A

Excellent tutorial, now color space is not a mystery to me anymore! thanks!

modified 4.8 years ago • written 4.8 years ago by dfernan610

Thanks for this nice piece of information.

modified 4.4 years ago • written 4.4 years ago by KJ Lim110

Hi,

Thanks for interesting post. However I wanted to know how would we convert fastq files which are stored in SRA and are in the ABI Solid format. They do not have any accompanying color space file. Kindly let me know how to go about it.

written 4.4 years ago by skm770140

you should ask this as a separate question and post a few lines from the file - I am guessing it is in a colorspace fastq or some variant. It all depends on these details.

written 4.4 years ago by Istvan Albert ♦♦ 75k

Hi Istvan,

Here's the post for the above question : ABI Solid fastq files quality control and further analysis

written 4.4 years ago by skm770140

Hi Istvan,

The links to the cslib.py and test_cslib.py are no longer active. Could you please correct them ?

Thanks!

written 2.3 years ago by multicode0