How To Determine The Version Used To Generate Solexa/Illumina Fastq Files?
6
11
11.5 years ago

The GEO database contains an abundance of raw sequence tag files in FASTQ format, some of which were generated by Solexa/Illumina NGS. Since Solexa/Illumina changed their own standard as of version 1.3, and neither of their variants is compatible with the Sanger format, there are currently 3 separate FASTQ definitions. I was wondering if there is any (easy) way to determine which version was actually used to generate a given FASTQ file (in particular for Solexa/Illumina, since the platform is usually stated). That's something you have to provide to nearly any aligner, so that piece of information seems rather valuable.

next-gen sequencing fastq • 15k views
0

You should check, but I believe that the NCBI SRA archive, which actually hosts the FASTQ files (not NCBI GEO), claims to have converted them into Sanger-standard FASTQ. Once the data are there, the idea is that you needn't worry about the conversion, as it is supposed to have been done already.

7
11.5 years ago
Phis ★ 1.1k

Apparently, there are now 4 different FASTQ encodings, including a new Illumina 1.5+ one, which doesn't make your task easier. So just looking at the FASTQ files themselves, without any additional information specifying which software generated them (and without the possibility of contacting the people who did it), I don't see a general mechanism for finding out, except for special cases:

[Edited for clarification in response to comment:]

If the quality scores contain characters in the range ASCII 33 - 58 -> can only be Sanger

If FastQ file is known to be from an Illumina/Solexa platform AND the quality scores contain characters in the range ASCII 59 - 63 -> can only be Solexa/Illumina 1.0

If ASCII characters 64 or 65 are used in quality scores -> cannot be Illumina 1.5+
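
The three special cases above can be sketched in a few lines of Python. This is only a rough illustration, not a general detector: the function names and return labels are made up for the example, and using only the lowest character seen is a simplification that assumes the file contains nothing but valid quality characters.

```python
def quality_char_range(quality_lines):
    """Return (min, max) ASCII codes seen across the given quality strings."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    return min(codes), max(codes)


def apply_rules(lowest, from_illumina_platform=False):
    """Apply the three special-case rules to the lowest ASCII code observed.

    The returned labels are hypothetical names chosen for this sketch.
    """
    if lowest <= 58:
        # ASCII 33-58 only exists in the Sanger encoding.
        return "sanger"
    if lowest <= 63:
        # ASCII 59-63: Solexa/Illumina 1.0, but only if the platform is
        # known, because a Sanger file with uniformly high scores also
        # ends up here.
        return "solexa" if from_illumina_platform else "sanger or solexa"
    if lowest <= 65:
        # ASCII 64 or 65 is not used by Illumina 1.5+.
        return "not illumina 1.5+"
    return "undetermined"
```

For example, `apply_rules(quality_char_range(["IIII#5"])[0])` gives `"sanger"` because `#` is ASCII 35. Anything whose lowest character is `B` (ASCII 66) or above stays undetermined, which is the general case the rules cannot resolve.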

2

Oh my god ... 4 encodings and three from Illumina without them adding a header specifying the concept applied. We should seriously punish Illumina for their repeated crimes against the bioinformatic community! A simple but effective measure would be to reject anything for publication that uses Illumina platforms ;)

0

By the same principle, you could reject any project that used Microsoft Word or Excel. It seems to me that it should be trivial to parse the first lines of the FASTQ file and determine which version was used. (I agree that they should have added a line specifying it; I am just saying.)

0

That's not true: "If the quality scores contain characters in the range ASCII 59 - 63 -> can only be Solexa/Illumina 1.0"

It can be a Sanger FASTQ file with very good scores (e.g. a contig).

0

@Peter: you're right - I didn't make it clear I was talking about the non-Sanger encodings. I edited/expanded it to make it clearer.

7
11.5 years ago
Casbon ★ 3.2k

See Peter Cock's work on FastQ at http://github.com/biopython/biopython/blob/master/Bio/SeqIO/QualityIO.py

It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" for the more recent variant), as this cannot be detected reliably automatically.
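
To see why the variant has to be stated explicitly, here is a small stdlib-only sketch (it does not call Biopython) that decodes the same quality string under each variant name from the quote above. The offsets (33 for Sanger, 64 for the two Illumina variants) and the Solexa-to-PHRED formula are the standard ones; the `decode` helper itself is just an illustration.

```python
import math

# ASCII offset used by each FASTQ variant.
OFFSETS = {"fastq-sanger": 33, "fastq-solexa": 64, "fastq-illumina": 64}


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent PHRED score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def decode(quality, variant):
    """Decode a quality string into rounded PHRED scores for the given variant."""
    scores = [ord(ch) - OFFSETS[variant] for ch in quality]
    if variant == "fastq-solexa":
        # Solexa scores are on a different scale and need converting.
        scores = [round(solexa_to_phred(q)) for q in scores]
    return scores
```

The string `"hh"` decodes to `[71, 71]` as "fastq-sanger" but `[40, 40]` as "fastq-illumina", and both are syntactically valid, so nothing in the file itself tells you which reading is the intended one.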

0

I agree it cannot be detected reliably, but does Bio.SeqIO throw errors when the values fall outside the provided scale?

3
10.1 years ago
Marina Manrique ★ 1.3k

SolexaQA does that exactly (among many other fancy things), just type

solexaqa reads.fastq


and you will get the fastq format of the file: Illumina FASTQ format, Illumina pipeline 1.3+, Sanger FASTQ format, etc.

What I don't know is whether you need R installed for this functionality.

HTH,

Marina

0

Thanks a lot, Marina, for that link! I used the subroutine "get_format" in my Perl script and it works great. Now my wrapper can call Stampy with the appropriate FASTQ format.

2
11.5 years ago

Someone should write a script that gives out likelihoods that a fastq file is encoded a certain way. At least that will help eliminate one of the encodings.

So if you see quality scores from B-a:

0% Sanger
80% Illumina (a good run)

0

Very good idea. Can't wait for the bioinformatics paper on the posterior probability, taking all current GEO data and their creation time into account.

1
9.1 years ago
Dan ▴ 520

Hmm... Once I got 'get_format' working, it reports Sanger, which is what fastQValidator seems to be using (Phred).

Here is the code in case anyone else gets stuck converting it out of solexaqa:

#!/usr/bin/perl

use strict;
use warnings;

my $format = "";

# set regular expressions (written as ASCII ranges to avoid escaping problems)
my $sanger_regexp = qr/[\x21-\x3A]/;   # '!' to ':' (ASCII 33-58): Sanger only
my $solexa_regexp = qr/[\x3B-\x3F]/;   # ';' to '?' (ASCII 59-63): Solexa, if not Sanger
my $solill_regexp = qr/[\x4A-\x68]/;   # 'J' to 'h' (ASCII 74-104): Solexa or Illumina
my $all_regexp    = qr/[\x40-\x49]/;   # '@' to 'I' (ASCII 64-73): ambiguous, unused below

# set counters
my $sanger_counter = 0;
my $solexa_counter = 0;
my $solill_counter = 0;

my $i = 0;
while (<>) {
    $i++;

    # retrieve qualities: every fourth line of a FASTQ record
    next unless $i % 4 == 0;

    chomp;

    # check qualities
    if (m/$sanger_regexp/) {
        $sanger_counter = 1;
        last;    # one Sanger-only character settles it
    }
    if (m/$solexa_regexp/) {
        $solexa_counter = 1;
    }
    if (m/$solill_regexp/) {
        $solill_counter = 1;
    }
}

# determine format
if ($sanger_counter) {
    $format = "sanger";
}
elsif ($solexa_counter) {
    $format = "solexa";
}
elsif ($solill_counter) {
    $format = "illumina";
}

print "$format\n";

0
9.1 years ago
Dan ▴ 520

Assuming tool X expects version Y, what range of scores would you see given version Z?

I'm seeing results like the following from fastQValidator

Average Phred Quality by Read Index (starts at 0):
0       30.14
1       9.44
2       8.88
3       9.17
4       8.89
5       8.47
6       20.36
7       18.86
8       21.23
9       22.53
10      20.64
11      20.89
12      17.91
13      20.48
14      16.72
15      21.26
16      20.06
17      21.02
18      31.05
19      18.09
20      16.62
21      29.66
22      17.08
23      16.29
24      30.37
25      28.24
26      25.93
27      25.00
28      27.13
29      26.40
30      12.63
31      13.78
32      22.34
33      13.77
34      11.67
35      12.24
36      11.75
37      20.82
38      21.13
39      19.89
40      18.43
41      18.72

Overall Average Phred Quality = 19.19
Finished processing puke2 with 4000 lines containing 1000 sequences.
There were a total of 0 errors.
Returning: 0 : FASTQ_SUCCESS
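
On the question of what a tool reports when it assumes one offset and the file actually uses another: the apparent scores are simply shifted by the difference between the two offsets. A minimal sketch of that arithmetic follows; the function name and the 0 to 40 score range in the examples are illustrative assumptions, not something taken from fastQValidator.

```python
def apparent_range(true_offset, assumed_offset, lowest, highest):
    """Score range a tool would report when it decodes with the wrong offset.

    lowest/highest are the real scores in the file; the result is what the
    tool sees when it subtracts assumed_offset instead of true_offset.
    """
    shift = true_offset - assumed_offset
    return lowest + shift, highest + shift
```

A Phred+64 file with real scores 0 to 40, read as Phred+33, would appear as `apparent_range(64, 33, 0, 40)`, i.e. 31 to 71. The reverse mistake, `apparent_range(33, 64, 0, 40)`, gives -31 to 9. Per-position averages well below 31, as in the output above, would therefore be hard to explain if an offset-64 file were being read as Sanger, which seems consistent with the "sanger" call.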