Question

Sequencing PHRED quality confusion

0

Entering edit mode

4.1 years ago

piyushjo ▴ 700

Hi,

I am processing some patient data that was generated a long time ago. Some of the files from the same sample have two different fastqs. When I ran fastqc to identify adapter content, I observed that some files are labelled as Illumina1.5 while others are labelled as Illumina 1.9 (from the same sample, from different runs).As for trimming I have to specify the phred quality in trim_galore and also HISAT2 required a flag (I haven't seen major changes when I don't put that flag, but I don't want to ignore when I have the information), I want to make sure the phred quality is really 64 or not.

However, when I look at the read quality per base. I am doubtful if the reads are really Illumina 1.5, or are just being called such.

Example. This file is being called as Illumina 1.5, but the per base quality looks similar to typical Illumina 1.9. Is this really Illumina 1.5?

The next two are being called as Illumina 1.9, but per base quality is almost on average.

Is there another way that can give me a better idea? Can I just read the top of the file and that will give me a better answer? Or shall I trust the report from FastQC?

Thanks!

fastqc illumina 1.5 illumina 1.9 • 1.7k views

ADD COMMENT • link updated 4.1 years ago by Istvan Albert 100k • written 4.1 years ago by piyushjo ▴ 700

1

Entering edit mode

4.1 years ago

Istvan Albert 100k

What I would do is look at the quality values int the FASTQ files.

https://en.wikipedia.org/wiki/FASTQ_format

specifically at this line:

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ.....................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 41)
     with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) 
     (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

it shows you what character values are valid.

For example if you have lower case qualities abc... but no characters in the :;<=>?@A range then you would have an Illumina 1.5 variant.

in all fairness, there are really only two encodings, +33 and +64. Within the +64 variants, those will decode correctly because only the shift matters.

Obviously there could be distributions of values where you can't quite establish which one of the overlapping formats your data is in.

It would be curious though if the data from the same source switched encodings.

ADD COMMENT • link 4.1 years ago by Istvan Albert 100k

score 3 · Accepted Answer · 2020-03-23

3

Entering edit mode

4.1 years ago

GenoMax 141k

You can use the following program from BBMap suite to get an idea of the encoding.

$ testformat.sh in=seq.fq.gz
sanger    fastq    gz    interleaved    150bp

If this data is old (5+ yr) then it may be in Illumina (Phred + 64) format.

ADD COMMENT • link 4.1 years ago by GenoMax 141k

score 3 · Accepted Answer · 2020-03-23

3

Entering edit mode

4.1 years ago

Juke34 8.5k

From GAAS you can use gaas_fastq_guessMyFormat.pl

ADD COMMENT • link 4.1 years ago by Juke34 8.5k