Sequencing PHRED quality confusion
3
0
Entering edit mode
4.1 years ago
piyushjo ▴ 700

Hi,

I am processing some patient data that was generated a long time ago. Some of the files from the same sample have two different fastqs. When I ran fastqc to identify adapter content, I observed that some files are labelled as Illumina1.5 while others are labelled as Illumina 1.9 (from the same sample, from different runs).As for trimming I have to specify the phred quality in trim_galore and also HISAT2 required a flag (I haven't seen major changes when I don't put that flag, but I don't want to ignore when I have the information), I want to make sure the phred quality is really 64 or not.

However, when I look at the read quality per base. I am doubtful if the reads are really Illumina 1.5, or are just being called such.

Example. This file is being called as Illumina 1.5, but the per base quality looks similar to typical Illumina 1.9. Is this really Illumina 1.5?

Slide1

The next two are being called as Illumina 1.9, but per base quality is almost on average. Slide2 Slide3

Is there another way that can give me a better idea? Can I just read the top of the file and that will give me a better answer? Or shall I trust the report from FastQC?

Thanks!

fastqc illumina 1.5 illumina 1.9 • 1.7k views
ADD COMMENT
3
Entering edit mode
4.1 years ago
GenoMax 141k

You can use the following program from BBMap suite to get an idea of the encoding.

$ testformat.sh in=seq.fq.gz
sanger    fastq    gz    interleaved    150bp

If this data is old (5+ yr) then it may be in Illumina (Phred + 64) format.

ADD COMMENT
3
Entering edit mode
4.1 years ago
Juke34 8.5k

From GAAS you can use gaas_fastq_guessMyFormat.pl

ADD COMMENT
1
Entering edit mode
4.1 years ago

What I would do is look at the quality values int the FASTQ files.

https://en.wikipedia.org/wiki/FASTQ_format

specifically at this line:

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ.....................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 41)
     with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) 
     (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

it shows you what character values are valid.

For example if you have lower case qualities abc... but no characters in the :;<=>?@A range then you would have an Illumina 1.5 variant.

in all fairness, there are really only two encodings, +33 and +64. Within the +64 variants, those will decode correctly because only the shift matters.

Obviously there could be distributions of values where you can't quite establish which one of the overlapping formats your data is in.

It would be curious though if the data from the same source switched encodings.

ADD COMMENT

Login before adding your answer.

Traffic: 3222 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6