Question: Sequencing PHRED quality confusion
0
gravatar for piyushjo
10 months ago by
piyushjo550
piyushjo550 wrote:

Hi,

I am processing some patient data that was generated a long time ago. Some of the files from the same sample have two different fastqs. When I ran fastqc to identify adapter content, I observed that some files are labelled as Illumina1.5 while others are labelled as Illumina 1.9 (from the same sample, from different runs).As for trimming I have to specify the phred quality in trim_galore and also HISAT2 required a flag (I haven't seen major changes when I don't put that flag, but I don't want to ignore when I have the information), I want to make sure the phred quality is really 64 or not.

However, when I look at the read quality per base. I am doubtful if the reads are really Illumina 1.5, or are just being called such.

Example. This file is being called as Illumina 1.5, but the per base quality looks similar to typical Illumina 1.9. Is this really Illumina 1.5?

Slide1

The next two are being called as Illumina 1.9, but per base quality is almost on average. Slide2 Slide3

Is there another way that can give me a better idea? Can I just read the top of the file and that will give me a better answer? Or shall I trust the report from FastQC?

Thanks!

ADD COMMENTlink modified 10 months ago by Istvan Albert ♦♦ 86k • written 10 months ago by piyushjo550
3
gravatar for GenoMax
10 months ago by
GenoMax94k
United States
GenoMax94k wrote:

You can use the following program from BBMap suite to get an idea of the encoding.

$ testformat.sh in=seq.fq.gz
sanger    fastq    gz    interleaved    150bp

If this data is old (5+ yr) then it may be in Illumina (Phred + 64) format.

ADD COMMENTlink modified 10 months ago • written 10 months ago by GenoMax94k
3
gravatar for Juke34
10 months ago by
Juke345.0k
Sweden
Juke345.0k wrote:

From GAAS you can use gaas_fastq_guessMyFormat.pl

ADD COMMENTlink written 10 months ago by Juke345.0k
1
gravatar for Istvan Albert
10 months ago by
Istvan Albert ♦♦ 86k
University Park, USA
Istvan Albert ♦♦ 86k wrote:

What I would do is look at the quality values int the FASTQ files.

https://en.wikipedia.org/wiki/FASTQ_format

specifically at this line:

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ.....................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 41)
     with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) 
     (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

it shows you what character values are valid.

For example if you have lower case qualities abc... but no characters in the :;<=>?@A range then you would have an Illumina 1.5 variant.

in all fairness, there are really only two encodings, +33 and +64. Within the +64 variants, those will decode correctly because only the shift matters.

Obviously there could be distributions of values where you can't quite establish which one of the overlapping formats your data is in.

It would be curious though if the data from the same source switched encodings.

ADD COMMENTlink written 10 months ago by Istvan Albert ♦♦ 86k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2493 users visited in the last hour
_