Question: Tool To Find Out If Fastq Is In Sanger Or Phred64 Encoding?
12
6.2 years ago by
United Kingdom
141341254653464453.4k wrote:

Is there a simple tool I can use to quickly find out whether a FASTQ file is in Sanger or Phred64 encoding? Ideally something that prints 'Encoding XX' somewhere in the terminal output.

fastq tools • 33k views
ADD COMMENTlink modified 2.6 years ago by Shicheng Guo7.4k • written 6.2 years ago by 141341254653464453.4k

FastQC has a good guesser. Alternatively, use the following Perl script: fastqFormatDetect.pl

Both base their results on the characters found in the quality lines of the FASTQ file. This is well explained above and on the FASTQ wiki page.

ADD REPLYlink modified 2.5 years ago • written 3.0 years ago by Juke-342.1k

That link is too old and gives 404

ADD REPLYlink written 2.5 years ago by Xapple30

I'm still looking for the new URL. In the meantime I found a GitHub repository that has a copy of it, and I updated the link accordingly.

ADD REPLYlink written 2.5 years ago by Juke-342.1k
11
6.2 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

brentp has a nice utility that does just that; see https://github.com/brentp/bio-playground/blob/master/reads-utils/guess-encoding.py

See also this: Guessing the quality scale in FASTQ files

ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Istvan Albert ♦♦ 80k
2

Thanks, that worked:

gunzip -c file.fastq.gz | awk 'NR % 4 == 0' | head -n 1000000 | python ./guess-encoding.py

ADD REPLYlink written 6.2 years ago by 141341254653464453.4k
2

Note that you can just pass -n 100000 as an argument to guess-encoding.py instead of piping through head.

ADD REPLYlink written 6.2 years ago by brentp23k

guess-encoding.py needs to be updated.

ADD REPLYlink written 3.9 years ago by Medhat8.2k

It seems guess-encoding.py has a misleading example, suggesting cut -f 5 instead of cut -f 11 to grab quality strings.

ADD REPLYlink written 11 months ago by johnsenkyle130
8
6.2 years ago by
Irsan6.8k
Amsterdam
Irsan6.8k wrote:

If the quality scores contain the character 0, the file is either Sanger Phred+33 or Illumina 1.8+ Phred+33. If they also contain the character J, it is Illumina 1.8+ Phred+33; otherwise it is Sanger Phred+33.

If the quality scores do not contain 0, the file is one of Solexa+64, Illumina 1.3+ Phred+64, or Illumina 1.5+ Phred+64:

It is Solexa+64 when it contains the character =

It is Illumina 1.3+ Phred+64 when it contains A

It is Illumina 1.5+ Phred+64 when it contains neither A nor =

Take a look at the wiki and try to understand the table
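
The rules above can be written down as a small Python sketch. This is only an illustration of the character checks as stated in this answer: the function name guess_by_characters is made up here, and the "no 0 means +64" step is a heuristic that can misfire on a Phred+33 file that happens to contain only high scores.

```python
def guess_by_characters(quality_lines):
    """Apply the character-presence rules described above (heuristic only)."""
    # Collect every distinct character seen in the quality lines.
    seen = set("".join(quality_lines))

    if "0" in seen:
        # '0' is ASCII 48, below every +64 range, so the offset must be 33.
        if "J" in seen:
            return "Illumina 1.8+ Phred+33"
        return "Sanger Phred+33"

    # No '0' seen: the rules above assume one of the +64 encodings.
    if "=" in seen:
        return "Solexa+64"
    if "A" in seen:
        return "Illumina 1.3+ Phred+64"
    return "Illumina 1.5+ Phred+64"
```

Feed it only the quality lines (every 4th line of the file), since header and sequence lines contain these characters too.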

ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Irsan6.8k
7
3.9 years ago by
Medhat8.2k
Texas
Medhat8.2k wrote:

head -n 40 file.fastq | awk '{if(NR%4==0) printf("%s",$0);}' | od -A n -t u1 | awk 'BEGIN{min=100;max=0;}{for(i=1;i<=NF;i++) {if($i>max) max=$i; if($i<min) min=$i;}}END{if(max<=74 && min<59) print "Phred+33"; else if(max>73 && min>=64) print "Phred+64"; else if(min>=59 && min<64 && max>73) print "Solexa+64"; else print "Unknown score encoding!";}'

 

source
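
For anyone who prefers Python, here is a rough port of the same min/max thresholds. It is a sketch, not a tested tool: the function names quality_lines and classify and the max_reads parameter are made up for illustration, and it assumes plain 4-line FASTQ records (no wrapped sequence lines). Unlike head -n 40, which only looks at the first 10 reads, it samples more reads by default.

```python
import gzip
import itertools


def quality_lines(path, max_reads=100000):
    """Yield the quality line of up to max_reads 4-line FASTQ records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Quality strings sit on lines 4, 8, 12, ... (indices 3, 7, 11, ...).
        for line in itertools.islice(handle, 3, max_reads * 4, 4):
            yield line.rstrip("\n")


def classify(lines):
    """Classify using the same thresholds as the awk one-liner above."""
    lo, hi = 100, 0  # same starting values as the awk BEGIN block
    for line in lines:
        for char in line:
            code = ord(char)
            lo = min(lo, code)
            hi = max(hi, code)
    if hi <= 74 and lo < 59:
        return "Phred+33"
    if hi > 73 and lo >= 64:
        return "Phred+64"
    if 59 <= lo < 64 and hi > 73:
        return "Solexa+64"
    return "Unknown score encoding!"
```

Usage would be something like print(classify(quality_lines("file.fastq.gz"))).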

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Medhat8.2k
5
6.2 years ago by
toni2.1k
Lyon
toni2.1k wrote:

You can use this tool:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

It has an internal automatic guesser.

T.

ADD COMMENTlink written 6.2 years ago by toni2.1k
3
3.9 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

BBMap has a little tool for this:

$ testformat.sh in=N0174.fq.gz
sanger    fastq    gz    interleaved    150bp

 

ADD COMMENTlink written 3.9 years ago by Brian Bushnell16k
2
6.2 years ago by
Gvj440
Netherlands
Gvj440 wrote:

If you are looking for a quick and dirty method, just grep for any character that is unique to Sanger or Phred64. You can find those characters at http://en.wikipedia.org/wiki/FASTQ_format

grep Z filename # for Phred64 and make sure that the lines are not headers

ADD COMMENTlink written 6.2 years ago by Gvj440
1
2.6 years ago by
Shicheng Guo7.4k
Shicheng Guo7.4k wrote:

Install BBMap and then use its reformat.sh script. The examples below convert Phred+64 input (qin=64) to Phred+33 output (qout=33):

Usage:  reformat.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2>

reformat.sh in=Indx07.read1.fq out=Indx07.read1.phred33.fq qin=64 qout=33
reformat.sh in=Indx07.read2.fq out=Indx07.read2.phred33.fq qin=64 qout=33
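
To make clear what a 64-to-33 conversion does to each quality character, here is a minimal Python sketch. It is not how reformat.sh is implemented; the function name phred64_to_phred33 is made up for illustration, and it assumes true Phred+64 scores (Solexa+64 scores use a different scale and would need a different conversion).

```python
def phred64_to_phred33(quality):
    """Re-encode one quality string from ASCII offset 64 to offset 33."""
    converted = []
    for char in quality:
        score = ord(char) - 64  # decode with the Phred+64 offset
        if score < 0:
            # Characters below '@' are not valid Phred+64 (possibly Solexa).
            raise ValueError(f"character {char!r} is below the Phred+64 range")
        converted.append(chr(score + 33))  # re-encode with the Phred+33 offset
    return "".join(converted)
```

For example, 'h' (Q40 in Phred+64) becomes 'I' (Q40 in Phred+33).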
ADD COMMENTlink written 2.6 years ago by Shicheng Guo7.4k
1
17 months ago by
ando.kelli40
University of Tasmania
ando.kelli40 wrote:

Hey there, if you run FastQC you can see the quality format in the main output screen, in the section marked "Encoding"

ADD COMMENTlink modified 17 months ago • written 17 months ago by ando.kelli40
0
3.0 years ago by
n.caillou0
n.caillou0 wrote:

As noted by Medhat above, GNU od or hexdump can be used to convert the quality characters to their decimal ASCII values, so

 cat file.fq | awk 'NR%4==0' | tr -d '\n' | hexdump -v -e'/1 "%u\n"' | sort -nu

will display which (decimal) quality scores exist in your file.

According to brentp's "guess-encoding.py" script the possible ranges are 33-93 (Sanger/Illumina1.8), 64-104 (Illumina1.3 or Illumina1.5) and 59-104 (Solexa). Similarly FastQC assumes that anything with some scores in the 33-63 range is Sanger and that the rest is Illumina1.3-1.5 (it doesn't know about Solexa scores).
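
Those ranges overlap, so a given min/max can be compatible with more than one encoding. Here is a small Python sketch that lists every encoding whose range covers the observed values. It only uses the three ranges quoted above; the names RANGES and compatible_encodings are made up for illustration, and this is not the actual code of guess-encoding.py or FastQC.

```python
# Decimal ASCII ranges as quoted above.
RANGES = {
    "Sanger/Illumina 1.8 (Phred+33)": (33, 93),
    "Illumina 1.3/1.5 (Phred+64)": (64, 104),
    "Solexa (Solexa+64)": (59, 104),
}


def compatible_encodings(quality_lines):
    """Return all encodings whose range contains every observed character."""
    codes = {ord(char) for line in quality_lines for char in line}
    if not codes:
        return []
    lo, hi = min(codes), max(codes)
    return [
        name
        for name, (range_min, range_max) in RANGES.items()
        if range_min <= lo and hi <= range_max
    ]
```

If more than one name comes back, the sample is ambiguous and you should look at more reads.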

ADD COMMENTlink modified 18 months ago • written 3.0 years ago by n.caillou0