Question: How To Control Wgsim Base Qualities?
gravatar for Pascal
6.6 years ago by
Pascal1.4k wrote:


I'm trying to make GATK UnifiedGenotyper to detect indels inserted by wgsim short reads simulator.

Broad employees just informed me that indels are not called because wgsim creates reads of Q17 and everything below 20 is thrown away.

I checked the fastq files, and indeed, all qualities are "2" which is if I understood well the ASCII translation corresponds to 17.

Is there a way to make wgsim output reads with a configurable base quality?

I thought that a trick could be to replace these "2222...2" strings by somthing like "????...?" Or is there a more elegant way to proceed?


gatk fastq • 3.1k views
ADD COMMENTlink modified 4.2 years ago by Zhaorong1.2k • written 6.6 years ago by Pascal1.4k
gravatar for Istvan Albert
6.6 years ago by
Istvan Albert ♦♦ 77k
University Park, USA
Istvan Albert ♦♦ 77k wrote:

You could use the following awk script to change your 2s to Is:

awk ' /222222222/ { gsub("2", "I"); print $0; next } { print } '
ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by Istvan Albert ♦♦ 77k

Double thanks! I will try both.

ADD REPLYlink written 6.6 years ago by Pascal1.4k

Double thanks Sir! I will try both solutions...

ADD REPLYlink written 6.6 years ago by Pascal1.4k

Wait a minute: I actually used the wgsim available at and this produces only "2" qualities ?!

ADD REPLYlink written 6.6 years ago by Pascal1.4k

my mistake - I have a file like that I thought I generated it with wgsim - you are correct it does put in 2s. I will correct the answer

ADD REPLYlink written 6.6 years ago by Istvan Albert ♦♦ 77k

No problem. Thanks for answering. I just awked my fastq files with your file and it is doing alignment now :->... I think I will move to MAQ as it looks it offers more control. Cheers.

ADD REPLYlink written 6.6 years ago by Pascal1.4k
gravatar for jeff.hammerbacher
5.3 years ago by
United States
jeff.hammerbacher90 wrote:

I forked wgsim and made the quality score configurable using the -Q command line argument. I've made I the default value for the quality score (Phred quality score of 40). Feel free to use my fork at to alleviate the need for the post-processing step with awk.

Previously the quality score was scaled with the error rate, so that as the error rate went up, the quality score went down. The precise formula used is (-10.0 * log(ERR_RATE) / log(10.0) + 0.499) + 33 (see Because ERR_RATE is set to 0.02 by default, the quality score comes out to (-10.0 * log(0.02) / log(10.0) + 0.499) + 33 = 50.4887..., which is encoded using the character 2, which has the ASCII character 50.

ADD COMMENTlink written 5.3 years ago by jeff.hammerbacher90
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1881 users visited in the last hour