Question: Fastq files with integer instead of ASCII quality scores
0
bgbrink60 wrote:

I was going to align a bunch of old fastq files with bwa and got no results. When I looked into the files, I saw that the base qualities are reported as integers rather than ASCII characters:

@1_21_9:1:2:1565:591
GTGTTGTTTAGAAGCTGAACTACCTTTTTCGCTGAG
+1_21_9:1:2:1565:591
 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 31 5 40 40 1 40 15 40 40 40 40 40 4 2 40 40 15 1 39
@1_21_9:1:2:1307:745
GATCGGAAGAGCTCGTCTGCCGTCTTCTGCTTTGCT
+1_21_9:1:2:1307:745
 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 4 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 -2 1 1 1

Has anyone seen this encoding before, and does anyone know of a tool that can convert it into proper fastq?

Note that there are negative values as well. Could this be old Solexa quality scores?
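
A quick way to probe this is to scan the score lines for their minimum and maximum: Phred scores cannot be negative, whereas Solexa scores go down to -5. The following is only a minimal sketch, assuming 4-line records with space-separated integer scores as in the excerpt above; the function name `score_range` and the shortened demo records are made up for illustration.

```python
def score_range(lines):
    """Return (min, max) over all integer scores found on every 4th line."""
    lowest, highest = None, None
    for index, line in enumerate(lines):
        if index % 4 != 3:  # only the 4th line of each record holds scores
            continue
        for token in line.split():
            score = int(token)
            lowest = score if lowest is None else min(lowest, score)
            highest = score if highest is None else max(highest, score)
    return lowest, highest


# Shortened versions of the two records quoted above (demo data only).
example = [
    "@1_21_9:1:2:1565:591", "GTGTTG", "+1_21_9:1:2:1565:591", " 40 31 5 40 1 15",
    "@1_21_9:1:2:1307:745", "GATCGG", "+1_21_9:1:2:1307:745", " 40 4 40 -2 1 1",
]
print(score_range(example))
```

A minimum below zero rules out a plain Phred scale, but it does not prove the data are Solexa scores; it only makes that reading more plausible.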

sequencing • 2.3k views
modified 2.3 years ago by liartom210 • written 2.6 years ago by bgbrink60

That file does not meet the fastq format definition. Where did you get this data, BTW? Do you know what technology it is from?

written 2.6 years ago by genomax91k

I have seen GAIIx data that was in separate sequence and score (as integers) files. Maybe somebody just mashed them together without knowing that they need to be encoded...

written 2.6 years ago by cschu1812.5k

That could be it. I don't have any hard proof of which technology this data comes from, though. Does it still make sense to try to convert the scores manually?

written 2.6 years ago by bgbrink60

If you have a clue which encoding/phred scale is used you could convert it to a sane fastq, using some scripting. Alternatively you could just convert it to a fasta file and forget about the quality scores...
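
The FASTA fallback mentioned above could look roughly like this. It is a minimal sketch, assuming strict 4-line records whose header lines start with `@`; the function name `fastq_to_fasta` is hypothetical and not part of any existing tool.

```python
def fastq_to_fasta(lines):
    """Yield FASTA lines from 4-line FASTQ-like records, dropping the scores."""
    records = [line.rstrip("\n") for line in lines]
    # Step through complete 4-line records; a trailing partial record is ignored.
    for start in range(0, len(records) - 3, 4):
        header, sequence = records[start], records[start + 1]
        if not header.startswith("@"):
            raise ValueError("expected '@' header at line %d" % (start + 1))
        yield ">" + header[1:]
        yield sequence
```

For a file on disk, something like `print("\n".join(fastq_to_fasta(open("myfile.fastq"))))` would write the FASTA to standard output. The quality information is lost, so this only makes sense if the downstream tool can work without it.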

written 2.6 years ago by WouterDeCoster44k
3
France
sacha2.0k wrote:

It seems you are using Solexa+64 encoding (scores from -5 to 40). You can easily convert the scores to ASCII with the help of the following picture. [image not preserved]

modified 2.6 years ago • written 2.6 years ago by sacha2.0k
2
France
sacha2.0k wrote:

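I did it for you with awk:

awk -f convert.awk myfile.fastq

# convert.awk
# Turn one integer score into its ASCII character (score + 64).
function toascii(score)
{
    return sprintf("%c", score + 64)
}

# Line 1 of each record: header, printed unchanged.
(NR-1) % 4 == 0 {
    print $0
}

# Line 2: sequence, printed unchanged.
(NR-1) % 4 == 1 {
    print $0
}

# Line 3: separator, reduced to a bare "+".
(NR-1) % 4 == 2 {
    print "+"
}

# Line 4: space-separated scores, encoded one character per score.
(NR-1) % 4 == 3 {
    for (i = 1; i <= NF; i += 1)
    {
        printf("%s", toascii($i))
    }
    printf("\n")
}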
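
Note that the script above writes Solexa+64 characters, while current aligners generally assume Phred+33 by default. If Phred+33 output is wanted instead, the scores have to be rescaled before encoding. The following is only a sketch, assuming the integers really are Solexa-scaled (which this thread suspects but does not confirm) and using the standard relation Q_phred = 10 * log10(10^(Q_solexa/10) + 1); all function names are made up for illustration.

```python
import math


def solexa_to_phred(q_solexa):
    """Map a Solexa-scaled score to the nearest integer Phred score."""
    return int(round(10 * math.log10(10 ** (q_solexa / 10) + 1)))


def encode_quality_line(line, offset=33):
    """Turn space-separated Solexa integers into a Phred+33 quality string."""
    return "".join(
        chr(solexa_to_phred(int(token)) + offset) for token in line.split()
    )


def convert(lines):
    """Yield 4-line FASTQ records with a bare '+' and Phred+33 qualities."""
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        position = index % 4
        if position == 2:
            yield "+"  # drop the repeated read name on the separator line
        elif position == 3:
            yield encode_quality_line(line)
        else:
            yield line  # header and sequence pass through unchanged
```

Something like `print("\n".join(convert(open("myfile.fastq"))))` would then emit a file in the usual Phred+33 layout. At the high end the two scales agree (a Solexa 40 stays 40), while low scores shift upward (a Solexa -5 becomes Phred 1).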
modified 2.2 years ago • written 2.6 years ago by sacha2.0k
1
liartom210 wrote:

for (i=1; i < NF ; i+=1)

i <= NF, my dude

written 2.3 years ago by liartom210
2

Hi liartom2 ,

This reply is better suited as a comment on sacha's answer. Could you make the appropriate change please? That would involve the following steps:

Thank you!

written 2.3 years ago by RamRS30k