BioJava: get quality scores of a sequence
4.9 years ago
t-jim ▴ 30

Hello,

I'm trying to parse a FASTQ file with BioJava, and I need to get the quality score for every base of each sequence. So far I have this:

        FastqReader fastqReader = new SangerFastqReader();
        List<DNASequence> sequences = new LinkedList<DNASequence>();
        File in = new File("fastqfile.fastq");
        // Read the file once; each record becomes a DNASequence carrying its quality scores
        for (Fastq fastq : fastqReader.read(in)) {
            DNASequence test = FastqTools.createDNASequenceWithQualityScores(fastq);
            sequences.add(test);
        }
        for (DNASequence seq : sequences) {
            String sequence = seq.getSequenceAsString();
            /* get score sequence */
        }

I looked through the API and I know that the score is stored as a QualityFeature in the DNASequence, but I can't figure out how to get it. I would appreciate your help.

biojava java fastq • 1.2k views
4.9 years ago

Maybe you are making things more complicated than they need to be in your code. Wouldn't this work?

FastqReader fastqReader = new SangerFastqReader();
File in = new File("fastqfile.fastq");
// Read the file once and iterate over its records
for (Fastq fastq : fastqReader.read(in)) {
    String qual = fastq.getQuality();
    for (int i = 0; i < qual.length(); i++) {
        // Subtract the Sanger offset (33) from the character code to get the numeric score
        int q = (int) (qual.charAt(i)) - 33;
        System.err.println(q);
    }
}
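The same conversion can be wrapped in a small helper that returns the scores as an array rather than printing them. This is only a sketch in plain Java: `PhredDecoder` and `decode` are hypothetical names, not part of BioJava, and the offset of 33 assumes Sanger-encoded qualities.

```java
import java.util.Arrays;

// Hypothetical helper, not part of BioJava: decodes a FASTQ quality string into numbers.
class PhredDecoder {

    // Converts each quality character to a score by subtracting the given ASCII offset.
    static int[] decode(String quality, int offset) {
        int[] scores = new int[quality.length()];
        for (int i = 0; i < quality.length(); i++) {
            int score = quality.charAt(i) - offset;
            if (score < 0) {
                // A negative score means the offset does not match the encoding.
                throw new IllegalArgumentException(
                        "Character '" + quality.charAt(i) + "' is below offset " + offset);
            }
            scores[i] = score;
        }
        return scores;
    }

    public static void main(String[] args) {
        // '!' = 33, '5' = 53, 'I' = 73 in ASCII; assumes Sanger offset 33.
        System.out.println(Arrays.toString(decode("!5I", 33))); // prints [0, 20, 40]
    }
}
```

In the loop above, the string passed to `decode` would be the value returned by `fastq.getQuality()`.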

I have already tried that. It gives me the scores as ASCII characters, but I want them as numbers. I use the createDNASequenceWithQualityScores() method because it returns a DNASequence object, converts the ASCII scores into numbers, and stores them in the object. I just need to figure out how to access the scores.


See the edit. Basically, convert each ASCII character to its decimal code and from there to a quality score by subtracting the appropriate offset. Here I subtract 33 to produce Sanger scores. I don't think it is always possible to decide unambiguously, i.e. automatically, which offset should be used, although after reading a few sequences one should be able to tell which encoding was used.
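As a rough illustration of telling the encoding apart after reading a few sequences, the sketch below looks at the smallest and largest quality characters seen. It is a heuristic under stated assumptions, not a definitive test: `OffsetGuesser` and `guessOffset` are hypothetical names, and the thresholds assume that offset-64 encodings never use characters below ';' (ASCII 59) and that offset-33 data rarely goes above 'J' (ASCII 74).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical heuristic, not part of BioJava: guesses the ASCII offset of quality strings.
class OffsetGuesser {

    // Returns 33 or 64 when the characters point to one encoding, or -1 when undecided.
    static int guessOffset(List<String> qualityLines) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (String line : qualityLines) {
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                min = Math.min(min, c);
                max = Math.max(max, c);
            }
        }
        if (min == Integer.MAX_VALUE) {
            return -1; // no quality characters were seen
        }
        if (min < 59) {
            return 33; // characters this low cannot occur with an offset of 64
        }
        if (max > 74) {
            return 64; // assumption: offset-33 scores rarely exceed 41 ('J')
        }
        return -1; // the observed range fits both encodings
    }

    public static void main(String[] args) {
        System.out.println(guessOffset(Arrays.asList("IIII5!"))); // prints 33
        System.out.println(guessOffset(Arrays.asList("hhhh^^"))); // prints 64
        System.out.println(guessOffset(Arrays.asList("@ABCD")));  // prints -1
    }
}
```

The more reads are passed in, the less often the result stays undecided, which matches the point that the offset cannot always be determined automatically.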
