Question: Changing Quality Scores Within Bam Files
3
gravatar for matted
6.6 years ago by
matted7.0k
Boston, United States
matted7.0k wrote:

I have a set of bam files that include reads with quality scores in a mix of incompatible formats. This includes Solexa, Illumina 1.5, and Sanger encoding. The reads were aligned with bwa, so as far as I understand the misspecified quality scores didn't affect the alignment and were just copied through to the final bam file.

Now I'd like to call variants with these datasets using samtools mpileup, but I am stuck because the quality scores in all input files need to have the same encoding. That is, I am aware of the -6 flag to mpileup that would work if all the samples were in Illumina 1.5 format, but it's not applicable since I have a mix.

So my question is: can I do anything better than the brute force approaches of (1) fixing the original fastq files and realigning or (2) mucking through the bam files and changing quality scores myself with e.g. pysam?

fastq bam samtools mpileup • 5.9k views
ADD COMMENTlink written 6.6 years ago by matted7.0k

Maybe merging the three data sets isn't a good idea, I think that you cannot improve your variant calling because each technology has particular bias in errors and coverages. I suggest you to analyze each set independently and report the combination of high quality calls in each data set.

ADD REPLYlink written 6.6 years ago by JC7.6k
1

Point well taken, and I'll be on the lookout for that. In this instance, the mixing is more a function of data source (SRA with Sanger quality scores versus in-house sequencing with Illumina 1.5 quality scores) than technology (but certainly there are center-specific effects that might be at play).

ADD REPLYlink written 6.6 years ago by matted7.0k
1
gravatar for Istvan Albert
6.6 years ago by
Istvan Albert ♦♦ 79k
University Park, USA
Istvan Albert ♦♦ 79k wrote:

You wouldn't need to go through pysam. Just convert the BAM to SAM then rescale the 11th column and convert back to BAM. You won't need to even sort again probably not even index, though the latter is usually not that time consuming.

ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by Istvan Albert ♦♦ 79k

Thanks, this is basically what I ended up doing. Since the bams were compressed, I did need to re-index however... the changed quality scores can mildly affect compression and change file offsets.

ADD REPLYlink written 6.6 years ago by matted7.0k
1
gravatar for Lee Katz
6.6 years ago by
Lee Katz2.9k
Atlanta, GA
Lee Katz2.9k wrote:

I'm not positive, but you can probably use samtools merge. It'd be easy to tell if it works after looking at IGV or Tablet.

But then again, is it is that hard to write a script that parses the quality scores column and adds or subtracts 33? Here's my untested/pseudo code:

samtools view file.bam | \
perl -lane 'if('/^#/){print; next;}  
  @qual=split //, $F[$qualIndex]; $_=chr(ord($_)+33) for(@qual); $F[$qualIndex]=join("",@qual); print join("\t",@F)' | \
samtools -bS > file.corrected.bam
ADD COMMENTlink written 6.6 years ago by Lee Katz2.9k

the principle is correct but it will be more than just adding 33 - they will first need to detect from the read id which format the data is in then convert this encoded value to a numerical quality value and then re-encode that into Sanger encoding.

ADD REPLYlink written 6.6 years ago by Istvan Albert ♦♦ 79k

Right, thank you for clarifying. The OP already knows which BAM file has which encoding, and so the one-liner will have to be run for each file that he/she wants to convert. The end result would be a set of BAM files with one encoding scheme for quality scores.

ADD REPLYlink written 6.6 years ago by Lee Katz2.9k
2

Thanks, that's correct. I targeted a specific re-scaling to each bam. samtools merge won't change the quality scores so it doesn't help. Just a quick clarification to your example, which is basically what I did, is to output the header so that the conversion back to bam works:

samtools view -h in.bam | rescale_quals.py | samtools view -bS - > out.bam
ADD REPLYlink written 6.6 years ago by matted7.0k

Could you share this rescale_quals.py script? I am in a similar position. Thank!

ADD REPLYlink written 6.4 years ago by sahiilseth30

Hi, where could I find the rescale_quals.py script? I am in the same position. Thanks a lot.

ADD REPLYlink written 5.6 years ago by jiw090030
1

Here it is (sadly I'm too lazy to post to github or anything better...):

#!/usr/bin/python
import sys

def process(fname):
    fin = open(fname)
    for line in fin:
        if line[0] == '@':
            print line,
            continue
        line = line.strip().split("\t")
        if len(line[10]) != len(line[9]):
            print >>sys.stderr, "number of quality scores does not match number of bases: [%s] vs. [%s]" % (line[9], line[10])
            line[10] = 'O' * len(line[9])
        line[10] = "".join([chr(ord(c)-31) for c in line[10]])
        print "\t".join(line)
    fin.close()

if __name__ == "__main__":
    if len(sys.argv) <= 1:
        print >>sys.stderr, "usage (converts fastq 1.3/1.5 to sanger): samtools view -h in.bam | bam_rescale_quals.py - | samtools view -bS - > out.bam"
    else:
        if sys.argv[1] == "-":
            sys.argv[1] = "/dev/stdin"
        process(sys.argv[1])
ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by matted7.0k

Cool. Thanks a lot.

ADD REPLYlink written 5.6 years ago by jiw090030

Just to clarify, this script converts the Illumina1.5+ (PHRED64+) to Sanger encoding (Phred33+), right?

ADD REPLYlink written 3.3 years ago by lilepisorus30

"usage (converts fastq 1.3/1.5 to sanger): samtools view -h in.bam | bam_rescale_quals.py - | samtools view -bS - > out.bam"

So I guess, yea.

ADD REPLYlink written 2.3 years ago by jfo10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1250 users visited in the last hour