Question: Filter low quality reads in bam files
gravatar for ammarsabir15
4.2 years ago by
ammarsabir1560 wrote:

I an trying to filter low quality reads in bam file using python's pysam. I have used the code from here.

I have modified this code a little and the whole code is shown below, but the code is not giving any bam file ,

import argparse,pysam,re,sys

def FilterReads(in_file, out_file):

    def read_ok(read):
          read_ok - reject reads with a low quality (<30) base call
          read - a PySam AlignedRead object
          returns: True if the read is ok
          if any([ord(c)-33 < _BASE_QUAL_CUTOFF for c in list(read.qual)]):
                 return False
                return True


bam_in = pysam.Samfile(in_file, 'rb')
bam_out = pysam.Samfile(out_file, 'wb', template=bam_in)

out_count = 0
for read in bam_in.fetch():
    if read_ok(read):
        out_count += 1

print 'reads_written =', out_count


def GetArgs():
GetArgs - read the command line
returns - an input bam file name and teh output filtered bam file

def ParseArgs(parser):
    class Parser(argparse.ArgumentParser):
        def error(self, message):
            sys.stderr.write('error: %s\n' % message)

    parser = Parser(description='Calculate PhiX Context Specific Error Rates.')
    parser.add_argument('-b', '--bam_file',
                        help='Input Bam file.')
    parser.add_argument('-o', '--output_file',
                        help='Output Bam file.')
    return parser.parse_args()

parser = argparse.ArgumentParser()
args = ParseArgs(parser)

return args.bam_file, args.output_file

def Main():
        bam_file, output_file = GetArgs()
FilterReads(bam_file, output_file)

 if __name__ == '__main__':
pysam next-gen python • 2.7k views
ADD COMMENTlink modified 4.2 years ago by Brian Bushnell17k • written 4.2 years ago by ammarsabir1560

I think you need to explain in more detail what you are trying to accomplish. Are you trying to reject any read with any base that has a quality score under 30? If so, I suggest you rethink your approach, because I can't imagine a scenario where that's a good idea.

ADD REPLYlink written 4.2 years ago by Brian Bushnell17k

Thanks Bushnell, Yes I want to reject reads having quality score < 30. I want to do it using python's pysam. I you have a better approach please share it.

ADD REPLYlink written 4.2 years ago by ammarsabir1560

No, I don't have a better approach, because I think removing any read with any base under Q30 is a terrible idea and will lead to extreme sequence-specific coverage bias. I have written tools to remove reads with average quality (based on expected error rates from the quality scores) below a certain level, but that should be used conservatively to avoid sequence-specific bias. I cannot imagine a scenario where it would be a good idea to remove every read that has a single base below a certain quality score, so I won't suggest a way to do it unless you can explain why you want to do so. I suggest you quality-trim the reads and reject reads with length under a specified value prior to mapping. You can do so like this, with the BBMap package (assuming interleaved reads): in=reads.fq out=trimmed.fq qtrim=r trimq=12 minlen=125

You can set trimq to 30 if you want, but again, I cannot imagine a situation where that would be a good idea. For this command, minlen should be set to some number less than your read length; e.g. for 150bp reads, maybe set it to 100.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Brian Bushnell17k

I am developing a variant caller for detecting heteroplasmy in ngs datasets which will take bam file as input according to workflow. As the link is not working so I have added an image , see the galaxy naive variant caller portion in that figure For calculating heteroplasmy they filtered bases having <30 score.

ADD REPLYlink written 4.2 years ago by ammarsabir1560

It might be a terrible idea in some cases, but when working with ancient DNA, which might have sequencing errors due to oxidative and hidrolytic damage as to contamination, you might want to trim bases below certain quality (usually <30), anyways, i just write this in order to enrich the discussion, i don't know what kind of data ammarsabir is (was) working with.

if in further data handling you have a vcf step, it might be easier to just keep all the data in the bam file, and then trim the reads you want to using vcftools, like following:

vcftools --vcf yourfile.vcf --minQ 30 --recode --recode-INFO-all --out yourtrimmedfile

there's no need to put any extention in the last file name, since the tool writes a default extention.


ADD REPLYlink written 2.9 years ago by ricfoz30
gravatar for Brian Bushnell
4.2 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

I recommend completely ignoring the Academy of Sciences, if they suggest ignoring all reads with a single base under Q30. That's a recipe for disaster, since all platforms have sequence-related quality; so, your filtered reads would be extremely biased.

You can, if you wish, trim low-quality reads with the BBMap package like this: in=reads.fq out=trimmed.fq qtrim=r trimq=12

That will optimally trim the low-quality tails of reads, while biasing the output as little as possible. It takes no special precautions to avoid bias, but none of the operations add nearly as much bias as a simple filter for reads with a single base under Q30 would. You can, of course, set trimq to 30 and get bad data, but I don't recommend it.

ADD COMMENTlink written 4.2 years ago by Brian Bushnell17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1306 users visited in the last hour