Question: How To Filter Out Reads With A Bad Cigar M Operator
1
gravatar for munch
5.6 years ago by
munch300
Munich
munch300 wrote:

My aim was to run freebayes on my .bam files generated with bwa, but I got the error:

Error: cannot construct subsequence with negative offset or length < 1(attempting start = 66 and length = -8)

So I think there is something wrong with my .bam file. The output of ValidateSamFile.jar from picard is like this:

ERROR: Read groups is empty
ERROR: Record 42921, Read name SRR033861.69041, CIGAR M operator maps off end of reference
ERROR: Record 42922, Read name SRR033861.809038, CIGAR M operator maps off end of reference
ERROR: Record 42923, Read name SRR033861.607221, CIGAR M operator maps off end of reference
ERROR: Record 42949, Read name SRR033861.481003, CIGAR M operator maps off end of reference
...

I have tried to filter this reads out with samtools view -f 0x200 myfile.bam to filter out reads not passing quality controls, but there is no read with this flag.

Is there any tool to filter out this violation of the .bam file? Or you know the reason for the Error message?

Thank you for your efforts in advance

Phil

cigar picard samtools bam • 2.9k views
ADD COMMENTlink modified 5.6 years ago by KCC3.9k • written 5.6 years ago by munch300

Can you please share. how did you fix this ?

ADD REPLYlink written 4.3 years ago by Chirag Nepal2.2k
1
gravatar for KCC
5.6 years ago by
KCC3.9k
Cambridge, MA
KCC3.9k wrote:

Looks like the read is on the end of the chromosome, so start position plus read length is longer than the reference. Perhaps you can just write a script that looks to see what the start position of the read is and drops it if it's longer than the length of the chromosome. To be fully correct, I could parse the CIGAR expression for the exact length of your read in the reference, but I'm going to ignore that for now.

    #syntax: python thisprogram.py infile.sam <read length> outfile.sam

    import sys

    #chromosome ends
    chromosomes = {'chr1': 2000000, 'chr2': 5000000}

    input = open(sys.argv[1])

    readlength = int(sys.argv[2])

    output = open(sys.argv[3],'w')

    for line in input:
            if (not line.startswith('@')):
                    data = line.strip().split("\t")

                    chr   = data[2]
                    start = int(data[3])

                    end = start + readlength

                    if(end >= chromosomes[chr]):
                            continue

                    output.write(line)
            else:
                    output.write(line)

This code is untested. It should work however. You could add the code to parse the CIGAR string with a little more work if that's needed.

ADD COMMENTlink written 5.6 years ago by KCC3.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 754 users visited in the last hour