Question: ZeroDivisionError: float division by zero, from "methratio.py" in BSMAP
1
gravatar for atoogoe
4.9 years ago by
atoogoe30
United States
atoogoe30 wrote:

Help needed, I was trying to extract methylation ratios from a "bam", which was generated by bisulphite sequence mapping with BSMAP. I am getting "ZeroDivisionError" and been wondering what might be happening. This is the command I issued:

python methratio.py --ref=rn5.fa --out=brain_control_1.txt --context=CG --pair --zero-meth brain_control_1.bam

And this was the message from std output:

[methratio] @Mon Sep 8 17:12:52 2014 loading reference file: rn5.fa ...

[methratio] @Mon Sep 8 17:16:17 2014 reading brain_control_1.bam ...

[methratio] @Mon Sep 8 17:17:19 2014 read 10000000 lines [

methratio] @Mon Sep 8 17:17:34 2014 read 12309445 lines

[methratio] @Mon Sep 8 17:17:34 2014 writing brain_control_1.txt ...

Traceback (most recent call last): File "methratio.py", line 253, in disp('total %d valid mappings, %d covered cytosines, average coverage: %.2f fold.' % (nmap, nc, float(nd)/nc)) ZeroDivisionError: float division by zero

This is a snippet of the bam file:

SRR955121-NP070.24669651 0 chr1 32402 255 36M * 0 0 TCTGCTCGAAAAAGTTAGGTGTGAGGAGCTGTCCTC ____K_aaaa^f^ff_\\_Zfffacaddda\VY_^B NM:i:1 ZS:Z:++ SRR955121-NP070.2905526 0 chr1 32406 255 36M * 0 0 CTCGAAAGAGTTAGGTGTGAGGAGCTGTCCTCACAT hhhhhhhhhhhhhhhghhghhhghhhgghhhhhhhh NM:i:0 ZS:Z:++ SRR955121-NP070.94067 16 chr1 32538 255 36M * 0 0 CTTCCCTACTTAAAGGGACTCGTACGCTTTGGTATG _hcgefhhfaghhhhhhhhhhhhhhhehhhhhfhhh NM:i:0 ZS:Z:-+ SRR955121-NP070.23202665 16 chr1 32549 255 36M * 0 0 AAAGGGACTCGTACGCTTTGGTATGTCATCTGGCAT hhhhhhffadfchhgehhhhhhhfhhghhhhhhhhh NM:i:0 ZS:Z:-+ SRR955121-NP070.19539704 16 chr1 32566 255 36M * 0 0 TTGGTATGTCATCTGGCATAATTGGCAGCGACCACT fhghhhhhhhfhhghhhhhhhhhhhhhhhhhhhhhh NM:i:0 ZS:Z:-+ SRR955121-NP070.1743333 16 chr1 32568 255 36M * 0 0 GGTATGTCATCTGGCATAATTGGCAGCGACCACTCA gffdfRfggggfaffagdfggddggfffaffffecc NM:i:0 ZS:Z:-+ SRR955121-NP070.6873790 16 chr1 32568 255 36M * 0 0 GGTATGTCATCTGGCATAATTGGCAGCGACCACTCA hhhhhhhhffhfhhhhhhhhhhhhhhhhhhhhhhhh NM:i:0 ZS:Z:-+ SRR955121-NP070.16344226 16 chr1 32572 255 36M * 0 0 TGTCATCTGGCATAATTGGCAGCGACCACTCAGAGA ghhhhhfffffQhhhhhhhhhhhhhhhhhhhhhhhh NM:i:0 ZS:Z:-+ SRR955121-NP070.14765321 16 chr1 32589 255 36M * 0 0 GGCAGCGACCACTCAGAGAGTGACACGTCCTAGGTG hghhhhhghh_hghhhhhhhhhhhhhhhhhhhhhhh NM:i:0 ZS:Z:-+ SRR955121-NP070.3490702 16 chr1 32598 255 36M * 0 0 CACTCAGAGAGTGACACGTCCTAGGTGAGGATAGTT hghhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh NM:i:0 ZS:Z:-+ SRR955121-NP070.7817525 16 chr1 32619 255 36M * 0 0 TAGGTGAGGATAGTTTCCTATATAAGGGATGGTTAT fhhQhhghhhhhhhhhhhhhhhhhhhhhhhhhhhhh NM:i:0 ZS:Z:-+ SRR955121-NP070.16151845 16 chr1 32619 255 36M * 0 0 TAGGTGAGGATAGTTTCCTATATAAGGGATGGTTAT c[cccca_ca\aaaaaac\_c]_ZYIccacacaYaa NM:i:0 ZS:Z:-+ SRR955121-NP070.5635950 16 chr1 32621 255 36M * 0 0 GGTGAGGATAGTTTCCTATATAAGGGATGGTTATTC hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh NM:i:0 ZS:Z:-+ SRR955121-NP070.12453704 16 chr1 32738 255 36M * 0 0 GTGCCGCATCGTTCCTGCTGGCGAGATGTAGCGTGG hhhhhhghchgfghhhhhhhhhhhhhhhhhhhhhhh NM:i:0 ZS:Z:-+ SRR955121-NP070.25129342 16 chr1 32738 255 36M * 0 0 GTGCCGCGTCGTTCCTGCTGGCGAGATGTAGCGTGG cgggggffffagagggcgfgcgggggdgfgfggfgg NM:i:0 ZS:Z:-+ SRR955121-NP070.2405969 16 chr1 32739 255 36M * 0 0 TGCCGCATCGTTCCTGCTGGCGAGATGTAGCGTGGG hhhhhhhhhhhhhhhghhghhhhhhhghhhhhghhh NM:i:0 ZS:Z:-+

I will appreciate it very much if you could help me out here. My project is stuck because of this problem. I have sent an email to the authors of BSMAP and still waiting for a response.

Thanks!

ADD COMMENTlink modified 3.7 years ago by vibes100200330 • written 4.9 years ago by atoogoe30
1

Find that line and comment it out. Put a # in front of it.

ADD REPLYlink written 4.9 years ago by Istvan Albert ♦♦ 80k

:) I like that solution, but I think it's actually telling him that the program is doing nothing useful.

ADD REPLYlink written 4.9 years ago by Matt Shirley9.0k

I know nothing about the particular software, but the error message is giving you a clue. When the script trys to divide nd by nc you get a ZeroDivisionError (i.e. nc, which appears to be the number of covered cytosines, is 0).  If this is just a "progress" message you could comment it out an skip it as Istvan suggests, but be ware it may indicate trouble with your bam or reference file.

ADD REPLYlink written 4.9 years ago by David W4.7k

Thanks, Matt! What's the best way out?

ADD REPLYlink written 4.9 years ago by atoogoe30

I believe that contacting the authors would be best. 

ADD REPLYlink written 4.9 years ago by Matt Shirley9.0k

This happens whenever a chromosome has no coverage. The easiest solution is to just change the code to check nc>0 before printing.

BTW, if you were asking me about this on SEQanswers and I forgot to reply then sorry about that. It's easy to lose track of threads there.

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Devon Ryan91k

Thanks, Devon, I will give it a try and get back to you.

ADD REPLYlink written 4.9 years ago by atoogoe30

So I modified the code to print only when "nc >0". The script ran all right but the output file was blank, save the headers. What this is telling me is that there was no coverage in the ENTIRE genome, not just particular chromosome(s)! Is this normal?

ADD REPLYlink written 4.9 years ago by atoogoe30
1

That'd be pretty unusual. Can you confirm that the chromosomes names in the fasta file are identical to those in the BAM header (just samtools view -H brain_control_1.bam)? The most likely cause of this is aligning against an Ensembl reference and then using a UCSC reference in this step (or vice versa).

ADD REPLYlink written 4.9 years ago by Devon Ryan91k

I've checked. The bam file header has apprx 2739 chromosome entries. I used the same fasta file (downloaded from UCSC), which was used for the mapping. 

ADD REPLYlink written 4.9 years ago by atoogoe30
1

Very strange. Can you post the BAM file somewhere (google drive, drop box, etc.)? That'd make it easier to just track a couple reads and see where the problem is occurring.

ADD REPLYlink written 4.9 years ago by Devon Ryan91k

Thanks! Here is it:

https://www.dropbox.com/s/cs1q26hpx3zqw10/brain_control_1.bam?dl=0

ADD REPLYlink written 4.9 years ago by atoogoe30
2
gravatar for Devon Ryan
4.9 years ago by
Devon Ryan91k
Freiburg, Germany
Devon Ryan91k wrote:

That dataset isn't paired-end, so using the --pair argument is telling it to ignore all of the alignments. Just don't use --pair (I've confirmed that this solves the problem).

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Devon Ryan91k

Thanks a lot! May I know how to determine whether a dataset is paired-end?

ADD REPLYlink written 4.9 years ago by atoogoe30

If you're downloading it from GEO/SRA/etc. then it'll say (on SRA, it's the "Layout" field under "Library"). If there aren't multiple files with names like something_1.fq.gz and something_2.fq.gz then it's probably single-end. When in doubt, ask whoever made the dataset.

ADD REPLYlink written 4.9 years ago by Devon Ryan91k

Thanks a lot!

ADD REPLYlink written 4.9 years ago by atoogoe30

Eternal gratitude, Devon. I am getting results now!

ADD REPLYlink written 4.9 years ago by atoogoe30

Hey Devon. If you have a minute, could you explain how the code is actually determining from the SAM flag that the reads are paired? I'm not sure how testing equality with 'P' for a bitwise flag returns anything meaningful. Thanks.

ADD REPLYlink written 4.9 years ago by Matt Shirley9.0k
1

EDIT :wait scratch that -

Now I recall that there is a string representation of flags as well. One that has characters for flags instead of bitwise integers.

You can see that with the -X flag to samtools view

$ samtools view -X accepted_hits.bam | head -4 | cut -f 2
rs
rs
rs

 

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Istvan Albert ♦♦ 80k
1

Nice! But in samtools 1.0 it seems the -X options has been removed:

samtools view -X aln.bam | head
view: invalid option -- X

 

ADD REPLYlink written 4.9 years ago by dariober10k

As Istvan correctly surmised, that script parses the string representation of the flag (don't ask me why someone would think this is programmatically simpler than just doing bit comparisons). The issue is this line:

if options.pair and 'P' not in flag: return []

options.pair is set by --pair and the P value means that the 0x2 bit in the flag is set. One thing to note is that this character representation of the flag is no longer supported in samtools, so I would consider this broken code (Edit: it seems that dariober mentioned this while I was editing methratio.py)!

I find this and the fact that the authors of this tools have been slow to reply to atoogoe quite annoying, so I've just tweaked their code to make it more sensible. A better version of their program that supports both the old and new versions of samtools and doesn't incorrectly deal with the --pair option can be found here.

Edit2: I've submitted a bug report with a link to the the github repo. Hopefully the authors will actually see that some day.

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Devon Ryan91k

Wonderful! Thanks for the explanation - this (code that seems broken but apparently has worked) was bugging me more than it should have. Good work.

ADD REPLYlink written 4.9 years ago by Matt Shirley9.0k
1
gravatar for Matt Shirley
4.9 years ago by
Matt Shirley9.0k
Cambridge, MA
Matt Shirley9.0k wrote:

It looks like BSMAP gave you zero converted cytosines (the "nc" variable in the code above) which in float(nd)/nc produces the error. If this is an accurate representation of a line in your BAM file:

 

SRR955121-NP070.24669651 0 chr1 32402 255 36M * 0 0 TCTGCTCGAAAAAGTTAGGTGTGAGGAGCTGTCCTC ____K_aaaa^f^ff_\\_Zfffacaddda\VY_^B NM:i:1 ZS:Z:++

 Then it looks like you have a valid SAM file. However, looking at the rest of the methratio.py script it's clear that whoever wrote it was a Perl programmer at heart :). It's doing some useful work, but there are just too many places in the script where I go "okay, okay, oh... hmmm... why does that work...?" 

The biggest problem is that it ​seems like the script is matching the FLAG (column 2) as a string, and not bitwise or integer values:

This leads me to believe that for your valid SAM file this script is not actually doing anything at all.

I hope you were reading this far because I think I found your issue. It seems like you specify "--pair", and your reads don't appear paired based on their FLAGs. However, my above analysis leads me to believe you'll have no luck without the option.

 

ADD COMMENTlink written 4.9 years ago by Matt Shirley9.0k
1
gravatar for dariober
4.9 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

I've written a program to extract methylation from bam files, bam2methylation.py. The implementation is a bit clunky but it should do the right thing for strand specific BS-Seq data. Essentially it parses the output of samtools mpileup to extract methylation calls.

There are optional arguments to filter reads and base qualities and to get the the number of mismatches.

Example usage:

bam2methylation.py -mq 13 -mm -A --samargs '-q 15' -i $bam -r $ref -l $bed > $out.bdg

This is from the prgram's help:

If you want to give it a go I'd be happy to hear about comments, bugs etc...

bam2methylation.py -h
usage: bam2methylation.py [-h] --input INPUT --ref REF [--l L] [--A]
                          [--samargs SAMARGS] [--region REGION] [--mismatch]
                          [--minq MINQ] [--keeptmp] [--version]

DESCRIPTION
    Extract methylation calls from BAM file. Only positions with non-zero count are printed.
    If you want to get only the Cs in a certain context (e.g. CpG) you need to pass
    a bed file where these positions are. For example, to get all the
    CpGs you could use:
    fastaRegexFinder.py -f mmu.fa -r CG --noreverse > mm9.allcpg.bed
    (Possibly pipe to `cut -f1-3 | gzip` to make the file smaller)

OUTPUT:
   bedGraph with columns:
    <chrom>  <pos-1>  <pos>  <pct meth'd>  <cnt methylated>  <tot count>  <strand> [cnt mismatch]
    Memo: bedGraph is 0-based, so if the first base of chr1 is C it will have position: `chrom 0 1 ... +`
   Output is sorted by chromosome and position.
   
   With --mismatch option an additional column of mismatch count is after <tot count>.
   This is the number of A or G when the reference is C.
   The percent mismatch is given by: $7/($6+$7)

REQUIRES:
      - BAM file sorted and indexed
      - samtools on path
      - bedtools on path
      - Unix sort and awk

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT, -i INPUT
                        Input bam file.
                                           
  --ref REF, -r REF     Reference fasta file.
                                           
  --l L, -l L           Bedfile with intervals where pileup should be generated.
                        Passed to `samtools mpileup -l <>`
                                           
  --A, -A               Passed to mpileup: Count anomalous read pairs. Default
                        is to exclude them.
                                           
  --samargs SAMARGS, -s SAMARGS
                        String of optional arguments passed to `samtools view` to filter
                        reads. Put this string in quotes leaving a space after opening quote. See also --region.
                        E.g. -s ' -q 15 -F 256'
                                           
  --region REGION, -R REGION
                        Region passed to `samtools view` to extract reads from.
                        E.g. -R 'chr1:0-10000'
                                           
  --mismatch, -mm       Insert a column of mismatches after the tot methylation count.
                                           
  --minq MINQ, -mq MINQ
                        Minimum base quality required to consider the base a
                        methylation call or a mismatch.
                                           
  --keeptmp             Keep tmp dir. Use for debugging.
                                           
  --version             show program's version number and exit

 

ADD COMMENTlink written 4.9 years ago by dariober10k

Thanks a lot, dariober! I'll give it a go. I really appreciate your input!

ADD REPLYlink written 4.9 years ago by atoogoe30
0
gravatar for atoogoe
4.9 years ago by
atoogoe30
United States
atoogoe30 wrote:

But, while I await response from the authors, is there an alternate way of extracting the methylation ratios/frequencies from the bam file? Thanks!

ADD COMMENTlink written 4.9 years ago by atoogoe30

(See my answer below but I think this should have been a question on its own...)

ADD REPLYlink written 4.9 years ago by dariober10k
0
gravatar for vibes1002003
3.7 years ago by
vibes100200330
Germany
vibes100200330 wrote:

I would like to know reason why number of depth reads supporting particular methylation site in methratio.py does not cross 65,535? As I found in line 160 and 161 of methratio.py (a python program in bsmap to call methylation sites):

except OverflowError: depthcr1[index] = 65535
elif seq[index-pos] == rc_match and depthcr1[index] < 65535

is it any specific reason? if particular methylation site supported by more than this number of reads, how will I get?

 

  
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by vibes100200330
1

The array that those go into has a maximum value of 65535, since it's an unsigned short. In reality, you should ignore any site with coverage even a fraction of that, since it's due to PCR duplication.

ADD REPLYlink written 3.7 years ago by Devon Ryan91k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 975 users visited in the last hour