Tutorial:Differentially methylated regions: Easy & Fast Identification and ANnoTation
1
6
Entering edit mode
6.1 years ago
dec986 ▴ 370

Introduction

I would like to show how to use a new program to identify differentially methylated regions (DMRs) from bisulfite sequencing data, whether whole-genome or RRBS. This tutorial won't be very long, as the program is remarkably easy to use, and doesn't require learning a programming language or writing complex scripts as many other DMR packages do. It is able to run whole-genome bisulfite data even on older laptops in under 2 minutes as a single command. The program, Defiant, or "Differential methylation: Easy, Fast, Identification and ANnoTation" not only identifies differentially methylated regions quickly and easily, but also provides annotations and professional, publication-quality images of DMRs as options. More information is available from our recent publication: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2037-1

Defiant was designed to use as few computational resources as possible, and to even negate the necessity of paying for supercomputer/cluster resources.

Defiant automatically identifies input file type from bs_seeker and Bismark. Defiant can read the same input data that Metilene, MethylKit, and BSmooth use. This is meant to take as much work off of the end user as possible, so you don't need to worry about any of that with Defiant.

Downloading and Installation

The program can be downloaded from github: https://github.com/hhg7/defiant and installed with a single shell script: install.sh All you need is an installation of gcc available, which should be on any Linux machine anyway. However, if you don't have gcc installed, simply type sudo apt-get install gcc or ask your superuser/sysadmin to do so for you.

Running Defiant

Running defiant is extremely simple

for the simplest case,

1. ./defiant -i control1.txt,control2.txt case2.txt,case2.txt

but defiant can automatically annotate the DMRs for you if you wish with the -a option:

2. ./defiant -a refFlat.txt -i control1.txt,control2.txt case2.txt,case2.txt

Defiant can read data from both refFlat and Gencode GTF formats. You do not need to specify which version you're using, as defiant automatically identifies which format you're using.

Defiant can generate publication-quality images of the DMRs (this requires an installation of R) with a simple command-line option -xwith your desired x-axis label, e.g. CpG

3. ./defiant -x CpG -a refFlat.txt -i control1.txt,control2.txt case2.txt,case2.txt

Output

Defiant's output is very simple and easy to read (best viewed in an spreadsheet program like Excel). The '-b' option prints bed output.

Chromosome  Start   End #mCpN   #Diff.CpN   Mean_Difference control FeDef   Inside_Genes    Between_Genes   Gene_Promoter_Cutoff_10000_Nucleotides
1   2984116 2984175 5   3   -54.1   [68.2,87.8,94.1]    [35.5,3.5,66.7] Ust n/a Ust
1   3283278 3283785 8   2   26.3    [85.1,70.7,62.0]    [98.1,98.7,95.2]    n/a Ust-125047,985340-Samd5 n/a
1   3332817 3332836 5   4   49.0    [47.1,30.0,32.4]    [99.2,79.0,41.3]    n/a Ust-174586,936289-Samd5 n/a

Defiant can also print out adjusted p-values for each DMR with the -v option if desired.

Conclusion

That's it! More detail is available in the publication mentioned above. We go through excruciating detail comparing out new program to existing DMR packages such as Metilene, MethylKit, MethylSig, RnBeads, and RADMeth, and show that Defiant beats all of them in identifying DMRs quickly and easily. Please let me know if you have any questions.

DMR dna-methylation • 5.2k views
ADD COMMENT
0
Entering edit mode

Very easy to use and fast as it is advertised. It is more sensitive than other methods that it can capture the slight changes. The output results sometimes are somewhat truncated. Maybe can be handled by parameter adjustment.

ADD REPLY
0
Entering edit mode

how do you mean 'truncated'? could you provide a specific example, so if something is wrong, I can fix it?

ADD REPLY
0
Entering edit mode

Thank you for developing this tool. It seems very versatile and user friendly. However, defiant has a problem in reading in my bismark outfiles (both formats). It seems that it is not able to recognize the first "chromosome" column. In your examples, you specify, that chromosome names must have the following format: "chr1", however, my chromosome names are as follows: "NW_017547083.1". If I rename these according to the naming convention required by Defiant, it will disable using a gtf file for annotation, which of course contains "original" chromosome names. Is there a way that will make Defiant to recognize other formats of chromosome names as well? Thank you

ADD REPLY
0
Entering edit mode

Hi @vechtova,

that's interesting, I haven't encountered this before.

The problem is that the "chr" string helps Defiant to identify which input type is present. I think that I can try to change this, but it will take a few days.

As a short workaround, something like the script below can add "chr" to a GTF file, which will have the chromosome first.

#!/usr/bin/env perl

use strict;
use warnings;

open my $old, '<', $ARGV[0];
open my $new, '>', "$ARGV[0].chr";
while (<$old>) {
    if (/^#/) {
        print $new $_;
        next
    }
    print $new "chr$_";
}

this script works like "perl add_chr.pl yourfile.gtf" and outputs "yourfile.gtf.chr".

Please let me know how that works, and in the mean time I will try to change the functionality in Defiant.

ADD REPLY
0
Entering edit mode

Thank you a lot for your help! For the first trial, I changed the name of my chromosome column which helped in the identification of the first column and of the input file. However, my computation ended with an error and I was not able to google what it could mean. The following text is a tail of my run log file:


Beginning analysis.

Worker 1/14 is analyzing run 1/1: sets 1 vs. 2 with cutoffs c = 10 CpN = 5 d = 1 p = 0.05 P = 10.000000 S = 0

Now writing nucleotide_T0_T24.tsv

Gathering p-values for p-value adjustment/FDR for chromosome 1... got a bad point: set = 0, COMPARED_SETS[0] = 0

I will greatly appreciate any advice! Thank you

ADD REPLY
0
Entering edit mode

Hi @vechtova

this error occurs if there is a point with 0 coverage (which will lead to undefined methylation values). In other words, these CpG were not detected. Would you be able to eliminate these from your data? Let me know if this fixes your problem.

ADD REPLY
0
Entering edit mode

Hi @dec986, thank you cordially for your help! I finally managed to adjust my data as you suggested and the program did start processing my data!. Unfortunately, the computation was interrupted with the following error:

"Gathering p-/var/spool/pbs/mom_priv/jobs/5160925.meta-pbs.metacentrum.cz.SC: line 30: 39600 Segmentation fault $DEFIANT -b -cpu 14 -E -f -fdr bh -v bh -x CpG -L T0,T24 -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3"

I perform my calculations remotely on a computer cluster and each of my input file is around 40M lines large after zero coverage lines removal. The cluster administrator things that problem is rather in the program code than in the cluster. I will be very grateful for any advice! Thank you

ADD REPLY
0
Entering edit mode

Hi @vechtova,

Unfortunately, I'm unable to reproduce your exact error with my sample data.

Could you try a script with the following commands:

$DEFIANT -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3
$DEFIANT -L T0,T24 -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3
$DEFIANT -b -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3
$DEFIANT -E -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3
$DEFIANT -f -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3
$DEFIANT -x CpG -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3
$DEFIANT -fdr bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3
$DEFIANT -v bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3
$DEFIANT -fdr bh -v bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3
$DEFIANT -x CpG -fdr bh -v bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3
$DEFIANT -L T0,T24 -x CpG -fdr bh -v bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3

If your files have about 40 million lines each, this shouldn't take more than few minutes on a laptop. It would help me to narrow down the problem to know which one of these runs fails.

ADD REPLY
0
Entering edit mode

Hi @dec986, I ran all the commands that you listed above and, interestingly, the segmentation fault happened in these instances:

$DEFIANT -fdr bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3 $DEFIANT -fdr bh -v bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3 $DEFIANT -x CpG -fdr bh -v bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3 $DEFIANT -L T0,T24 -x CpG -fdr bh -v bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3

It seems to me, that the common denominator here is the -fdr option. If it helps, I can share my dataset with you, but I am not sure how to do it. If there is a way to share privately an email or any other option to exchange data privately? I am not sure how it works here. Thank you

ADD REPLY
0
Entering edit mode
3.4 years ago
dec986 ▴ 370

hi @vechtova are you able to send the files on Google Drive? My email is dec986@gmail.com I have a little more than 6 GB available, that should be more than enough for 6 files if you tarball them ("tar cvjf"). Does that method work for you?

ADD COMMENT

Login before adding your answer.

Traffic: 3101 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6