I would like to show how to use a new program to identify differentially methylated regions (DMRs) from bisulfite sequencing data, whether whole-genome or RRBS. This tutorial won't be very long, as the program is remarkably easy to use, and doesn't require learning a programming language or writing complex scripts as many other DMR packages do. It is able to run whole-genome bisulfite data even on older laptops in under 2 minutes as a single command. The program, Defiant, or "Differential methylation: Easy, Fast, Identification and ANnoTation" not only identifies differentially methylated regions quickly and easily, but also provides annotations and professional, publication-quality images of DMRs as options. More information is available from our recent publication: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2037-1
Defiant was designed to use as few computational resources as possible, and to even negate the necessity of paying for supercomputer/cluster resources.
Defiant automatically identifies input file type from bs_seeker and Bismark. Defiant can read the same input data that Metilene, MethylKit, and BSmooth use. This is meant to take as much work off of the end user as possible, so you don't need to worry about any of that with Defiant.
Downloading and Installation
The program can be downloaded from github: https://github.com/hhg7/defiant and installed with a single shell script:
install.sh All you need is an installation of
gcc available, which should be on any Linux machine anyway. However, if you don't have
gcc installed, simply type
sudo apt-get install gcc or ask your superuser/sysadmin to do so for you.
Running defiant is extremely simple
for the simplest case,
./defiant -i control1.txt,control2.txt case2.txt,case2.txt
but defiant can automatically annotate the DMRs for you if you wish with the
./defiant -a refFlat.txt -i control1.txt,control2.txt case2.txt,case2.txt
Defiant can read data from both refFlat and Gencode GTF formats. You do not need to specify which version you're using, as defiant automatically identifies which format you're using.
Defiant can generate publication-quality images of the DMRs (this requires an installation of
R) with a simple command-line option
-xwith your desired x-axis label, e.g.
./defiant -x CpG -a refFlat.txt -i control1.txt,control2.txt case2.txt,case2.txt
Defiant's output is very simple and easy to read (best viewed in an spreadsheet program like Excel). The '-b' option prints bed output.
Chromosome Start End #mCpN #Diff.CpN Mean_Difference control FeDef Inside_Genes Between_Genes Gene_Promoter_Cutoff_10000_Nucleotides 1 2984116 2984175 5 3 -54.1 [68.2,87.8,94.1] [35.5,3.5,66.7] Ust n/a Ust 1 3283278 3283785 8 2 26.3 [85.1,70.7,62.0] [98.1,98.7,95.2] n/a Ust-125047,985340-Samd5 n/a 1 3332817 3332836 5 4 49.0 [47.1,30.0,32.4] [99.2,79.0,41.3] n/a Ust-174586,936289-Samd5 n/a
Defiant can also print out adjusted p-values for each DMR with the
-v option if desired.
That's it! More detail is available in the publication mentioned above. We go through excruciating detail comparing out new program to existing DMR packages such as Metilene, MethylKit, MethylSig, RnBeads, and RADMeth, and show that Defiant beats all of them in identifying DMRs quickly and easily. Please let me know if you have any questions.
Very easy to use and fast as it is advertised. It is more sensitive than other methods that it can capture the slight changes. The output results sometimes are somewhat truncated. Maybe can be handled by parameter adjustment.
how do you mean 'truncated'? could you provide a specific example, so if something is wrong, I can fix it?
Thank you for developing this tool. It seems very versatile and user friendly. However, defiant has a problem in reading in my bismark outfiles (both formats). It seems that it is not able to recognize the first "chromosome" column. In your examples, you specify, that chromosome names must have the following format: "chr1", however, my chromosome names are as follows: "NW_017547083.1". If I rename these according to the naming convention required by Defiant, it will disable using a gtf file for annotation, which of course contains "original" chromosome names. Is there a way that will make Defiant to recognize other formats of chromosome names as well? Thank you
that's interesting, I haven't encountered this before.
The problem is that the "chr" string helps Defiant to identify which input type is present. I think that I can try to change this, but it will take a few days.
As a short workaround, something like the script below can add "chr" to a GTF file, which will have the chromosome first.
this script works like "perl add_chr.pl yourfile.gtf" and outputs "yourfile.gtf.chr".
Please let me know how that works, and in the mean time I will try to change the functionality in Defiant.
Thank you a lot for your help! For the first trial, I changed the name of my chromosome column which helped in the identification of the first column and of the input file. However, my computation ended with an error and I was not able to google what it could mean. The following text is a tail of my run log file:
Worker 1/14 is analyzing run 1/1: sets 1 vs. 2 with cutoffs c = 10 CpN = 5 d = 1 p = 0.05 P = 10.000000 S = 0
Now writing nucleotide_T0_T24.tsv
Gathering p-values for p-value adjustment/FDR for chromosome 1... got a bad point: set = 0, COMPARED_SETS = 0
I will greatly appreciate any advice! Thank you
this error occurs if there is a point with 0 coverage (which will lead to undefined methylation values). In other words, these CpG were not detected. Would you be able to eliminate these from your data? Let me know if this fixes your problem.
Hi @dec986, thank you cordially for your help! I finally managed to adjust my data as you suggested and the program did start processing my data!. Unfortunately, the computation was interrupted with the following error:
"Gathering p-/var/spool/pbs/mom_priv/jobs/5160925.meta-pbs.metacentrum.cz.SC: line 30: 39600 Segmentation fault $DEFIANT -b -cpu 14 -E -f -fdr bh -v bh -x CpG -L T0,T24 -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3"
I perform my calculations remotely on a computer cluster and each of my input file is around 40M lines large after zero coverage lines removal. The cluster administrator things that problem is rather in the program code than in the cluster. I will be very grateful for any advice! Thank you
Unfortunately, I'm unable to reproduce your exact error with my sample data.
Could you try a script with the following commands:
If your files have about 40 million lines each, this shouldn't take more than few minutes on a laptop. It would help me to narrow down the problem to know which one of these runs fails.
Hi @dec986, I ran all the commands that you listed above and, interestingly, the segmentation fault happened in these instances:
$DEFIANT -fdr bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3 $DEFIANT -fdr bh -v bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3 $DEFIANT -x CpG -fdr bh -v bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3 $DEFIANT -L T0,T24 -x CpG -fdr bh -v bh -i $T0R1,$T0R2,$T0R3 $T24R1,$T24R2,$T24R3
It seems to me, that the common denominator here is the -fdr option. If it helps, I can share my dataset with you, but I am not sure how to do it. If there is a way to share privately an email or any other option to exchange data privately? I am not sure how it works here. Thank you