Tutorial: (Closed) InsideDNA: How to use BEDTools for analysis of genome methylation
gravatar for shilparaopradeep
4.5 years ago by
shilparaopradeep190 wrote:

BEDTools is an extensive suite of utilities for genomic features analysis. There are several common genomic file formats, such as: BAM, GFF, GTF, VCF and most frequently BED which are used as an input for the BEDTools utilities. These utilities allow one to perform basic computing and comparison of genomic features. Since input genomic features are represented as genomic intervals, BEDTools can perform the following manipulations with given genomic features: intersect, merge, count, complement and shuffle genomic intervals from multiple files. In this tutorial, we will use BEDTools to study genome methylation and test a hypothesis about methylation within CpG islands and outside.

BEDTools consists of a set of sub-commands (tools) that are invoked as follows from the terminal of InsideDNA platform:

idna_submit.py [wrapper parameters] 'idna_[tool name] [bedtools options]'

The table below summarizes main tools available in the BEDTools suite:

  1. intersectBed - returns overlapping features between two BED/GFF files
  2. genomeCoverageBed - creates histogram or a 'per base' report of genome coverage by reads
  3. bamToBed - converts BAM alignments to BED and BEDPE formats
  4. fastaFromBed - creates FASTA sequences from BED/GFF intervals
  5. maskFastaFromBed - masks a FASTA file based upon BED/GFF coordinates
  6. windowBed - finds overlapping features between BED/GFF files within a 'window'
  7. shuffleBed - permutes the locations of features within a genome
  8. slopBed - adjusts features by a requested number of base pairs
  9. sortBed - sorts BED/GFF files
  10. complementBed - returns intervals not spanned by features in a BED/GFF file

You can read here more information about each BEDTools command or call --help option within InsideDNA terminal: idna_submit.py -th idna_[toolname]

To see all the tools related to BEDTools suit and available on InsideDNA platform, you can grep for a keyword 'bedtool'

alias | grep 'bedtools'

After short description of BEDTools suit let’s try it on practice. In this tutorial, we will look at how CpG dinucleotides (CpG notation is used to distinguish single-stranded linear sequence from the CG base-pairing) are distributed in the genome and check their epigenetic status. In a nutshell, CpG dinucleotides are very unstable in mammalian genomes because of 5’-cytosine methylation in these dinucleotides. Methylation of CpG dinucleotides elevates probability of CpG -> TpG mutations, that’s why mammalian genomes are depleted of CpG dinucleotides. However, high fraction of CpG dinucleotides is observed in so-called CpG islands (genome fragments with high CpG number observed). CpG islands often play regulatory role in mammalian genomes when located near gene promoters so their methylation status is functionally important. In practice, one would see that CpG dinucleotides are often unmethylated in CpG islands while non-CpG island CpGs are mostly methylated.

Firstly, we need to open the terminal by clicking on the Terminal button:

enter image description here

Secondly, type several commands in the terminal window. To create working directory:

mkdir bedtools

to move into working directory:

cd bedtools

to copy the data in working directory:

wget https://insidedna.me/tool_page_assets/tutorials/tutorial32/cpg.tar.gz

to unarchive the data:

tar xvf cpg.zip /data/userXXX/bedtools

to see content of working directory:


enter image description here

You will see 2 files in the working directory:

1 - CpG_isl.bed contains coordinates of CpG islands on the 10th chromosome of human genomes (hg19). This file was downloaded from UCSC Genome Browser. Type following command to read the first lines of the file

head CpG_isl.bed enter image description here

As you can see, these 3 columns show coordinates of CpG islands (chromosome, start and end positions). This file represents a BED format.

2 - methyl_CpG.bed contains coordinates of CpG dinucleotides from 10th chromosome of human genomes (hg19) with known methylation status (not all CpGs). To get this file we used bisulfate whole genome sequencing data from NGSmethDB for B lymphocytes and then processed this file. Use head command to see the content:

head methyl_CpG.bed enter image description here

First 3 columns represent coordinates of CpG dinucleotides, while the last one corresponds to expected methylation status of CpGs (1 – 5’-C methylated; 0 - unmethylated).

Now let’s use BEDTools itself to find CpGs dinucleotides which are and aren’t located in CpG islands. We will use the following commands for this purpose:

idna_submit.py -t bed1 -c 8 -r 30 -e "idna_intersectBed -a /data/userXXX/bedtools/methyl_CpG.bed -b /data/userXXX/bedtools/CpG_isl.bed -u -bed > /data/userXXX/bedtools/isl.bed"

idna_submit.py -t bed2 -c 8 -r 30 -e "idna_intersectBed -a /data/userXXX/bedtools/methyl_CpG.bed -b /data/userXXX/bedtools/CpG_isl.bed -v -bed > /data/userXXX/bedtools/nonisl.bed"

where userXXX is one’s username, idna_submit.py – wrapper script, /data/userXXX/bedtools/methyl_CpG.bed and /data/userXXX/bedtools/CpG_isl.bed – absolute paths (!) to input BED files for BEDTools intersectBed tool, -bed parameter leads to BED output file format.

In these two cases we used different parameters -u and -v; first one means that we’d like to write all features from -a file which intersect with ones in -b file into isl.bed file, while the second one writes features without intersections into nonisl.bed file. One may check the status of these processes by the “TASKS” button in the toolbar at the top of the page:

enter image description here

After completion of these processes we will find two more files (isl.bed and nonisl.bed) in our working directory.

enter image description here

First file contains CpG dinucleotides which are located inside of CpG islands, second file contains CpG dinucleotides from outside of CpG islands; however, structure of these files is the same as one of methyl_CpG.bed (one may check it by head or less commands).

Then let’s check the number of lines in all three files (meth_CpG.bed, isl.bed and nonisl.bed). For that purpose we would use this command:

less filename | wc -l

As a result we will see the following numbers:

meth_CpG.bed  –  357813 (100% CpGs);
isl.bed  –  66380 (19% of CpGs);
nonisl.bed  –  291433 (81% of CpGs);

enter image description here

We can see that the number of CpG dinucleotides outside of CpG islands is 4-fold higher than inside. However, the picture changes when we take into account that CpG islands comprise only about 1.5 % of Human genome. It means that CpG islands are substantially enriched with CpG dinucleotides, because about 20 % of CpG dinucleotides are embedded in 1.5 % of the genome.

Let’s check an average methylation level of CpG dinucleotides in our two files. 4th column of isl.bed and nonisl.bed files contains zeroes and ones for every unmethylated or methylated CpG dinucleotide respectively (see above). It means that average value of 4th column corresponds to average methylation/all ratio of both CpG types. To calculate mean values of 4th columns we will use quite complicated command:

awk '{ total += $4; count++ } END { print total/count }' filename

enter image description here

This result supports our conclusion that CpG dinucleotides from outside of CpG islands are methylated more often (> 60 % methylated) than inside of them (< 1 % methylated).

ADD COMMENTlink modified 4.5 years ago by Devon Ryan98k • written 4.5 years ago by shilparaopradeep190

Come on, please note in the title that this is an "InsideDNA"-specific tutorial. This has been requested multiple times and I'm not going to continue editing your posts. I'm not going to give any more warnings about this.

BTW, all of this can be done in Galaxy without needing a terminal...

ADD REPLYlink written 4.5 years ago by Devon Ryan98k

Hello shilparaopradeep!

We believe that this post does not fit the main topic of this site.

The post is misleading, not useful wrt the original application of the tool, and a verbatim copy of external material. OP has never responded to any request to clarify authorship.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.


ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by Michael Dondrup48k
Please log in to add an answer.
The thread is closed. No new answers may be added.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2383 users visited in the last hour