Closed:InsideDNA: How to use BEDTools for analysis of genome methylation
0
0
Entering edit mode
7.8 years ago

BEDTools is an extensive suite of utilities for genomic features analysis. There are several common genomic file formats, such as: BAM, GFF, GTF, VCF and most frequently BED which are used as an input for the BEDTools utilities. These utilities allow one to perform basic computing and comparison of genomic features. Since input genomic features are represented as genomic intervals, BEDTools can perform the following manipulations with given genomic features: intersect, merge, count, complement and shuffle genomic intervals from multiple files. In this tutorial, we will use BEDTools to study genome methylation and test a hypothesis about methylation within CpG islands and outside.

BEDTools consists of a set of sub-commands (tools) that are invoked as follows from the terminal of InsideDNA platform:

idna_submit.py [wrapper parameters] 'idna_[tool name] [bedtools options]'

The table below summarizes main tools available in the BEDTools suite:

  1. intersectBed - returns overlapping features between two BED/GFF files
  2. genomeCoverageBed - creates histogram or a 'per base' report of genome coverage by reads
  3. bamToBed - converts BAM alignments to BED and BEDPE formats
  4. fastaFromBed - creates FASTA sequences from BED/GFF intervals
  5. maskFastaFromBed - masks a FASTA file based upon BED/GFF coordinates
  6. windowBed - finds overlapping features between BED/GFF files within a 'window'
  7. shuffleBed - permutes the locations of features within a genome
  8. slopBed - adjusts features by a requested number of base pairs
  9. sortBed - sorts BED/GFF files
  10. complementBed - returns intervals not spanned by features in a BED/GFF file

You can read here more information about each BEDTools command or call --help option within InsideDNA terminal: idna_submit.py -th idna_[toolname]

To see all the tools related to BEDTools suit and available on InsideDNA platform, you can grep for a keyword 'bedtool'

alias | grep 'bedtools'

After short description of BEDTools suit let’s try it on practice. In this tutorial, we will look at how CpG dinucleotides (CpG notation is used to distinguish single-stranded linear sequence from the CG base-pairing) are distributed in the genome and check their epigenetic status. In a nutshell, CpG dinucleotides are very unstable in mammalian genomes because of 5’-cytosine methylation in these dinucleotides. Methylation of CpG dinucleotides elevates probability of CpG -> TpG mutations, that’s why mammalian genomes are depleted of CpG dinucleotides. However, high fraction of CpG dinucleotides is observed in so-called CpG islands (genome fragments with high CpG number observed). CpG islands often play regulatory role in mammalian genomes when located near gene promoters so their methylation status is functionally important. In practice, one would see that CpG dinucleotides are often unmethylated in CpG islands while non-CpG island CpGs are mostly methylated.

Firstly, we need to open the terminal by clicking on the Terminal button:

enter image description here

Secondly, type several commands in the terminal window. To create working directory:

mkdir bedtools

to move into working directory:

cd bedtools

to copy the data in working directory:

wget https://insidedna.me/tool_page_assets/tutorials/tutorial32/cpg.tar.gz

to unarchive the data:

tar xvf cpg.zip /data/userXXX/bedtools

to see content of working directory:

ls

enter image description here

You will see 2 files in the working directory:

1. CpG_isl.bed contains coordinates of CpG islands on the 10th chromosome of human genomes (hg19). This file was downloaded from UCSC Genome Browser. Type following command to read the first lines of the file

head CpG_isl.bed

enter image description here

As you can see, these 3 columns show coordinates of CpG islands (chromosome, start and end positions). This file represents a BED format.

2. methyl_CpG.bed contains coordinates of CpG dinucleotides from 10th chromosome of human genomes (hg19) with known methylation status (not all CpGs). To get this file we used bisulfate whole genome sequencing data from NGSmethDB for B lymphocytes and then processed this file. Use head command to see the content:

head methyl_CpG.bed

enter image description here

First 3 columns represent coordinates of CpG dinucleotides, while the last one corresponds to expected methylation status of CpGs (1 – 5’-C methylated; 0 - unmethylated).

Now let’s use BEDTools itself to find CpGs dinucleotides which are and aren’t located in CpG islands. We will use the following commands for this purpose:

idna_submit.py -t bed1 -c 8 -r 30 -e "idna_intersectBed -a /data/userXXX/bedtools/methyl_CpG.bed -b /data/userXXX/bedtools/CpG_isl.bed -u -bed > /data/userXXX/bedtools/isl.bed"
idna_submit.py -t bed2 -c 8 -r 30 -e "idna_intersectBed -a /data/userXXX/bedtools/methyl_CpG.bed -b /data/userXXX/bedtools/CpG_isl.bed -v -bed > /data/userXXX/bedtools/nonisl.bed"

where userXXX is one’s username, idna_submit.py - wrapper script, /data/userXXX/bedtools/methyl_CpG.bed and /data/userXXX/bedtools/CpG_isl.bed - absolute paths (!) to input BED files for BEDTools intersectBed tool, -bed parameter leads to BED output file format.

In these two cases we used different parameters -u and -v; first one means that we’d like to write all features from -a file which intersect with ones in -b file into isl.bed file, while the second one writes features without intersections into nonisl.bed file. One may check the status of these processes by the "TASKS" button in the toolbar at the top of the page:

enter image description here

After completion of these processes we will find two more files (isl.bed and nonisl.bed) in our working directory.

enter image description here

First file contains CpG dinucleotides which are located inside of CpG islands, second file contains CpG dinucleotides from outside of CpG islands; however, structure of these files is the same as one of methyl_CpG.bed (one may check it by head or less commands).

Then let’s check the number of lines in all three files (meth_CpG.bed, isl.bed and nonisl.bed). For that purpose we would use this command:

less filename | wc -l

As a result we will see the following numbers:

meth_CpG.bed  –  357813 (100% CpGs);
isl.bed  –  66380 (19% of CpGs);
nonisl.bed  –  291433 (81% of CpGs);

enter image description here

We can see that the number of CpG dinucleotides outside of CpG islands is 4-fold higher than inside. However, the picture changes when we take into account that CpG islands comprise only about 1.5 % of Human genome. It means that CpG islands are substantially enriched with CpG dinucleotides, because about 20 % of CpG dinucleotides are embedded in 1.5 % of the genome.

Let’s check an average methylation level of CpG dinucleotides in our two files. 4th column of isl.bed and nonisl.bed files contains zeroes and ones for every unmethylated or methylated CpG dinucleotide respectively (see above). It means that average value of 4th column corresponds to average methylation/all ratio of both CpG types. To calculate mean values of 4th columns we will use quite complicated command:

awk '{ total += $4; count++ } END { print total/count }' filename

enter image description here

This result supports our conclusion that CpG dinucleotides from outside of CpG islands are methylated more often (> 60 % methylated) than inside of them (< 1 % methylated).

insideDNA genomics • 4.3k views
ADD COMMENT
This thread is not open. No new answers may be added
Traffic: 2574 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6