Question

Looking for DNA base composition Visualisation tool

0

Entering edit mode

9.2 years ago

VenGeno ▴ 100

Hi Folks,

I am looking for a tool that can be used to visualize base compositions of different promoters (example image is given below) for a manuscript. Can you suggest a tool (preferably OSX compatible) ?

< image not found >

Thanks in advance.

Venura

base-composition data-visualization DNA • 2.2k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by VenGeno ▴ 100

Ram · Accepted Answer · 2015-02-18

As an example, say you start with human GENCODE v21 genes, which you can convert to BED with gtf2bed:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene - \
    > gencode.v21.genes.bed

You might grab the start position of the forward-stranded genes and then use bedops to pad the upstream range, say by 100 bases (you can adjust padding to match the size of the promoter region you are interested in):

$ awk '$6=="+"' gencode.v21.genes.bed \
    | awk '{$3=$2+1; print $0}' - \
    | bedops --range 100:0 --everything - \
    > gencode.v21.genes.for.upstream100nt.bed

Then repeat this for the reverse strand, using the end position as the transcriptional start site of the gene, padding "downstream" (which is really upstream of the TSS, relative to strand orientation):

$ awk '$6=="-"' gencode.v21.genes.bed \
    | awk '{$2=$3-1; print $0}' - \
    | bedops --range 0:100 --everything - \
    > gencode.v21.genes.rev.upstream100nt.bed

Then union the two files into one BED file:

$ bedops --everything gencode.v21.genes.*.upstream100nt.bed > gencode.v21.genes.upstream100nt.bed

You can use tools like bed2fasta to convert this stranded BED file into a FASTA file, and then BioPython to convert FASTA to a PPM (e.g. Parsing A Fasta File By Columns).

This is just an example - you might use that kind of approach on your set of genes of interest, or any other approach that gets you a FASTA file of your promoters and the resulting base composition.

Once you have your promoters-of-interest and their base composition in percentages, R can do the visualization step easily. RStudio is also a useful IDE for doing work in R.

Consider setting up row vectors of percentage values over five bases (which sum to 100 over all four bases at each columnar position). You could then generalize the following example code to your specific dataset:

a <- c(20, 20, 20, 40, 20)
c <- c(15, 30, 30, 10, 15)
g <- c(40, 30, 25, 15, 45)
t <- c(25, 20, 25, 35, 20)
percs <- data.frame(a, c, g, t)

x_vals <- c(seq(1, nrow(percs)), rev(seq(1, nrow(percs))))
y_a_vals <- c(rep(0, nrow(percs)), rev(percs$a + percs$c + percs$g + percs$t))
y_t_vals <- c(rep(0, nrow(percs)), rev(percs$c + percs$g + percs$t))
y_g_vals <- c(rep(0, nrow(percs)), rev(percs$c + percs$g))
y_c_vals <- c(rep(0, nrow(percs)), rev(percs$c))

plot(x=x_vals, y=y_a_vals, type='n', ylim=c(0,100), xlab="bases", ylab="percs")

polygon(x_vals, y_a_vals, col='darkgreen', border=NA)
polygon(x_vals, y_t_vals, col='red', border=NA)
polygon(x_vals, y_g_vals, col='yellow', border=NA)
polygon(x_vals, y_c_vals, col='darkblue', border=NA)

This would make a plot that takes on the following appearance:

plot image

This should be trivial to markup, once you get a little more familiar with R functions like plot(), text(), etc. Use the ? symbol to get help about functions, e.g. "?plot" or "?text".

You can also export PDF from R using the pdf() function. PDFs can be brought into Adobe Illustrator to polish for publication.