Hi Folks,
I am looking for a tool that can be used to visualize base compositions of different promoters (example image is given below) for a manuscript. Can you suggest a tool (preferably OSX compatible) ?
Thanks in advance.
Venura
Hi Folks,
I am looking for a tool that can be used to visualize base compositions of different promoters (example image is given below) for a manuscript. Can you suggest a tool (preferably OSX compatible) ?
Thanks in advance.
Venura
As an example, say you start with human GENCODE v21 genes, which you can convert to BED with gtf2bed:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gtf.gz \
| gunzip -c - \
| gtf2bed - \
| grep -w gene - \
> gencode.v21.genes.bed
You might grab the start position of the forward-stranded genes and then use bedops to pad the upstream range, say by 100 bases (you can adjust padding to match the size of the promoter region you are interested in):
$ awk '$6=="+"' gencode.v21.genes.bed \
| awk '{$3=$2+1; print $0}' - \
| bedops --range 100:0 --everything - \
> gencode.v21.genes.for.upstream100nt.bed
Then repeat this for the reverse strand, using the end position as the transcriptional start site of the gene, padding "downstream" (which is really upstream of the TSS, relative to strand orientation):
$ awk '$6=="-"' gencode.v21.genes.bed \
| awk '{$2=$3-1; print $0}' - \
| bedops --range 0:100 --everything - \
> gencode.v21.genes.rev.upstream100nt.bed
Then union the two files into one BED file:
$ bedops --everything gencode.v21.genes.*.upstream100nt.bed > gencode.v21.genes.upstream100nt.bed
You can use tools like bed2fasta
to convert this stranded BED file into a FASTA file, and then BioPython to convert FASTA to a PPM (e.g. Parsing A Fasta File By Columns).
This is just an example - you might use that kind of approach on your set of genes of interest, or any other approach that gets you a FASTA file of your promoters and the resulting base composition.
Once you have your promoters-of-interest and their base composition in percentages, R can do the visualization step easily. RStudio is also a useful IDE for doing work in R.
Consider setting up row vectors of percentage values over five bases (which sum to 100 over all four bases at each columnar position). You could then generalize the following example code to your specific dataset:
a <- c(20, 20, 20, 40, 20)
c <- c(15, 30, 30, 10, 15)
g <- c(40, 30, 25, 15, 45)
t <- c(25, 20, 25, 35, 20)
percs <- data.frame(a, c, g, t)
x_vals <- c(seq(1, nrow(percs)), rev(seq(1, nrow(percs))))
y_a_vals <- c(rep(0, nrow(percs)), rev(percs$a + percs$c + percs$g + percs$t))
y_t_vals <- c(rep(0, nrow(percs)), rev(percs$c + percs$g + percs$t))
y_g_vals <- c(rep(0, nrow(percs)), rev(percs$c + percs$g))
y_c_vals <- c(rep(0, nrow(percs)), rev(percs$c))
plot(x=x_vals, y=y_a_vals, type='n', ylim=c(0,100), xlab="bases", ylab="percs")
polygon(x_vals, y_a_vals, col='darkgreen', border=NA)
polygon(x_vals, y_t_vals, col='red', border=NA)
polygon(x_vals, y_g_vals, col='yellow', border=NA)
polygon(x_vals, y_c_vals, col='darkblue', border=NA)
This would make a plot that takes on the following appearance:
This should be trivial to markup, once you get a little more familiar with R functions like plot()
, text()
, etc. Use the ?
symbol to get help about functions, e.g. "?plot
" or "?text
".
You can also export PDF from R using the pdf()
function. PDFs can be brought into Adobe Illustrator to polish for publication.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Hi Alex,
Thank you so much for your very comprehensive answer.!!!
This project I am looking at Arabidopsis promoters (I already extracted promoters [-1000] from TAIR) Link to the File.
So I think I can start from PPM.
My other project I am looking at Human MT2A so I can start from the BED file.
Right now I am following R Programing videos @coursera. I am a slow learner but hopefully will get there soon :)
Cheers,
Venura