Measuring Conservation For Short Motif
Entering edit mode
10.7 years ago
Stephen 2.8k

How can I measure sequence conservation among different species for a very short (5-8 nt) motif in introns?

I know I can visualize conservation in UCSC, but I want something more quantitative. I'm looking at RNA-binding protein recognition motifs that occur in introns flanking the alternative exon. I want to show that this motif is more conserved than a "normal" intron. This paper did it, but didn't really tell how:

We analyzed sequence conservation of pentamers in the mouse intronic regions to identify potential splicing regulatory elements. The mouse introns were aligned to 7 other mammalian genomes that have at least 5x coverage in the UCSC 28-way multigenome alignment. For each pentamer in each region [upstream and downstream of exon], a conservation rate (CR) was calculated as the fraction of aligned and conserved occurrences among total occurrences. The significance of CR of each pentamer is evaluated by comparing with 10 other pentamers with similar expected CR calculated using the first-order Markov model. This procedure essentially controls for possible sequence bias in the dataset. A P value was calculated by using the binomial distribution.

Thanks in advance!

conservation transcription • 6.1k views
Entering edit mode
10.7 years ago
Ian 5.9k

Here is a somewhat low-tech and convoluted way of doing what you want using Phastcons scores, as i have no idea what the paper you mentioned did.

1) Get PhastCons scores from UCSC:

There are various phastCons score files, depending on the genomes in the pileup. E.g.

You can get them by ftp:

user name: anonymous
password: <your email address>
cd goldenPath/mm9/phastCons30way/vertebrate
mget -a

2) Convert wig to bigWig:

Each chromosome file of scores (wig files) can be converted into bigWig files. 64-bit binaries come from:

The fetchChromSizes script is required to get the chromosome length for the 'wigToBigWig' script:

fetchChromSizes mm9 > mm9.chrom.sizes

Run wigToBigWig on each chromosome file, e.g:

wigToBigWig -clip mm9.chrom.sizes

3) Calculate mean phastCons score for the coordinates of interest:

Another script called bigWigSummary can be obtained from UCSC, using the link above. It takes in a set of genome coordinate and will spit out the max or mean of values contained in the bigWig file, in our case phastCons scores.

Below is a Perl script i wrote (below) that retrieves max and mean scores. Paths to files and scripts will need editing.

This is at least a backup if you don't get anything else!

#!/usr/bin/perl -w

use strict;

# Get highest and mean phastCons score for each binding region in BED file
# 10 Sept 2010.  Ian Donaldson.
# 20 Oct 2010 modified to read BED format

# Command line usage

if(@ARGV!=2) {
   die ("USAGE: $0 | Regions BED file | Output\n");

open(BED, "<$ARGV[0]") or die("Could not open input BED file!\n"); 
open(OUTPUT, ">$ARGV[1]") or die("Could not open output file!\n");

# Work thru each line of BED file
while(defined(my $line = <BED>)) {

   my ($chr, $start, $end) = '';

   # Skip line if line does not start with 'chr'
   unless($line =~ /^chr/) { next }

   # Remove EOL from each line

   # Split each line by TAB
   my @line_bits = split(/\t/, $line);

   $chr = $line_bits[0];
   $start = $line_bits[1];
   $end = $line_bits[2];

   # Run bigWigSummary 
   system("/home/idonaldson/bin/UCSC/bigWigSummary -type=max /home/idonaldson/genomes/mm9_vertebrate_phastCons/bw/$ $chr $start $end 1 > $$.max_tmp 2>&1");
   system("/home/idonaldson/bin/UCSC/bigWigSummary -type=mean /home/idonaldson/genomes/mm9_vertebrate_phastCons/bw/$ $chr $start $end 1 > $$.mean_tmp 2>&1");

   # open temp file
   open(MAX_TEMP, "<$$.max_tmp") or die("Cannot open max temp file!\n");
   open(MEAN_TEMP, "<$$.mean_tmp") or die("Cannot open mean temp file!\n");

   # MAX phastCons
   my $max_temp_output = '';

   while(<MAX_TEMP>) {
      $max_temp_output .= $_;

   # close max temp file

   # remove blank lines s/^$//g
   if($max_temp_output =~ /^no/) { 
      $max_temp_output = '';

   # MEAN phastCons
   my $mean_temp_output = '';

   while(<MEAN_TEMP>) {
      $mean_temp_output .= $_;

   # close max temp file

   # remove blank lines s/^$//g
   if($mean_temp_output =~ /^no/) { 
      $mean_temp_output = '';

   # save temp file variable to $ARGV[1]
   print OUTPUT "$line_bits[0]\t$line_bits[1]\t$line_bits[2]\t$max_temp_output\t$mean_temp_output\n";

# remove temporary file to final output
system("rm $$.max_tmp");
system("rm $$.mean_tmp");

# close files

Entering edit mode

I think this code is very useful. Thanks.

Entering edit mode

great effort lan ! i used it and found interesting. but do you think mean and max phast score would work if we need to find conservastion of bit longer regions , say 500-3000 bps? chr7:21114483-21117423 (hg19) has phastcons max =1 and mean= 0.437672 when i check it in UCSC browser, its conserved down to fish ! where as chr7:20838708-20841649 (hg19) max=1 and mean =0.459534 while on browser its conserved to mammals only! why is this contradiction here. may be there is some misunderstanding in my concept about phastcons score. can you please clarify me ? i actually want to find conservation depth of many such regions, and based on 46 vertebrate multiple alignments phastcons score , i am trying to identify distant most specie to which an element/region is conserved.


Login before adding your answer.

Traffic: 850 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6