Question

Conservation score for a piece of human genome

0

Entering edit mode

10.0 years ago

Floris Brenk ★ 1.0k

Hi all,

I got maybe a bit of naive/stupid question, but I was wondering if there some kind of (online) tool that calculates the conservation between mammals or similar just by inputting coordinates from the human genome. As output I don't really know what to expect. I think a percentage or something would be easy?

I'm have some genomic regions with potential enhancers and intergenic start sites etc and want to know if these regions are conserved which would strengthen these regions.

Thanks, any help is welcome.

conservation human genome • 3.9k views

ADD COMMENT • link updated 10.0 years ago by Alex Reynolds 36k • written 10.0 years ago by Floris Brenk ★ 1.0k

0

Entering edit mode

see How To Calculate Conservation Score Of Given Bed Regions

ADD REPLY • link 10.0 years ago by Pierre Lindenbaum 166k

Ram · Answer 1 · 2015-07-28

First, you could obtain PhastCons or phyloP conservation data via UCSC.

I'll show how you can do this with human phyloP data, which extrapolates to other per-base signal types. The general process outlined below will work with most any signal type, where signal falls directly on the base. In the case of signal that falls between bases (such as DNaseI or "cut-count" data) there are per-strand corrections required; these issues should not arise in conservation data scoring.

Intermediate data files are in UCSC BED format: I therefore show how to use high-performance BEDOPS tools to manipulate and map BED files.

I'm also going to use Perl instead of Python to process standard input and output, and R to do calculations. For one, Perl is a more natural choice for I/O-heavy work; Python is just slow. Second, R can do vectorized calculations that make it faster to process large matrices, without requiring installing SciPy or other heavy and fragile dependencies.

Note that PhastCons data can contain gaps, but you can't assume that these gaps are areas of zero or "neutral" conservation. Rather, they are regions simply where no score could be generated. This is, I think, an important distinction in that I do not think it is possible to interpolate gap scores between points where there are data. From prior experience, therefore, I would suggest either using per-base phyloP data or throwing away PhastCons-scored vectors, where such gaps are found.

In any case, as a for instance, here are phyloP 46-way vertebrate conservation data in WIG format: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/

Wget is a common tool for automating the acquisition of files off the web in bulk:

wget -r -l1 -nd -nc -A.wigFix.gz http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/

Once you have the WIG files, you can convert them into a unioned BED file via wig2bed and bedops:

for fn in `ls *.wigFix.gz`; do gunzip -c ${fn} | wig2bed - > ${fn}.bed; done
bedops --everything chr*.bed > vertebrate.phyloP46.bed

Once you have all the per-chromosome signal into one sorted BED file called vertebrate.phyloP46.bed, you can use BEDOPS bedmap to map conservation score signal to gene or other annotations (enhancer regions, etc.).

For this example, you'll need gene annotations as well as the conservation signal. You might get example gene annotations via Sanger and convert them to BED via gff2bed:

wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | gff2bed - \
    | grep -F "gene" - \
    | cut -f1-6 - \
    > genes.bed

Wherever you get your genes or other annotations from, you should now have a file of "reference" elements (genes) and a "map" file (signal). Both files should be sorted BED files, ordered correctly per sort-bed ordering.

Split the all-genes BED file into per-gene BED files, using cut, sort and uniq to acquire a listing of the names of each gene:

mkdir perGene
for name in `cut -f4 genes.bed | sort | uniq`; do grep -F ${name} genes.bed > perGene/${name}.bed; done

(In your specific case, you should have or create a unique identifier for each enhancer or other annotation of interest.)

For each gene instance, you need to split each instance apart by individual bases:

#!/usr/bin/env perl

use strict;
use warnings;

my $cnt = 0;
while (<STDIN>) {
    chomp;
    my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
    for (my $meltIdx = $start; $meltIdx < $stop; $meltIdx += 1) {
        print STDOUT join("\t", ($chr, $meltIdx, $meltIdx + 1, $id."!$cnt", $score, $strand))."\n";
    }
    $cnt++;
}

This splits each per-gene BED element into single bases, where each instance of a gene has a unique ID key:

for fn in `ls perGene/*.bed`; do melt.pl < ${fn} > ${fn}.melted.bed; done

The unique ID key is critical, because it is possible that some genes may partially overlap (palindromic regions, particularly, which may be under similar selection pressure) and we need to distinguish and score two overlapping elements separately.

Once you have per-base-split genes, you can score them quickly via bedmap --faster:

for fn in `ls perGene/*.melted.bed`; do bedmap --faster --echo --indicator --echo-map-score --delim '\t' ${fn} vertebrate.phyloP46.bed > ${fn}.vertebrate.phyloP46.bed; done

The result is a BED file for each gene (with the suffix .vertebrate.phyloP46.bed) that contains the phyloP score at each position of each "melted" gene instance, where a score exists. In case of no score, there is a blank field; we account for this later by assigning a score of zero, which should be reasonable for phyloP data.

We now "unmelt" or "freeze" the per-base scores back into a BED7+ file containing a vector of scores for the whole gene element:

#!/usr/bin/perl

use strict;
use warnings;

my $signalRef;
my $keyRef;

while (<STDIN>) {
    chomp;
    my ($chr, $start, $stop, $id, $score, $strand, $indicator, $signal) = split("\t", $_);
    if (! defined $signalRef->{$id}) {
        push (@{$keyRef}, $id);
        my @id_components = split("!", $id);
        my $fixed_id = $id_components[0];
        $signalRef->{$id}->{chr} = $chr;
        $signalRef->{$id}->{start} = $start;
        $signalRef->{$id}->{stop} = $stop;
        $signalRef->{$id}->{id} = $fixed_id;
        $signalRef->{$id}->{score} = $score;
        $signalRef->{$id}->{strand} = $strand;
        $signalRef->{$id}->{signals} = ();
    }
    if ($indicator == 0) {
        push (@{$signalRef->{$id}->{signals}}, 0);
    }
    else {
        push (@{$signalRef->{$id}->{signals}}, $signal);
    }
}

foreach my $key (@{$keyRef}) {
    my $chr = $signalRef->{$key}->{chr};
    my $start = $signalRef->{$key}->{start};
    my $stop = $signalRef->{$key}->{stop};
    my $id = $signalRef->{$key}->{id};
    my $score = $signalRef->{$key}->{score};
    my $strand = $signalRef->{$key}->{strand};
    my @signals = @{$signalRef->{$key}->{signals}};
    print STDOUT join("\t", ($chr, $start, $stop, $id, $score, $strand))."\t".join("\t", @signals)."\n";
}

To freeze all the melted files:

for fn in `ls perGene/*.melted.bed`; do freeze.pl < ${fn} > ${fn}.frozen.bed; done

Once turned into "frozen" BED files, you can turn them into more convenient "matrix" text files:

#!/usr/bin/perl

use strict;
use warnings;

# Input:
# chrX  155109956  155111957  ENSG00000124333.10  1  +  0  0  0  0 ...

# Output:
# window 0 1 2 3 4 ...
# chrX:155109956:155111957:ENSG00000124333.10:1:+  0  0  0  0 ...

my $idx = 0;
while (<STDIN>) {
    chomp;
    my @elems = split("\t", $_);
    my ($chr, $start, $stop, $id, $score, $strand) = @elems[0..5];
    my @mtx_scores = @elems[6..(scalar @elems - 1)];

    if ($idx == 0) {
        my $len = scalar @mtx_scores;
        print STDOUT "window\t".join("\t", 0..($len-1))."\n";
    }

    my $key = join(":", @elems[0..5]);
    if ($strand eq "+")
    {
        print STDOUT "$key\t".join("\t", @mtx_scores)."\n";
    }
    elsif ($strand eq "-")
    {
        print STDOUT "$key\t".join("\t", reverse @mtx_scores)."\n";
    }
    $idx++;
}

Reversal of score vectors for reverse-stranded elements helps align the positions of conservation scores, so that calculating columnar means is done on aligned bases.

If you want to see the effect, you can split rows by strand and process them separately. This should demonstrate more clearly why reverse is used.

A "frozen" per-gene annotation file can be turned into a per-gene matrix ("mtx") easily:

for fn in `ls perGene/*.frozen.bed`; do mtx.pl < ${fn} > ${fn}.mtx; done

Note that the matrix result is no longer a BED-formatted file, but that doesn't matter: The key in the matrix file's first column can be easily reprocessed downstream, if a BED file is once again needed. Mainly, a matrix file is simply easier to work with in R.

A matrix file is easily imported into R and rendered into a vector of per-base averages:

gene_df <- read.table("gene_xyz.mtx", sep="\t", stringsAsFactors=T, skip=1, row.names=1)
gene_df_means <- as.vector(colMeans(gene_df))
print(gene_df_means)

Replace gene_xyz.mtx with the name of your matrix file. You can do this programmatically with Rscript or similar. If you need to write out per-base averages from within R:

sink("gene_xyz.averages.txt")
cat(paste(gene_df_means, "\n", sep=""))
sink()

Modify filenames, as needed.

view raw biostars-152468.md hosted with ❤ by GitHub