Question: Obtain intronic coordinates from GENCODE database
1
gravatar for Nathaniel
4.2 years ago by
Nathaniel70
Denmark
Nathaniel70 wrote:

I am trying to compute how many reads map into the different functional parts of a gene (exons,introns,UTRs...). To do so, I thought of using the gencode v21 annotation database, which looks like this:

chr21   HAVANA  transcript      10862622        10863067        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";

However, I can not find the label for introns. I guess it is column 3, but all the options are:  {gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine} 

Which one corresponds to intronic cooridnates?

   

 

gene • 2.5k views
ADD COMMENTlink modified 4.2 years ago by Alex Reynolds27k • written 4.2 years ago by Nathaniel70

You can get intronic coordinates for specific transcript using bedtools subtract (subtract exons coordinates from transcript coordinates).

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by PoGibas4.7k
0
gravatar for Alex Reynolds
4.2 years ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

A set operation approach would require splitting out exons per transcript, which would work but may involve many thousands of small disk operations, which cumulatively may be very slow.

One way to do this would be to first obtain the genes:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' \
    | convert2bed -i gff - \
    > genes.bed

Make a second file containing exons:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "exon"' \
    | convert2bed -i gff - \
    > exons.bed

Pull gene IDs from the genes.bed file, preserving the order they are discovered in the genes file:

$ cut -f4 genes.bed > ordered_ids.txt

Use a script to walk through this ID list and the exons file in order, to write a set of introns to standard output:

#!/usr/bin/env perl

use strict;
use warnings;

my $id_fn = "ordered_ids.txt";
my $exons_fn = "exons.bed";
open my $id_fh, "<", $id_fn or die "could not open ID file\n";
open my $exons_fh, "<", $exons_fn or die "could not open exons file\n";
my $result;
my $query_id = <$id_fh>;
while (my $exon_ln = <$exons_fh>) {
    chomp $exon_ln;
    my @exon_elems = split("\t", $exon_ln);
    my $exon_id = $exon_elems[3];
    if ($exon_id =~ /$query_id/) {
        # push another exon into result hash table
        my $exon_chr = $exon_elems[0];
        my $exon_start = $exon_elems[1];
        my $exon_stop = $exon_elems[2];
        $result->{$query_id}->{chr} = $exon_chr;
        push @{$result->{$query_id}->{starts}), $exon_start;
        push @{$result->{$query_id}->{stops}), $exon_stop;
    }
    else {
        # write out introns from gaps between previous exon stop and current exon start 
        if (defined $result->{$query_id}->{chr}) {
            my $result_chr = $result->{$query_id}->{chr};
            my @result_starts = @{$result->{$query_id}->{starts};
            my @result_stops = @{$result->{$query_id}->{stops};
            my $result_exon_count = scalar @result_starts;
            for (my $exon_idx = 1; $exon_idx < $result_exon_count - 1; $exon_idx++) {
                my $result_start = $result_stops[$exon_idx - 1]; # previous stop
                my $result_stop = $result_starts[$exon_idx]; # current start
                print STDOUT "$result_chr\t$result_start\t$result_stop\t$query_id\n";
            }
        }
        $query_id = <$id_fh>;  # grab next ID to query
    }
}
close $id_fh;
close $exons_fh;

I think this approach should be faster, because instead of making lots of tiny files to do set operations on, this steps through both files in order only once, printing out introns as their bounds are found.

While exons.bed is in sort-bed order, it isn't guaranteed that the output from this Perl script is in sort-bed order. So just pipe it in:

$ make_introns_from_exons_and_ids.pl | sort-bed - > introns.bed
ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by Alex Reynolds27k

Thank you very much for such elaborated answer Alex! 

ADD REPLYlink written 4.2 years ago by Nathaniel70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1052 users visited in the last hour