Question: Trying to get a distribution of Exon lengths and Intron lengths.
2
gravatar for remi.b.md
2.7 years ago by
remi.b.md60
remi.b.md60 wrote:

Goal

I am trying to get a distribution of Exons and intron sizes in Three-Spined Stickleback (Gasterosteus aculeatus).

Data downloaded

I downloaded some data from Ensembl. More precisely, I went there, selected "structure" in the "attributes" and selected

- Ensembl Gene ID
- Exon Chr Start (bp)
- Exon Chr End (bp)

Issues

There are two issues in these data:

  1. Some exons end before they start
  2. Some exons overlap

The first point is - I assumed - due to reversion during assembly, so I just reversed the start and end position when there was such inversion. The second point is more problematic as I don't know what such overlap could mean realistically.

Can you help finding the distribution of exon and intron sizes in three-spined stickleback?


Code

Preparation

# read data

d = read.table(file.choose(), header=TRUE)
names(d)= c("ID", "start", "end")

# Reverse exons when needed

toReverse = which(d$start > d$end)
s = d$start[toReverse]
d$start[toReverse] = d$end[toReverse]
d$end[toReverse] = s

Exon Sizes -> Looks good

# Exon sizes

ExonSizes = d$end - d$start
hist(ExonSizes) # Cool, looks good!

Introns Sizes -> Looks bad

# Intron sizes
# This is a little slow. One might have better to switch to C++.
# The idea is to look at the distance between the end of an exon and the start of the next exon within each gene.

d$ID = paste(d$ID)
IntronSizes = c()
uniquegenes = unique(d$ID)
nbgenes = length(uniquegenes)
i=0
exon = 0
for (gene in uniquegenes)
{
    i=i+1
    cat(paste0(i," / ",nbgenes,"\n"))

    while (TRUE)
    {
        exon=exon+1
        if (d$ID[exon] == gene) {
            IntronSizes = append(IntronSizes, d[exon+1,]$start - d[exon,]$end)
        } else 
        {
            break
        }
    }
}
hist(IntronSizes) # There are negative values. I don't know how to interpret them. 

hist(log(abs(IntronSizes))) # That looks good but I doubt it makes much sense!
intron exon sequence ensembl gene • 1.4k views
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by remi.b.md60
    Some exons end before they start
    Some exons overlap

The first point is - I assumed - due to reversion during assembly, so I just reversed the start and end position when there was such inversion. The second point is more problematic as I don't know what such overlap could mean realistically.

A lot of this depends how good the annotation is that you downloaded. If it was computer assisted/generated then there may be some illogical things in there (like what you point out). This is where human intervention is required and a $1K sequenced genome becomes a $1000K annotated genome.

You may want to look at the problem gene models before reversing the coordinates. Don't do it just because they generate a positive number after reversal.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by genomax62k
3
gravatar for remi.b.md
2.7 years ago by
remi.b.md60
remi.b.md60 wrote:

Download data

In the terminal:

curl ftp://ftp.ensembl.org/pub/release-84/gtf/gasterosteus_aculeatus/Gasterosteus_aculeatus.BROADS1.84.gtf.gz

curl is for MAC OSX users. Use wget for Linux users.

Get Distributions

In R:

library(GenomicFeatures) # This is a "BioConductor" pakcage
txdb = makeTxDbFromGFF("Gasterosteus_aculeatus.BROADS1.84.gtf")
hist(log10(width(unique(exons(txdb))))) # exons
hist(log10(width(unique(unlist(intronsByTranscript(txdb)))))) # introns

I am not sure the data are better but at least there are no negative intron lengths!

ADD COMMENTlink written 2.7 years ago by remi.b.md60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1384 users visited in the last hour