Trying to get a distribution of Exon lengths and Intron lengths.
1
2
Entering edit mode
7.9 years ago
remi.b.md ▴ 60

Goal

I am trying to get a distribution of Exons and intron sizes in Three-Spined Stickleback (Gasterosteus aculeatus).

Data downloaded

I downloaded some data from Ensembl. More precisely, I went there, selected "structure" in the "attributes" and selected

- Ensembl Gene ID
- Exon Chr Start (bp)
- Exon Chr End (bp)

Issues

There are two issues in these data:

  1. Some exons end before they start
  2. Some exons overlap

The first point is - I assumed - due to reversion during assembly, so I just reversed the start and end position when there was such inversion. The second point is more problematic as I don't know what such overlap could mean realistically.

Can you help finding the distribution of exon and intron sizes in three-spined stickleback?


Code

Preparation

# read data

d = read.table(file.choose(), header=TRUE)
names(d)= c("ID", "start", "end")

# Reverse exons when needed

toReverse = which(d$start > d$end)
s = d$start[toReverse]
d$start[toReverse] = d$end[toReverse]
d$end[toReverse] = s

Exon Sizes -> Looks good

# Exon sizes

ExonSizes = d$end - d$start
hist(ExonSizes) # Cool, looks good!

Introns Sizes -> Looks bad

# Intron sizes
# This is a little slow. One might have better to switch to C++.
# The idea is to look at the distance between the end of an exon and the start of the next exon within each gene.

d$ID = paste(d$ID)
IntronSizes = c()
uniquegenes = unique(d$ID)
nbgenes = length(uniquegenes)
i=0
exon = 0
for (gene in uniquegenes)
{
    i=i+1
    cat(paste0(i," / ",nbgenes,"\n"))

    while (TRUE)
    {
        exon=exon+1
        if (d$ID[exon] == gene) {
            IntronSizes = append(IntronSizes, d[exon+1,]$start - d[exon,]$end)
        } else 
        {
            break
        }
    }
}
hist(IntronSizes) # There are negative values. I don't know how to interpret them. 

hist(log(abs(IntronSizes))) # That looks good but I doubt it makes much sense!
sequence gene exon intron Ensembl • 3.1k views
ADD COMMENT
0
Entering edit mode
    Some exons end before they start
    Some exons overlap

The first point is - I assumed - due to reversion during assembly, so I just reversed the start and end position when there was such inversion. The second point is more problematic as I don't know what such overlap could mean realistically.

A lot of this depends how good the annotation is that you downloaded. If it was computer assisted/generated then there may be some illogical things in there (like what you point out). This is where human intervention is required and a $1K sequenced genome becomes a $1000K annotated genome.

You may want to look at the problem gene models before reversing the coordinates. Don't do it just because they generate a positive number after reversal.

ADD REPLY
3
Entering edit mode
7.9 years ago
remi.b.md ▴ 60

Download data

In the terminal:

curl ftp://ftp.ensembl.org/pub/release-84/gtf/gasterosteus_aculeatus/Gasterosteus_aculeatus.BROADS1.84.gtf.gz

curl is for MAC OSX users. Use wget for Linux users.

Get Distributions

In R:

library(GenomicFeatures) # This is a "BioConductor" pakcage
txdb = makeTxDbFromGFF("Gasterosteus_aculeatus.BROADS1.84.gtf")
hist(log10(width(unique(exons(txdb))))) # exons
hist(log10(width(unique(unlist(intronsByTranscript(txdb)))))) # introns

I am not sure the data are better but at least there are no negative intron lengths!

ADD COMMENT

Login before adding your answer.

Traffic: 2592 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6