Goal
I am trying to get a distribution of Exons and intron sizes in Three-Spined Stickleback (Gasterosteus aculeatus).
Data downloaded
I downloaded some data from Ensembl. More precisely, I went there, selected "structure" in the "attributes" and selected
- Ensembl Gene ID
- Exon Chr Start (bp)
- Exon Chr End (bp)
Issues
There are two issues in these data:
- Some exons end before they start
- Some exons overlap
The first point is - I assumed - due to reversion during assembly, so I just reversed the start and end position when there was such inversion. The second point is more problematic as I don't know what such overlap could mean realistically.
Can you help finding the distribution of exon and intron sizes in three-spined stickleback?
Code
Preparation
# read data
d = read.table(file.choose(), header=TRUE)
names(d)= c("ID", "start", "end")
# Reverse exons when needed
toReverse = which(d$start > d$end)
s = d$start[toReverse]
d$start[toReverse] = d$end[toReverse]
d$end[toReverse] = s
Exon Sizes -> Looks good
# Exon sizes
ExonSizes = d$end - d$start
hist(ExonSizes) # Cool, looks good!
Introns Sizes -> Looks bad
# Intron sizes
# This is a little slow. One might have better to switch to C++.
# The idea is to look at the distance between the end of an exon and the start of the next exon within each gene.
d$ID = paste(d$ID)
IntronSizes = c()
uniquegenes = unique(d$ID)
nbgenes = length(uniquegenes)
i=0
exon = 0
for (gene in uniquegenes)
{
i=i+1
cat(paste0(i," / ",nbgenes,"\n"))
while (TRUE)
{
exon=exon+1
if (d$ID[exon] == gene) {
IntronSizes = append(IntronSizes, d[exon+1,]$start - d[exon,]$end)
} else
{
break
}
}
}
hist(IntronSizes) # There are negative values. I don't know how to interpret them.
hist(log(abs(IntronSizes))) # That looks good but I doubt it makes much sense!
A lot of this depends how good the annotation is that you downloaded. If it was computer assisted/generated then there may be some illogical things in there (like what you point out). This is where human intervention is required and a $1K sequenced genome becomes a $1000K annotated genome.
You may want to look at the problem gene models before reversing the coordinates. Don't do it just because they generate a positive number after reversal.