Question

Distribution Of Gene Size

4

Entering edit mode

12.0 years ago

Ying W ★ 4.2k

I was plotting things today and I found a funny peak in the distribution of gene size and I was wondering if any of you have encountered something similar.

gene size distribution

This is what I did:

download iGenomes UCSC hg19
under Annotation/genes there is a genes.gtf
take all exons and create bed file with most 3' exon location and most 5' exon location
plot geneend-genestart from the bed file created at step 3 (image above)

I used log scale since there is a big tail, I wanted to be able to visualize it better. The peak around 2^25 is ~28.5Mb and I do see quite a few genes with that size. I was thinking I might have done step3 wrong but I couldn't find anything wrong w/the code

The code for plotting is as follows

ee = read.table('exons.bed', sep="\t")

hist(log2(ee[,3]-ee[,2]), breaks=1000, main="log length of gene body")

gene • 7.3k views

ADD COMMENT • link 11.3 years ago by Ying W ★ 4.2k

score 11 · Answer 1 · 2012-04-20

11

Entering edit mode

12.0 years ago

brentp 24k

Looks like something with your annotations. Here's the same thing grabbing sizes from UCSC:

mysql --user=genome -N --host=genome-mysql.cse.ucsc.edu -A -D hg19  -e "select txEnd - txStart from knownGene"  > sizes.txt

Then plotting in R:

hist(log2(read.table('sizes.txt')$V1), breaks=1000)

gives

enter image description here

ADD COMMENT • link 12.0 years ago by brentp 24k

1

Entering edit mode

thanks, i need to learn the ucsc tables one of these days.

My problem was that I did not notice that the gtf file from iGenomes included the alternative chromosomes and genes that map to them, thus I was combining exons in genes across different chromosome and that gene_ids are not unique

ADD REPLY • link 12.0 years ago by Ying W ★ 4.2k

0

Entering edit mode

Out of interest, anyway to represent the x-axis as number of basepairs?

I'm trying to see a distribution of genes and the lengths (measured in base pairs)

ADD REPLY • link 9.1 years ago by simthian • 0

0

Entering edit mode

If you use base pairs instead of log(base pairs) as x-axis, the plot will be skewed to the right and less informative.

ADD REPLY • link 9.1 years ago by Ying W ★ 4.2k

score 0 · Answer 2 · 2012-04-20

0

Entering edit mode

12.0 years ago

Wen.Huang ★ 1.2k

could it be duplicate entries of the same gene (but different isoforms, etc) in your table?

ADD COMMENT • link 12.0 years ago by Wen.Huang ★ 1.2k

0

Entering edit mode

I don't think so, I collapsed the gtf file by gene name so I only have one entry per gene. I looked at the top/bottom 50 genes listed and there did not seem to be a pattern in the name. I'm actually thinking now that it looks that way based on how I plotted it. ~200 genes out of 20k genes or ~1% of human genes fall in a very narrow window of gene size

ADD REPLY • link 12.0 years ago by Ying W ★ 4.2k