Distribution Of Gene Size
2
4
Entering edit mode
10.8 years ago
Ying W ★ 4.2k

I was plotting things today and I found a funny peak in the distribution of gene size and I was wondering if any of you have encountered something similar.

This is what I did:

2. under Annotation/genes there is a genes.gtf
3. take all exons and create bed file with most 3' exon location and most 5' exon location
4. plot geneend-genestart from the bed file created at step 3 (image above)

I used log scale since there is a big tail, I wanted to be able to visualize it better. The peak around 2^25 is ~28.5Mb and I do see quite a few genes with that size. I was thinking I might have done step3 wrong but I couldn't find anything wrong w/the code

The code for plotting is as follows

ee = read.table('exons.bed', sep="\t")

hist(log2(ee[,3]-ee[,2]), breaks=1000, main="log length of gene body")

gene • 6.6k views
11
Entering edit mode
10.8 years ago
brentp 24k

Looks like something with your annotations. Here's the same thing grabbing sizes from UCSC:

mysql --user=genome -N --host=genome-mysql.cse.ucsc.edu -A -D hg19  -e "select txEnd - txStart from knownGene"  > sizes.txt


Then plotting in R:

hist(log2(read.table('sizes.txt')\$V1), breaks=1000)


gives

1
Entering edit mode

thanks, i need to learn the ucsc tables one of these days.

My problem was that I did not notice that the gtf file from iGenomes included the alternative chromosomes and genes that map to them, thus I was combining exons in genes across different chromosome and that gene_ids are not unique

0
Entering edit mode

Out of interest, anyway to represent the x-axis as number of basepairs?

I'm trying to see a distribution of genes and the lengths (measured in base pairs)

0
Entering edit mode

If you use base pairs instead of log(base pairs) as x-axis, the plot will be skewed to the right and less informative.

0
Entering edit mode
10.8 years ago
Wen.Huang ★ 1.2k

could it be duplicate entries of the same gene (but different isoforms, etc) in your table?

0
Entering edit mode

I don't think so, I collapsed the gtf file by gene name so I only have one entry per gene. I looked at the top/bottom 50 genes listed and there did not seem to be a pattern in the name. I'm actually thinking now that it looks that way based on how I plotted it. ~200 genes out of 20k genes or ~1% of human genes fall in a very narrow window of gene size