Question: Distribution Of Gene Size
Asked 7.5 years ago by Ying W (3.9k), South San Francisco, CA (4 votes)

I was plotting things today and I found a funny peak in the distribution of gene size and I was wondering if any of you have encountered something similar.

[Figure: gene size distribution]

This is what I did:

  1. download iGenomes UCSC hg19
  2. under Annotation/genes there is a genes.gtf
  3. take all exons and create a BED file with the most 5' and most 3' exon locations for each gene
  4. plot geneend-genestart from the bed file created at step 3 (image above)

I used a log2 scale because there is a long tail and I wanted to be able to visualize it better. The peak around 2^25 corresponds to ~28.5 Mb, and I do see quite a few genes of that size. I thought I might have done step 3 wrong, but I couldn't find anything wrong with the code.

The code for plotting is as follows

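# read the per-gene BED file (column 2 = start, column 3 = end)
ee = read.table('exons.bed', sep="\t")

# histogram of log2 gene span (end - start, in bp)
hist(log2(ee[,3]-ee[,2]), breaks=1000, main="log2 length of gene body", xlab="log2(gene length in bp)")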
Answer by brentp (23k), Salt Lake City, UT, 7.5 years ago (11 votes)

Looks like something with your annotations. Here's the same thing grabbing sizes from UCSC:

mysql --user=genome -N --host=genome-mysql.cse.ucsc.edu -A -D hg19  -e "select txEnd - txStart from knownGene"  > sizes.txt

Then plotting in R:

hist(log2(read.table('sizes.txt')$V1), breaks=1000)

gives

[Figure: histogram of log2(txEnd - txStart) from the UCSC knownGene table]


Thanks, I need to learn the UCSC tables one of these days.

My problem was that I did not notice that the GTF file from iGenomes includes the alternative chromosomes and the genes that map to them. Because gene_ids are not unique across chromosomes, I was combining exons that share a gene_id but sit on different chromosomes into a single "gene".
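
A minimal sketch of one way to avoid that mix-up: key each span on chromosome plus gene_id, so exons on an alternative haplotype never get merged with exons on the primary chromosome. The function name `gene_spans` is hypothetical, and the code assumes a tab-separated GTF with the feature type in column 3, 1-based inclusive coordinates in columns 4 and 5, and a `gene_id "..."` attribute in column 9.

```shell
# Collapse GTF exon records into one BED line per (chromosome, gene_id).
# Assumption: tab-separated GTF with gene_id "..." in column 9.
gene_spans() {
  awk -F '\t' '
    $3 == "exon" {
      # Pull the gene_id value out of the attribute column; skip if absent.
      if (!match($9, /gene_id "[^"]+"/)) next
      gene = substr($9, RSTART + 9, RLENGTH - 10)
      key = $1 SUBSEP gene          # chromosome is part of the key
      start = $4 - 1                # GTF is 1-based, BED start is 0-based
      stop = $5 + 0
      if (!(key in lo) || start < lo[key]) lo[key] = start
      if (!(key in hi) || stop > hi[key]) hi[key] = stop
    }
    END {
      for (key in lo) {
        split(key, part, SUBSEP)
        printf "%s\t%d\t%d\t%s\n", part[1], lo[key], hi[key], part[2]
      }
    }
  ' "$@" | LC_ALL=C sort -k1,1 -k2,2n
}
```

Usage would look like `gene_spans genes.gtf > genes.bed` (the output name is arbitrary). A gene_id that occurs at two distant loci on the same chromosome would still be merged into one long span, so adding the strand or the transcript_id to the key may be worth considering.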

written 7.5 years ago by Ying W (3.9k), modified 7.5 years ago

Out of interest, is there any way to show the x-axis as a number of base pairs?

I'm trying to see the distribution of gene lengths (measured in base pairs).

written 4.6 years ago by simthian (0)

If you use base pairs instead of log2(base pairs) on the x-axis, the histogram will be heavily skewed to the right and less informative.
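
One middle ground is to keep log-spaced bins but label them in base pairs. The sketch below does this as a plain text table instead of a plot: it counts sizes in power-of-two bins and prints each bin's bounds in bp. The function name `size_bins_bp` is hypothetical, and the input is assumed to be one positive integer size per line, like the sizes.txt produced by the mysql command above.

```shell
# Count sizes (one integer per line, in bp) in power-of-two bins and
# print each bin as: lower bound (bp), upper bound (bp), count.
size_bins_bp() {
  awk '
    $1 > 0 {
      # Find b such that 2^b <= size < 2^(b+1) by repeated doubling.
      b = 0; edge = 2
      while (edge <= $1) { edge *= 2; b++ }
      count[b]++
      if (b > max) max = b
      if (min == "" || b < min) min = b
    }
    END {
      if (min == "") exit           # no usable input
      # Empty bins between the smallest and largest are printed with count 0.
      for (b = min; b <= max; b++)
        printf "%d\t%d\t%d\n", 2^b, 2^(b+1) - 1, count[b]
    }
  ' "$@"
}
```

Running `size_bins_bp sizes.txt` prints one line per bin. The same idea carries over to a plot: keep the log2 scale for the bars and put the bp values on the tick labels.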

written 4.6 years ago by Ying W (3.9k)
Answer by Wen.Huang (1.2k), 7.5 years ago (0 votes)

Could it be duplicate entries of the same gene (different isoforms, etc.) in your table?


I don't think so. I collapsed the GTF file by gene name, so I only have one entry per gene. I looked at the top and bottom 50 genes and there did not seem to be a pattern in the names. I'm now thinking that it only looks that way because of how I plotted it: ~200 genes out of 20k, or ~1% of human genes, fall in a very narrow window of gene size.

written 7.5 years ago by Ying W (3.9k), modified 7.5 years ago