Help with getting data about the gene size distribution
9.1 years ago
ravi • 0

Hi,

I need to find the gene length distributions for multiple organisms. Can someone suggest possible ways to get this information?

Ravi

Gene size genome NCBI • 1.8k views

Do you have annotations available for all of the organisms you are interested in?

You can query the UCSC Genome Browser database directly. I gave a MySQL command for this in the post "How To Get A List Of All Human Genes Above A Certain Length".

9.1 years ago

Grab gene annotations for your organism of interest and convert them to BED format with convert2bed (part of the BEDOPS toolkit). For example, here are recent (GRCh38) human annotations from GENCODE:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' \
    | convert2bed -i gff - \
    > genes.bed

Then map genes against themselves to get their lengths. For instance, with bedmap:

$ bedmap --exact --bases genes.bed > gene_lengths.txt

Or with awk:

$ awk '{print $3-$2;}' genes.bed > gene_lengths.txt
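
If BEDOPS is not installed, the lengths can also be taken straight from the GFF3 file. The sketch below assumes a standard tab-delimited GFF3 with 1-based, inclusive coordinates, so the length is end - start + 1 (this should agree with $3-$2 on a 0-based BED file). The function name gff3_gene_lengths and the annotations/ directory in the comment are made up for illustration:

```shell
# Sketch: print the length of every "gene" feature in a GFF3 file, one per line.
# Assumes tab-delimited GFF3: column 3 = feature type, 4 = start, 5 = end
# (1-based, inclusive), hence length = end - start + 1.
gff3_gene_lengths() {
    awk -F'\t' '$3 == "gene" { print $5 - $4 + 1 }' "$1"
}

# Hypothetical usage for several organisms, one GFF3 file each:
#   for gff in annotations/*.gff3; do
#       gff3_gene_lengths "$gff" > "${gff%.gff3}.gene_lengths.txt"
#   done
```

Header and comment lines starting with # have no third tab-separated field, so they are skipped without any extra filtering.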

The file gene_lengths.txt can be brought into R or similar to make a histogram:

$ R
...
> d <- read.table("gene_lengths.txt", col.names=c("lengths"))
> hist(d$lengths)
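
For a quick look without opening R, a few summary numbers can be computed in the shell. This is only a rough sketch: it assumes gene_lengths.txt has one numeric length per line, and summarize_lengths is a made-up helper name:

```shell
# Sketch: summarise a one-column file of gene lengths.
# Sorts numerically, then reports count, min, median, mean and max.
summarize_lengths() {
    sort -n "$1" | awk '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR == 0) { print "no data"; exit }
            # Median: middle value for odd counts, mean of the two middle values otherwise.
            median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "n=%d min=%d median=%g mean=%g max=%d\n", NR, v[1], median, sum / NR, v[NR]
        }'
}

# Usage: summarize_lengths gene_lengths.txt
```

Gene lengths are usually strongly right-skewed, so the median is often more informative than the mean here.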