average distance between genes
4.6 years ago
Ric ▴ 430

Hi, is there a way to calculate the average distance between genes and exons from a GFF3 file?

Thank you in advance,

assembly gene annotation • 1.5k views
Sure. Make sure the GFF is position-sorted. Grep out the gene lines, subtract the end of the previous gene from the start of the current gene, collect the distances, and calculate the mean/median distance.
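
The steps above could be sketched in Python roughly as follows. This is only a minimal illustration, not a tested pipeline: the function names `feature_gaps` and `summarize` are made up for this example, strand is ignored, overlapping or nested features are skipped, and the gap is taken as start - previous end - 1 because GFF3 coordinates are 1-based and inclusive (drop the `- 1` if you prefer the plain start-minus-end difference).

```python
import statistics
from collections import defaultdict


def feature_gaps(gff_lines, feature_type="gene"):
    """Return gaps (in bases) between consecutive features of one type.

    Assumes tab-separated GFF3 lines; features are grouped by sequence ID
    and sorted here, so the input does not have to be pre-sorted.
    """
    spans_by_seq = defaultdict(list)
    for line in gff_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[2] != feature_type:
            continue
        spans_by_seq[fields[0]].append((int(fields[3]), int(fields[4])))

    gaps = []
    for spans in spans_by_seq.values():
        spans.sort()
        prev_end = spans[0][1]
        for start, end in spans[1:]:
            gap = start - prev_end - 1  # bases strictly between the two features
            if gap >= 0:  # negative means overlap or nesting: skip it
                gaps.append(gap)
            # Track the furthest end seen so nested features do not shrink it.
            prev_end = max(prev_end, end)
    return gaps


def summarize(gaps):
    """Return (mean, median) of the collected gaps."""
    return statistics.mean(gaps), statistics.median(gaps)
```

For example, `with open("annotations.gff") as fh: print(summarize(feature_gaps(fh)))` would print the mean and median gene-to-gene gap. Passing `feature_type="exon"` gives the gaps between consecutive exons on the same sequence, regardless of which gene or transcript they belong to, so you may want to de-duplicate exons shared between transcripts first.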

4.5 years ago

Here's a one-liner that uses BEDOPS closest-features on a UCSC-derived refGene list of genes:

$ closest-features --closest --no-ref --no-overlaps --dist refGene.hg38.bed refGene.hg38.bed | cut -d'|' -f2 | grep -v NA | awk '{ if($1<=0){ $1*= -1;} print $1;}' | Rscript -e 'summary(as.numeric(read.table(file("stdin"))[,1]))'
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1    1185    6411   25654   22165 1687452

The median distance between genes is 6411 nt, the mean is about 25.7 kb, and so on.

The file refGene.hg38.bed is sorted with BEDOPS sort-bed.

If you're starting from GFF3, you can use BEDOPS gff2bed:

$ awk '($3 == "gene")' annotations.gff | gff2bed - > annotations.bed

To incorporate this into the one-liner above, use bash process substitution:

$ closest-features --closest --no-ref --no-overlaps --dist <(awk '($3 == "gene")' annotations.gff | gff2bed -) <(awk '($3 == "gene")' annotations.gff | gff2bed -) | cut -d'|' -f2 | grep -v NA | awk '{ if($1<=0){ $1*= -1;} print $1;}' | Rscript -e 'summary(as.numeric(read.table(file("stdin"))[,1]))'