4.6 years ago
Ric
Hi, is there a way to calculate the average distance between genes and exons from a GFF3 file?
Thank you in advance,
Here's a one-liner that uses BEDOPS closest-features on a UCSC-derived refGene list of genes:
$ closest-features --closest --no-ref --no-overlaps --dist refGene.hg38.bed refGene.hg38.bed | cut -d'|' -f2 | grep -v NA | awk '{ if($1<=0){ $1*= -1;} print $1;}' | Rscript -e 'summary(as.numeric(read.table(file("stdin"))[,1]))'
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1185 6411 25654 22165 1687452
The median distance between genes is 6411 nt, the mean is about 25.7 kb, and so on.
The file refGene.hg38.bed is sorted with BEDOPS sort-bed.
If you're starting from GFF3, you can use BEDOPS gff2bed:
$ awk '($3 == "gene")' annotations.gff | gff2bed - > annotations.bed
To incorporate this into the above one-liner, use bash process substitutions:
$ closest-features --closest --no-ref --no-overlaps --dist <(awk '($3 == "gene")' annotations.gff | gff2bed -) <(awk '($3 == "gene")' annotations.gff | gff2bed -) | cut -d'|' -f2 | grep -v NA | awk '{ if($1<=0){ $1*= -1;} print $1;}' | Rscript -e 'summary(as.numeric(read.table(file("stdin"))[,1]))'
Sure. Make sure the GFF is position-sorted. Grep out the gene lines, subtract the end of the previous gene from the start of the current gene, collect the distances, and calculate the mean/median distance.
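The steps in that last suggestion can be sketched in Python as follows. This is only a rough illustration, not a tested tool: the function names `feature_gaps` and `summarize` are made up for this example, the input is assumed to be tab-delimited GFF3 with the feature type in column 3 and start/end in columns 4 and 5, and overlapping features are assumed to contribute no distance (similar in spirit to `--no-overlaps` above, though the numbers will not necessarily match the BEDOPS output).

```python
import statistics
from collections import defaultdict


def feature_gaps(gff_lines, feature_type="gene"):
    """Collect distances between consecutive, non-overlapping features.

    Distance is start(current) - end(previous), as described above.
    Features are grouped per sequence (column 1) and sorted here,
    so the input does not strictly have to be pre-sorted.
    """
    by_seq = defaultdict(list)
    for line in gff_lines:
        # Skip blank lines, comments, and GFF3 directives.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5 or cols[2] != feature_type:
            continue
        by_seq[cols[0]].append((int(cols[3]), int(cols[4])))

    gaps = []
    for intervals in by_seq.values():
        intervals.sort()
        prev_end = None
        for start, end in intervals:
            # Assumption: overlapping or nested features give no gap.
            if prev_end is not None and start > prev_end:
                gaps.append(start - prev_end)
            # Track the furthest end seen so nested features are handled.
            prev_end = end if prev_end is None else max(prev_end, end)
    return gaps


def summarize(gaps):
    """Return count, mean, and median of the collected distances."""
    return {
        "n": len(gaps),
        "mean": statistics.mean(gaps),
        "median": statistics.median(gaps),
    }
```

To use it, open the annotation file and pass the file handle, e.g. `summarize(feature_gaps(open("annotations.gff")))`. Passing `feature_type="exon"` gives exon-to-exon distances instead, but note that exons from different transcripts of the same gene often overlap, so you may want to restrict the input to one transcript per gene first.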