How to get the total genic and intergenic length of a chromosome?
2
0
Entering edit mode
4.6 years ago
Zee_S ▴ 60

Hello Biostars community!

Can you kindly suggest me a method to compute the total genic and intergenic length of a chromosome?

Thank you very much for your help!

genic intergenic length • 2.1k views
0
Entering edit mode

Do you have annotations available for this genome?

0
Entering edit mode

Hello genomax,

Yes, I have the coordinates of annotated genes in a bed file. and I also have a sizes.genome file with the chromosomes sizes.

0
Entering edit mode

I assume you are asking this question because you are not familiar with a scripting language? If you are, you should be able to add up intervals for your genes using bed file start/stops and keep track of regions in between.

There may be a way to do this using bedops, awk and/or a combination of both. Can you post the output of 10-15 lines (head -15 your.bed) from your bed file to give people an idea of what you have?

0
Entering edit mode

I am posting here a 10-line snapshot of my genes.bed: thank you for your help!

gene    1666    2818
gene    4096    5114
gene    21496   28507
gene    40486   46470
gene    49036   54240
gene    73329   91655
gene    99448   122165
gene    133623  138575
gene    141258  149665
gene    151974  157575


I did the following but i am not sure its correct:

I merged overlapping genes with bedtools merge. then I created 100 bp bins of the merged intervals using bedtools window maker. then summed up the 100bp bins. I took this as the total genic length. is it correct? because when I sum up the differences of the merged intervals without binning, the value is different.

to get the intergenic length, I used bedtoools complement. and made 100bp bins of complement intervals and summed up these bins to get the "intergenic length". is this correct?

Is there a better way to do it?

Many thanks!

0
Entering edit mode

What does gene equate to, exons?

0
Entering edit mode

gene equates to TSS to TES. exon coordinates are within these intervals in a separate gtf file that I didn't show here.

0
Entering edit mode

does genic length mean excluding introns? because for my analyses I need to take intron coordinates into account. in that case, do I have to extract intron coordinates and use these in the intergenic calculation? thanks for your help

0
Entering edit mode

If you only want to include what gets translated then you would need to exclude introns/UTR's etc. Otherwise TSS to TES is the full gene sequence.

I need to take intron coordinates into account

What does that mean?

I am not sure what you are doing above in terms of binning things. Why not keep a single interval from TSS to TES as gene.

0
Entering edit mode

Edit: Yeah, never mind.

1
Entering edit mode
4.6 years ago

Since you have a GTF file, you can do the following in R:

library(GenomicRanges)
library(rtracklayer)
gtf = import("genes.gtf")  # or whatever it's called
foo = split(gtf, seqnames(gtf))
sapply(foo, function(x) sum(width(reduce(x))))


You now have a list printed to the screen of the number of bases in a gene in each chromosome. You can merge the last two lines together if you'd like.

1
Entering edit mode
4.6 years ago

It looks like you have a .gtf file. That means you can extract the exon lines from the .gtf file and count and sum up the exonic intervals.

You can generate a sorted .bed file of exon coordinates by:

grep -P '\texon\t' your.gtf | cut -f 1,4,5 | sort -k1,1 -k2,2n > exons.bed


You can merge this exons.bed using bedtools:

bedtools merge -i exons.bed > exons.merged.bed


You can count/sum the intervals in the merged bed file:

awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}' exons.merged.bed


This should give you number of genic bases, if you are defining genic by just exons. To get intergenic, just sum up your chromosome lengths and subtract the genic number.

You can do all this in one line also.

grep -P '\texon\t' your.gtf | cut -f 1,4,5 | sort -k1,1 -k2,2n | bedtools merge -i stdin | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

0
Entering edit mode

Hello everyone,

You can't extract the exon lines directly from the .gtf file and work with it like if it was a bed file. Bed files are 0-based and gtf files are 1-based. First, we have to convert the gtf file in a bed file. We can do it with sortBed and gff2bed from Bedops:

# Convert from gff to bed
sortBed -i your.gtf | gff2bed > your.bed


And then extract the columns of interest from your new bed file.

You can check an example in my repository: https://github.com/tmontserrat/proportion_exon_regions