Percentage of exons, introns and intergenic regions
1
0
Entering edit mode
16 months ago
Re • 0

Dear all, I am having trouble with an exercise. I should create a bash script and calculate the percentages of exons, introns and intergenic regions from mice and Human . However, Im stocked Maybe you can help me.

wget ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz  # Download the compressed GTF file for HUMAN 
gunzip Homo_sapiens.GRCh38.108.gtf.gz 
grep exon Homo_sapiens.GRCh38.108.gtf | cut -f1,4,5,7 | wc -l
grep intron Homo_sapiens.GRCh38.108.gtf | cut -f1,4,5,7 | wc -l
grep inter Homo_sapiens.GRCh38.108.gtf | cut -f1,4,5,7 | wc -l

Problem: I do not get any output in the last command (inter for intergenic regions) and I do not know how to calculate the chromosome size to calculate the percentage.

Does anyone have a suggestion, I also tried bedtools with no outcome, perhaps bc im using it wrong.

Thanks

intergenic intron regions exon • 1.2k views
ADD COMMENT
2
Entering edit mode
16 months ago
ATpoint 82k

That is because intergenic regions are not annotated in GTF files. GTF files contain only genes (that is the G in GTF) and its subunits (transcripts, UTRs, exons, CDS etc). Intergenic is simple though, it is the complement of the entire genome and the genes. To get this with existing tools make a BED file of the genome containing all chromosomes, so like:

chr1   0    length
chr2   0    length
(...)

...with length being the chromosome length in bp. You can get these chromsizes files from UCSC (scross to the bottom of that link) and other sources, or get the reference genome in fasta format and count the chromosome length with something like bioawk. Then use bedtools complement to complement that genome with the GTF rows of type gene, output is then your intergenic regions.

ADD COMMENT
0
Entering edit mode

Thank you so much. That was very helpful

ADD REPLY
0
Entering edit mode

In addition, I suggest using awk commands instead of grep to avoid ambiguity. For example, if you wish to fetch exons, do awk '$3 == "exon"' Homo_sapiens.GRCh38.108.gtf. You may also want to consider using annotations in GFF3 rather than GTF (i.e. GFF2) format.

ADD REPLY

Login before adding your answer.

Traffic: 1586 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6