Average length of each exon and intron
0
3
Entering edit mode
2.2 years ago
kashiff007 ★ 1.8k

Hi,

I want to estimate the average length of each exon and intron (first 3 and last 3) from the gff/gtf file, something shown below:

exon1       500
exon2       400
exon3       300
exon-3      450
exon-2      300
exon-1      299

intron1     500
intron2     400
intron3     300
intron-3    450
intron-2    300
intron-1    299


Here exon1 is the first exon from each gene while exon-1 is the last exon from each gene. Similarly, intron1 is the first exon from each gene while intron-1 is the last exon from each gene.

I tried to search tools for this but haven't found any.

I tried to write a python code but unable to make it workable.

#!/usr/bin/python
import pandas as pd
import numpy as np

df = open("TAIR10_GFF3_genes.gff","r")
df1 = []
val = []
for line in df:
line = line.rstrip("\n")
line = line.split("\t")
if str(line[2])=="exon":
val.append(line[8])
val_unique = set(val)
print(val_unique, len(val_unique),len(val))


The input is the standard gff file:

Chr1    TAIR10  chromosome  1   30427671    .   .   .   ID=Chr1;Name=Chr1
Chr1    TAIR10  gene    3631    5899    .   +   .   ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10  mRNA    3631    5899    .   +   .   ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10  protein 3760    5630    .   +   .   ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR10  exon    3631    3913    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  five_prime_UTR  3631    3759    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3760    3913    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    3996    4276    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3996    4276    .   +   2   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    4486    4605    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 4486    4605    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    4706    5095    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 4706    5095    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    5174    5326    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 5174    5326    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    5439    5899    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 5439    5630    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  three_prime_UTR 5631    5899    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  gene    5928    8737    .   -   .   ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
Chr1    TAIR10  mRNA    5928    8737    .   -   .   ID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1;Index=1
Chr1    TAIR10  protein 6915    8666    .   -   .   ID=AT1G01020.1-Protein;Name=AT1G01020.1;Derives_from=AT1G01020.1
Chr1    TAIR10  five_prime_UTR  8667    8737    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 8571    8666    .   -   0   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    8571    8737    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 8417    8464    .   -   0   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    8417    8464    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 8236    8325    .   -   0   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    8236    8325    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 7942    7987    .   -   0   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    7942    7987    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 7762    7835    .   -   2   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    7762    7835    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 7564    7649    .   -   0   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    7564    7649    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 7384    7450    .   -   1   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    7384    7450    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 7157    7232    .   -   0   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  exon    7157    7232    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  CDS 6915    7069    .   -   2   Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1    TAIR10  three_prime_UTR 6437    6914    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  exon    6437    7069    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  three_prime_UTR 5928    6263    .   -   .   Parent=AT1G01020.1
Chr1    TAIR10  exon    5928    6263    .   -   .   Parent=AT1G01020.1

next-gen gtf gff bed bedtools • 663 views
0
Entering edit mode

There are quite a few issues with your code, the main one being that you are not separating your analysis by mRNA, which is necessary if you want to get the average length of the 1st, 2nd, 3rd exons etc, across all mRNAs. In general, it is better to use existing libraries for parsing specific formats, rather than "inventing the wheel" yourself. In this case, I suggest you take a look at gffutils. In general, what you'll need to do is iterate over all mRNA features, get all exons for each mRNA and aggregate lengths according to exon index (1st, 2nd, etc). At the end you can calculate the average length per index.