Question

Counting how many transcripts per gene in gtf file

0

Entering edit mode

4.1 years ago

sallyey2 • 0

Hi I want to count the number of transcripts per gene in GTF file, using python 2 on Ubuntu 18.04. Could you help me how to count how many transcripts per gene in GTF file? Your help is much appreciated!

I am using a public gtf.gz file from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.annotation.gtf.gz and the output gtf file briefly looks like below, but it's truncated when I copied it from Ubuntu. Each row has more data including gene_type, gene_name, transcript_type, etc.

##description: evidence-based annotation of the human genome (GRCh38), version 33 (Ensembl 99)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2019-12-13
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; lev
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unproc
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  transcript      12010   13670   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unproc
chr1    HAVANA  exon    12010   12057   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    12179   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    12613   12697   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    12975   13052   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    13221   13374   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  exon    13453   13670   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_ps
chr1    HAVANA  gene    14404   29570   .       -       .       gene_id "ENSG00000227232.5"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; level 2; hgnc_id
chr1    HAVANA  transcript      14404   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudo
chr1    HAVANA  exon    29534   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; g

rna-seq gene python gtf • 3.2k views

ADD COMMENT • link updated 4.1 years ago by wm ▴ 560 • written 4.1 years ago by sallyey2 • 0

0

Entering edit mode

Please do not delete posts once they have received a comment or an answer. If an answer has solved your problem please accept it by using the green check mark to provide closure to this thread.

ADD REPLY • link 4.1 years ago by GenoMax 141k

score 1 · Answer 1 · 2020-03-19

1

Entering edit mode

4.1 years ago

liorglic ★ 1.4k

Here is a quick option with awk:

awk -F "\t" '$3 == "transcript" {split($9,a,";"); print a[1]}' | sort | uniq -c

but you may also want to check out a library that parses gtf/gff in your favorite programming language.

ADD COMMENT • link 4.1 years ago by liorglic ★ 1.4k

score 0 · Answer 2 · 2020-03-19

First, you did not paste the entire lines above. What you see contains gene_type, ... , is the complete version GTF record.

Second, number of transcripts per gene. the 3-column is transcript, and parse the "gene_id" and "transcript_id", use dict to count transcripts for each gene.

#!/usr/bin/env python3
# filename: aaa.py
# python3

import os
import sys
import gzip

gtf_file = 'input.gtf.gz'

d = {}
with gzip.open(gtf_file, 'rt') as r:
    for line in r:
        if line.startswith('#'):
            continue
        fields = line.strip().split('\t')
        if not fields[2] == 'transcript':
            continue
        gene = tx = ''
        for spec in fields[8].split(';'):
            if spec.startswith('gene_id'):
                gene = spec.replace('gene_id ', '')
            elif spec.startswith('transcript_id'):
                tx = spec.replace('transcript_id ', '')
            else:
                continue
        d[gene] = d.get(gene, 0) + 1

i = 1
for k, v in d.items():
   if i > 3:
       break
   print(k, v)
   i += 1

Here are the result:

$ python aaa.py gencode.v33.annotation.gtf.gz
"ENSG00000223972.5" 2
"ENSG00000227232.5" 1
"ENSG00000278267.1" 1