How to identify and count multi-exon genes
1
0
Entering edit mode
6.3 years ago
mail2steff ▴ 70

I am trying to analyze alternative splicing events in five different samples. I've aligned the samples to the reference genome using Tophat and assembled the transcripts using cufflinks. I obtained 5 different gtf files from cufflinks. I have to identify and count muti-exon genes from each samples. How can I get the number of multi-exon genes?

cufflinks multi-exon • 3.0k views
ADD COMMENT
0
Entering edit mode

Have you looked at featureCounts? If you have 5 separate GTF files you can use them individually to get gene level counts.

ADD REPLY
0
Entering edit mode

No. I have not. Will I get the number of "multi-exons" genes from each sample?

ADD REPLY
2
Entering edit mode

It is possible outside R. Btw, what is the cut off for multi exon genes. Following code is for genes with more than 1 exon.

Example gtf:

$ head ucsc.refgene.hg38.gtf 
chr1    refGene exon    11874   12227   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018"; exon_number "1"; exon_id "NR_046018.1"; gene_name "DDX11L1";
chr1    refGene exon    12613   12721   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018"; exon_number "2"; exon_id "NR_046018.2"; gene_name "DDX11L1";
chr1    refGene exon    13221   14409   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018"; exon_number "3"; exon_id "NR_046018.3"; gene_name "DDX11L1";
chr1    refGene exon    17369   17436   .   -   .   gene_id "MIR6859-1"; transcript_id "NR_106918"; exon_number "1"; exon_id "NR_106918.1"; gene_name "MIR6859-1";
chr1    refGene exon    17369   17436   .   -   .   gene_id "MIR6859-2"; transcript_id "NR_107062"; exon_number "1"; exon_id "NR_107062.1"; gene_name "MIR6859-2";
chr1    refGene exon    17369   17436   .   -   .   gene_id "MIR6859-3"; transcript_id "NR_107063"; exon_number "1"; exon_id "NR_107063.1"; gene_name "MIR6859-3";
chr1    refGene exon    17369   17436   .   -   .   gene_id "MIR6859-4"; transcript_id "NR_128720"; exon_number "1"; exon_id "NR_128720.1"; gene_name "MIR6859-4";
chr1    refGene exon    30366   30503   .   +   .   gene_id "MIR1302-2"; transcript_id "NR_036051"; exon_number "1"; exon_id "NR_036051.1"; gene_name "MIR1302-2";
chr1    refGene exon    30366   30503   .   +   .   gene_id "MIR1302-9"; transcript_id "NR_036266"; exon_number "1"; exon_id "NR_036266.1"; gene_name "MIR1302-9";
chr1    refGene exon    30366   30503   .   +   .   gene_id "MIR1302-10"; transcript_id "NR_036267"; exon_number "1"; exon_id "NR_036267.1"; gene_name "MIR1302-10";

Output (need to add gene symbol and exon number as header:)):

$ head ucsc.refgene.hg38.gtf | gtf2bed | datamash -sg 4 count 8 | awk '$2 > 1'
DDX11L1 3

To get genes with one exon (2872 genes):

$ convert2bed --input=gtf --output=bed < ucsc.refgene.hg38.gtf | datamash -sg 4 count 8 | awk '$2==1' | wc -l

To get genes with more than one exon (25155 genes):

$ convert2bed --input=gtf --output=bed < ucsc.refgene.hg38.gtf | datamash -sg 4 count 8 | awk '$2>1' | wc -l
ADD REPLY
0
Entering edit mode

Thank you so much for the answer. Ill try this. And there is no cut-off for number of exons for a single gene

ADD REPLY
1
Entering edit mode

np. next time, please post example data.

ADD REPLY
0
Entering edit mode
6.3 years ago
EagleEye 7.5k

I assume you want to identify multi-exon transcripts from your GTF file. I hope this post will help you making table with number of exons per transcript.

ADD COMMENT

Login before adding your answer.

Traffic: 1992 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6