Extract multi-exonic genes from GTF files
4.9 years ago

I want to count the multi-exonic genes in a StringTie-assembled GTF file of the Arabidopsis genome. For example, transcript MSTRG.1.2 (transcript_id "MSTRG.1.2") of gene MSTRG.1 (gene_id "MSTRG.1") contains 6 exons (exon_number "1" to exon_number "6"), while transcript MSTRG.2.1 (transcript_id "MSTRG.2.1") of gene MSTRG.2 (gene_id "MSTRG.2") contains only 1 exon (exon_number "1"). The output should look like this:

gene_id   t_name      num_exons
MSTRG.1   MSTRG.1.2   6
MSTRG.1   MSTRG.1.3   5
MSTRG.2   MSTRG.2.1   1

I have checked this link, but the GTF file format used there is different from mine.

Sample input:

1   StringTie   transcript  3651    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2";
1   StringTie   exon    3651    3913    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "1";
1   StringTie   exon    3996    4276    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "2";
1   StringTie   exon    4506    4605    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "3";
1   StringTie   exon    4706    5095    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "4";
1   StringTie   exon    5174    5326    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "5";
1   StringTie   exon    5439    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "6";
1   StringTie   transcript  3657    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3";
1   StringTie   exon    3657    3913    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "1";
1   StringTie   exon    3996    4276    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "2";
1   StringTie   exon    4486    5095    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "3";
1   StringTie   exon    5174    5326    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "4";
1   StringTie   exon    5439    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "5";
1   StringTie   transcript  15498   15756   1000    .   .   gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";
1   StringTie   exon    15498   15756   1000    .   .   gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; exon_number "1";
1   StringTie   transcript  6788    11170   1000    -   .   gene_id "MSTRG.3"; transcript_id "MSTRG.3.1";

Your question is not clear. FYI, most genes (in human) are multi-exonic. Could you please clarify your question?


What you have to do is isolate the gene_id and transcript_id fields on the exon lines, e.g. using awk, and then count how often each pair occurs, e.g. using uniq -c. I strongly suggest you try to solve this yourself using Google, as this really improves an essential skill in bioinformatics: data sanitation.
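
For anyone who prefers a script over awk and uniq -c, here is a minimal Python sketch of the same idea: take the gene_id and transcript_id from each exon line and count how many times each pair appears. It assumes the attribute column looks like the sample input in the question (gene_id "..."; transcript_id "...";). The function names count_exons and multi_exonic_genes are made up for this illustration, and the file name passed on the command line is whatever your StringTie output is called. Treat it as a starting point, not a tested pipeline.

```python
import re
import sys

# Attribute patterns as they appear in the sample StringTie GTF above.
GENE_RE = re.compile(r'gene_id "([^"]+)"')
TX_RE = re.compile(r'transcript_id "([^"]+)"')


def count_exons(gtf_lines):
    """Return {(gene_id, transcript_id): number of exon lines}."""
    counts = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        # Split into the 9 GTF columns; the 9th keeps the whole attribute string.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are counted
        gene = GENE_RE.search(fields[8])
        tx = TX_RE.search(fields[8])
        if gene and tx:
            key = (gene.group(1), tx.group(1))
            counts[key] = counts.get(key, 0) + 1
    return counts


def multi_exonic_genes(counts):
    """Genes that have at least one transcript with more than one exon."""
    return {gene for (gene, _tx), n in counts.items() if n > 1}


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        exon_counts = count_exons(handle)
    print("gene_id\tt_name\tnum_exons")
    for (gene, tx), n in exon_counts.items():
        print(f"{gene}\t{tx}\t{n}")
    print(f"multi-exonic genes: {len(multi_exonic_genes(exon_counts))}", file=sys.stderr)
```

Run it with the GTF path as the only argument. The table goes to standard output and the multi-exonic gene count goes to standard error, so you can redirect the table to a file. Here a gene is called multi-exonic if any of its transcripts has more than one exon; change multi_exonic_genes if you need a stricter definition.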
