Question: Extract multi-exonic genes from GTF files
22 months ago, waqaskhokhar999 wrote:

I want to count the multi-exonic genes in a StringTie-assembled GTF file of the Arabidopsis genome. For example, transcript MSTRG.1.2 (transcript_id "MSTRG.1.2") of gene MSTRG.1 (gene_id "MSTRG.1") contains 6 exons (exon_number "1" through exon_number "6"), while transcript MSTRG.2.1 (transcript_id "MSTRG.2.1") of gene MSTRG.2 (gene_id "MSTRG.2") contains only 1 exon (exon_number "1"). The output should look like this:

gene_id   t_name      num_exons
MSTRG.1   MSTRG.1.2   6
MSTRG.1   MSTRG.1.3   5
MSTRG.2   MSTRG.2.1   1

I have checked this link, but the GTF file format used there is different.

Sample input:

1   StringTie   transcript  3651    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; 

1   StringTie   exon    3651    3913    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "1"; 

1   StringTie   exon    3996    4276    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "2"; 

1   StringTie   exon    4506    4605    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "3"; 

1   StringTie   exon    4706    5095    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "4"; 

1   StringTie   exon    5174    5326    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "5"; 

1   StringTie   exon    5439    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "6"; 

1   StringTie   transcript  3657    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; 

1   StringTie   exon    3657    3913    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "1"; 

1   StringTie   exon    3996    4276    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "2"; 

1   StringTie   exon    4486    5095    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "3"; 

1   StringTie   exon    5174    5326    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "4"; 

1   StringTie   exon    5439    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "5"; 

1   StringTie   transcript  15498   15756   1000    .   .   gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; 

1   StringTie   exon    15498   15756   1000    .   .   gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; exon_number "1"; 

1   StringTie   transcript  6788    11170   1000    -   .   gene_id "MSTRG.3"; transcript_id "MSTRG.3.1";

Your question is not clear. FYI, most genes (in human) are multi-exonic. Could you clarify your question, please?

modified 22 months ago • written 22 months ago by Nicolas Rosewick

What you have to do is isolate the gene_id and transcript_id attributes of the exon lines (e.g. using awk) and then count how often each pair occurs (e.g. using uniq -c). I strongly suggest you try to solve this yourself with the help of Google, as this really improves an essential skill in bioinformatics: data sanitation.

written 22 months ago by ATpoint
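
The awk / uniq -c idea from the comment above can also be sketched in Python. This is only a minimal illustration, not the method anyone in the thread actually used. It assumes a tab-delimited GTF in which every exon line carries both gene_id and transcript_id in the ninth column, as in the sample input. The function names exon_counts and multi_exonic_genes, and the inline sample list, are made up for this example.

```python
import re
from collections import Counter

# Attribute patterns for the 9th GTF column, e.g. gene_id "MSTRG.1";
GENE_RE = re.compile(r'gene_id "([^"]+)"')
TX_RE = re.compile(r'transcript_id "([^"]+)"')


def exon_counts(lines):
    """Count exon lines per (gene_id, transcript_id) pair.

    Assumes tab-delimited GTF lines; anything with fewer than 9 columns
    (comments, blank lines) or a feature type other than "exon" is skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        gene = GENE_RE.search(fields[8])
        tx = TX_RE.search(fields[8])
        if gene and tx:
            counts[(gene.group(1), tx.group(1))] += 1
    return counts


def multi_exonic_genes(counts):
    """Return gene_ids that have at least one transcript with more than one exon."""
    return {gene for (gene, _tx), n in counts.items() if n > 1}


# Hypothetical, shortened sample in the same layout as the question's input.
sample = [
    '1\tStringTie\ttranscript\t3651\t5899\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.2";',
    '1\tStringTie\texon\t3651\t3913\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "1";',
    '1\tStringTie\texon\t3996\t4276\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "2";',
    '1\tStringTie\ttranscript\t15498\t15756\t1000\t.\t.\tgene_id "MSTRG.2"; transcript_id "MSTRG.2.1";',
    '1\tStringTie\texon\t15498\t15756\t1000\t.\t.\tgene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; exon_number "1";',
]

counts = exon_counts(sample)

# Print the per-transcript table in the layout requested in the question.
print("gene_id\tt_name\tnum_exons")
for (gene, tx), n in counts.items():
    print(f"{gene}\t{tx}\t{n}")

# Number of genes with at least one multi-exonic transcript.
print("multi-exonic genes:", len(multi_exonic_genes(counts)))
```

To run this on a real assembly, pass an open file handle instead of the sample list, for example exon_counts(open("stringtie.gtf")), where the file name is a placeholder. Counting exon lines rather than reading the highest exon_number keeps the sketch independent of whether that attribute is present.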