Question: Extract multi-exonic genes from GTF files
3 months ago, waqaskhokhar999 wrote:

I want to count the multi-exonic genes in a StringTie-assembled GTF file for the Arabidopsis genome. For example, the transcript with transcript_id "MSTRG.1.2" of the gene with gene_id "MSTRG.1" contains 6 exons (exon_number "1" through exon_number "6"), while the transcript with transcript_id "MSTRG.2.1" of the gene with gene_id "MSTRG.2" contains only 1 exon (exon_number "1"). The output should look like this:

gene_id   t_name      num_exons
MSTRG.1   MSTRG.1.2   6
MSTRG.1   MSTRG.1.3   5
MSTRG.2   MSTRG.2.1   1

I have checked this link, but the GTF file format used there is different.

Sample input:

1   StringTie   transcript  3651    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; 

1   StringTie   exon    3651    3913    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "1"; 

1   StringTie   exon    3996    4276    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "2"; 

1   StringTie   exon    4506    4605    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "3"; 

1   StringTie   exon    4706    5095    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "4"; 

1   StringTie   exon    5174    5326    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "5"; 

1   StringTie   exon    5439    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.2"; exon_number "6"; 

1   StringTie   transcript  3657    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; 

1   StringTie   exon    3657    3913    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "1"; 

1   StringTie   exon    3996    4276    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "2"; 

1   StringTie   exon    4486    5095    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "3"; 

1   StringTie   exon    5174    5326    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "4"; 

1   StringTie   exon    5439    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "5"; 

1   StringTie   transcript  15498   15756   1000    .   .   gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; 

1   StringTie   exon    15498   15756   1000    .   .   gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; exon_number "1"; 

1   StringTie   transcript  6788    11170   1000    -   .   gene_id "MSTRG.3"; transcript_id "MSTRG.3.1";

Your question is not clear. FYI, most genes (in human) are multi-exonic. Could you clarify your question, please?

written 3 months ago by Nicolas Rosewick

What you have to do is isolate the gene_id and transcript_id parts of each exon line, e.g. using awk, and then count the occurrences, e.g. using uniq -c. I strongly suggest you try to solve this yourself using Google, as this really improves an essential skill in bioinformatics: data sanitation.

written 3 months ago by ATpoint
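
For illustration, the "isolate gene_id and transcript_id, then count" idea suggested above can also be sketched in Python instead of awk and uniq -c. This is only a rough sketch under some assumptions: the function names count_exons and multi_exonic_genes are made up for this example, the attribute column is assumed to follow the key "value"; layout shown in the sample input, and a gene is treated as multi-exonic when at least one of its transcripts has more than one exon record. The embedded SAMPLE is copied from the question so the block runs on its own.

```python
import re
from collections import Counter

# Matches key "value"; pairs in the GTF attribute column (assumed layout).
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

# A few records copied from the sample input in the question.
SAMPLE = '''\
1 StringTie transcript 3657 5899 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.3";
1 StringTie exon 3657 3913 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "1";
1 StringTie exon 3996 4276 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "2";
1 StringTie exon 4486 5095 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "3";
1 StringTie exon 5174 5326 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "4";
1 StringTie exon 5439 5899 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.3"; exon_number "5";
1 StringTie transcript 15498 15756 1000 . . gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";
1 StringTie exon 15498 15756 1000 . . gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; exon_number "1";
'''


def count_exons(gtf_lines):
    """Return a Counter mapping (gene_id, transcript_id) to its number of exon records."""
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        # Split on any whitespace so both tab- and space-separated text work;
        # this assumes the first eight columns contain no embedded spaces.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are counted
        attrs = dict(ATTR_RE.findall(fields[8]))
        counts[(attrs["gene_id"], attrs["transcript_id"])] += 1
    return counts


def multi_exonic_genes(counts):
    """Return the set of gene_ids having at least one transcript with more than one exon."""
    return {gene for (gene, _), n in counts.items() if n > 1}


counts = count_exons(SAMPLE.splitlines())
print("gene_id\tt_name\tnum_exons")
for (gene, transcript), n in counts.items():
    print(f"{gene}\t{transcript}\t{n}")
print("multi-exonic genes:", len(multi_exonic_genes(counts)))
```

To run this on a real file, one could pass an open file handle instead of SAMPLE.splitlines(), e.g. count_exons(open("stringtie.gtf")), where the file name is only a placeholder. Whether "multi-exonic gene" should mean any transcript or all transcripts having several exons depends on the analysis, so the condition in multi_exonic_genes may need adjusting.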