The output of Cufflinks is a GTF file that contains only exon entries. I need to create a new GTF that also includes 'gene' and 'transcript' entries. Is there an automated way to do that?
Example:
FROM:
chr1    Cufflinks       exon    4807788 4807982 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1    Cufflinks       exon    4808454 4808486 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "2";
chr1    Cufflinks       exon    4828584 4828649 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "3";
chr1    Cufflinks       exon    4830268 4830315 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "4";
chr1    Cufflinks       exon    4832311 4832381 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "5";
chr1    Cufflinks       exon    4837001 4837074 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "6";
chr1    Cufflinks       exon    4839387 4839488 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "7";
chr1    Cufflinks       exon    4840956 4842827 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "8";
TO:
chr1    Cufflinks       gene    4807788 4842827 .       +       .       gene_id "XLOC_000019";
chr1    Cufflinks       transcript    4807788 4842827 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025";
chr1    Cufflinks       exon    4807788 4807982 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1    Cufflinks       exon    4808454 4808486 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "2";
chr1    Cufflinks       exon    4828584 4828649 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "3";
chr1    Cufflinks       exon    4830268 4830315 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "4";
chr1    Cufflinks       exon    4832311 4832381 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "5";
chr1    Cufflinks       exon    4837001 4837074 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "6";
chr1    Cufflinks       exon    4839387 4839488 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "7";
chr1    Cufflinks       exon    4840956 4842827 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "8";
For anyone reading this years later: gffread used to infer transcripts but not genes if the genes were not in your original file (e.g. in the example above). Now gffread can do it properly with the option --keep-genes (updated in a git commit from May 19, 2020). So, to get both transcripts AND genes, you can run:
gffread -E --keep-genes merged.gtf -o- > merged.gff3
~Chirag.
Looks like you have not tried anything. You could explore very simple ways of achieving it, like using bedtools groupBy
Output:
You can tweak these commands, chain them with pipes, and get what you are looking for. If you don't know what a tool or a piece of code given by others is doing, it is better not to use it blindly.
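For readers who want to see what the grouping idea amounts to, here is a minimal Python sketch. It is not the bedtools or gffread implementation, only an illustration under stated assumptions: every input row of interest is an exon, each gene_id lives on a single chromosome and strand, and the attribute column uses the key "value"; layout shown in the question. The function name add_gene_and_transcript_rows is hypothetical.

```python
import re

# Matches GTF attribute pairs such as: gene_id "XLOC_000019";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def _split_fields(line):
    """Split a GTF row into 9 columns (tabs, or whitespace for pasted text)."""
    fields = line.split("\t")
    return fields if len(fields) == 9 else line.split(None, 8)


def _row(template, feature, start, end, attributes):
    """Build a GTF row, copying seqname, source, score, strand, frame."""
    return "\t".join([template[0], template[1], feature, str(start),
                      str(end), template[5], template[6], template[7],
                      attributes])


def add_gene_and_transcript_rows(gtf_lines):
    """Return GTF rows with inferred 'gene' and 'transcript' rows added.

    Assumption: a gene or transcript spans from the smallest exon start
    to the largest exon end among the exons sharing its ID.
    """
    genes = {}  # gene_id -> span, template row, and its transcripts
    for raw in gtf_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = _split_fields(line)
        if len(fields) != 9 or fields[2] != "exon":
            continue  # only exon rows are used to infer the spans
        attrs = dict(ATTR_RE.findall(fields[8]))
        start, end = int(fields[3]), int(fields[4])
        gene = genes.setdefault(attrs["gene_id"], {
            "start": start, "end": end, "fields": fields, "transcripts": {}})
        gene["start"] = min(gene["start"], start)
        gene["end"] = max(gene["end"], end)
        tx = gene["transcripts"].setdefault(attrs["transcript_id"], {
            "start": start, "end": end, "fields": fields, "exons": []})
        tx["start"] = min(tx["start"], start)
        tx["end"] = max(tx["end"], end)
        tx["exons"].append("\t".join(fields))

    out = []
    for gene_id, gene in genes.items():
        out.append(_row(gene["fields"], "gene", gene["start"], gene["end"],
                        f'gene_id "{gene_id}";'))
        for tx_id, tx in gene["transcripts"].items():
            out.append(_row(tx["fields"], "transcript", tx["start"],
                            tx["end"],
                            f'gene_id "{gene_id}"; transcript_id "{tx_id}";'))
            out.extend(tx["exons"])
    return out
```

To use it, pass an open file handle (or any iterable of lines) for the exon-only GTF and print each returned row. Genes and transcripts are emitted in the order they first appear in the input, so sort the result afterwards if a downstream tool expects coordinate order.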