The output of Cufflinks is a GTF file that contains only exon records. I need to create a new GTF that also includes 'gene' and 'transcript' entries. Is there an automated way to do that?
Example:
FROM:
chr1 Cufflinks exon 4807788 4807982 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1 Cufflinks exon 4808454 4808486 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "2";
chr1 Cufflinks exon 4828584 4828649 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "3";
chr1 Cufflinks exon 4830268 4830315 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "4";
chr1 Cufflinks exon 4832311 4832381 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "5";
chr1 Cufflinks exon 4837001 4837074 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "6";
chr1 Cufflinks exon 4839387 4839488 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "7";
chr1 Cufflinks exon 4840956 4842827 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "8";
TO:
chr1 Cufflinks gene 4807788 4842827 . + . gene_id "XLOC_000019";
chr1 Cufflinks transcript 4807788 4842827 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025";
chr1 Cufflinks exon 4807788 4807982 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1 Cufflinks exon 4808454 4808486 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "2";
chr1 Cufflinks exon 4828584 4828649 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "3";
chr1 Cufflinks exon 4830268 4830315 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "4";
chr1 Cufflinks exon 4832311 4832381 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "5";
chr1 Cufflinks exon 4837001 4837074 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "6";
chr1 Cufflinks exon 4839387 4839488 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "7";
chr1 Cufflinks exon 4840956 4842827 . + . gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "8";
For anyone reading this years later: gffread used to infer transcripts but not genes when the genes were not in your original file (e.g. in the example above). Now gffread can do it properly with the option --keep-genes (updated in a git commit from May 19, 2020). So to get both transcripts AND genes, you can run:
gffread -E merged.gtf -o- > merged.gff3
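A quick way to confirm that a converted file really contains gene and transcript records is to tally the feature column (column 3). The snippet below is only a sketch: the function name count_feature_types is made up for illustration, and it assumes a standard tab-separated GTF/GFF file.

```python
from collections import Counter


def count_feature_types(lines):
    """Count how often each feature type (column 3) occurs.

    Assumes tab-separated GTF/GFF lines; comment and blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts
```

Calling it on an open file handle, for example count_feature_types(open("merged.gff3")), returns a Counter keyed by feature type, so missing gene or transcript records show up immediately.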
~Chirag.
Looks like you have not tried anything. You could explore very simple ways of achieving it, like using bedtools groupBy
Output:
You can tweak these commands, chain them with pipes, and get what you are looking for. If you don't know what a tool or a piece of code given by others is doing, it is better not to use it blindly.
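For readers who prefer a script, the same grouping idea (collapse exons per gene_id and transcript_id, then take the smallest start and largest end) can be sketched in plain Python. This is not the bedtools approach from this answer, only an illustration under stated assumptions: the input is a standard tab-separated GTF with 9 columns, every exon line carries gene_id and transcript_id, and a gene sits on one chromosome and strand. The function name add_gene_and_transcript_rows is hypothetical.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def add_gene_and_transcript_rows(lines):
    """Return GTF lines with inferred gene and transcript rows added.

    Assumes tab-separated exon lines; non-exon lines are ignored.
    """
    genes = {}  # gene_id -> coordinates plus its transcripts, in input order
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        gid, tid = attrs["gene_id"], attrs["transcript_id"]
        start, end = int(cols[3]), int(cols[4])

        gene = genes.setdefault(gid, {
            "chrom": cols[0], "source": cols[1], "strand": cols[6],
            "start": start, "end": end, "transcripts": {},
        })
        gene["start"] = min(gene["start"], start)
        gene["end"] = max(gene["end"], end)

        tx = gene["transcripts"].setdefault(
            tid, {"start": start, "end": end, "exons": []})
        tx["start"] = min(tx["start"], start)
        tx["end"] = max(tx["end"], end)
        tx["exons"].append(line)

    out = []
    for gid, gene in genes.items():
        head = [gene["chrom"], gene["source"]]
        tail = [".", gene["strand"], "."]
        # One gene row spanning all exons of all its transcripts.
        out.append("\t".join(
            head + ["gene", str(gene["start"]), str(gene["end"])] + tail
            + [f'gene_id "{gid}";']))
        for tid, tx in gene["transcripts"].items():
            # One transcript row spanning all exons of that transcript.
            out.append("\t".join(
                head + ["transcript", str(tx["start"]), str(tx["end"])] + tail
                + [f'gene_id "{gid}"; transcript_id "{tid}";']))
            out.extend(tx["exons"])  # original exon lines, unchanged
    return out
```

To use it, pass an open file handle of the Cufflinks GTF to the function and write the returned lines, joined by newlines, to a new file. Exon lines are emitted unchanged and in their original order under each transcript.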