Question

How to create 'gene' and 'transcript' entries in a GTF containing only exons

5

Entering edit mode

8.6 years ago

Antonio Camargo ▴ 160

The result of Cufflinks is a GTF file that only has exons. I need to create a new GTF that includes 'gene' and 'transcript' entries. Is there a automated way to do that?

Example:

FROM:

chr1    Cufflinks       exon    4807788 4807982 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1    Cufflinks       exon    4808454 4808486 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "2";
chr1    Cufflinks       exon    4828584 4828649 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "3";
chr1    Cufflinks       exon    4830268 4830315 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "4";
chr1    Cufflinks       exon    4832311 4832381 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "5";
chr1    Cufflinks       exon    4837001 4837074 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "6";
chr1    Cufflinks       exon    4839387 4839488 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "7";
chr1    Cufflinks       exon    4840956 4842827 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "8";

TO:

chr1    Cufflinks       gene    4807788 4842827 .       +       .       gene_id "XLOC_000019";
chr1    Cufflinks       transcript    4807788 4807982 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1    Cufflinks       exon    4807788 4807982 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1    Cufflinks       exon    4808454 4808486 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "2";
chr1    Cufflinks       exon    4828584 4828649 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "3";
chr1    Cufflinks       exon    4830268 4830315 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "4";
chr1    Cufflinks       exon    4832311 4832381 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "5";
chr1    Cufflinks       exon    4837001 4837074 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "6";
chr1    Cufflinks       exon    4839387 4839488 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "7";
chr1    Cufflinks       exon    4840956 4842827 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "8";

cufflinks gtf • 14k views

ADD COMMENT • link updated 8.1 years ago by gandrescabrera ▴ 80 • written 8.6 years ago by Antonio Camargo ▴ 160

3

Entering edit mode

For anyone reading this years after, gffread could infer transcripts but no genes if the genes were not in your original file (eg in the example above). Now gffread can do it properly with the option --keep-genes (updated - git commit from May 19, 2020). So to get both transcripts AND genes, you can run:

gffread -E --keep-genes input.gtf -o- > output.gff3

ADD REPLY • link 5.2 years ago by nlehmann ▴ 150

0

Entering edit mode

gffread -E merged.gtf -o- > merged.gff3

~Chirag.

ADD REPLY • link 8.6 years ago by Chirag Parsania ★ 2.0k

0

Entering edit mode

Looks like you have not tried anything. You could explore very simple ways of achieving it, like using bedtools groupBy

cat GeneID_transcript.gtf | sed 's/[;]/\t/g' | groupBy -g 1,9 -c 4,5 -o min,max

OutPut:

chr1 gene_id "XLOC_000019" 4807788 4842827

cat GeneID_transcript.gtf | sed 's/[;]/\t/g' | groupBy -g 1,10 -c 4,5 -o min,max

chr1 transcript_id "TCONS_00000025" 4807788 4842827

You can tweak around these commands and use pipes or whatever and achieve what you are looking for. If you don't know what a tool or codes given by others is doing, better not to use blindly.

ADD REPLY • link 8.6 years ago by GouthamAtla 12k

score 7 · Answer 1 · 2016-11-29

7

Entering edit mode

8.6 years ago

Chirag Parsania ★ 2.0k

What about using one line gffread command on cufflink generated merged.gtf file ?

gffread -E merged.gtf -o- > merged.gff3

This will give you exactly what you have asked.

Thanks, Chirag.

ADD COMMENT • link 8.6 years ago by Chirag Parsania ★ 2.0k

0

Entering edit mode

Hi,

Thanks for your answers. I am have the same problem in DEXseq. Firstly, since I don't have the gtf file of honey bee, I use gffread to convert the gff file to the gtf file. however, I had the same problem as shown below: https://support.bioconductor.org/p/96488/

Then, I followed your comment and I convert back gff file:

./gffread -E /Users/LQS/Desktop/Apis_4.5_QS.gtf -o- > /Users/LQS/Desktop/DEXseq_2nd_Apis45.gff3

After that, I stored your perl script in R studio and changed the directory of the input file.

The input file was the one strored in .txt file format and specified the directory of the output gtf file which I want to use for the dexseq_prepare_annotation

I did like this:

open(F,"/Users/LQS/Desktop/DEXseq_2nd_Apis45.txt"); # Input file open(F1,"》/Users/LQS/Desktop/data.gtf"); # Output gtf file

after running: perl XXX.pl I used the data.gtf for the dexseq_annotation_prepare, the same error happened again:

Traceback (most recent call last): File "/pub40/acdguest/Qiushi/scripts/DEXseq_script/dexseq_prepare_annotation.py", line 54, in <module> f.attr['gene_id'] = f.attr['gene_id'].replace( ":", "_" ) KeyError: 'gene_id'

Could you please help me on this?? Many thanks in advance.

Qiushi Liu

ADD REPLY • link 8.1 years ago by 514028929 ▴ 10

0

Entering edit mode

Thanks a lot, Your answer solved my question.

ADD REPLY • link 4.6 years ago by hjsunbio ▴ 10

score 4 · Answer 2 · 2016-11-29

Hi

A lazy solution is provided in perl. Code can be optimized if needed. Better solution must be there. It should work. Let us know if any changes needed. Input file: data.txt Output file: data.gtf

Thanks

Priyabrata

Persistent LABS

open(F,"data.txt");  # Input file
open(F1,">data.gtf"); # Output gtf file

my %gene;  # hash to store gene->transcript->exon information
my %genepos; # hash to store gene's start and end position
my %transpos; # hash to store transcript's start and end position
my %geneinfo; # hash to store information details of gene like chr, strand etc

# Parse input file and create all the data structures in the form of hash
map
{
  my $line=$_;

      # Fetch gene id
      $line=~/gene_id \"([0-9a-zA-Z_]+)\";/;
  my $geneid=$1;

     # Fetch transcript id
$line=~/transcript_id \"([0-9a-zA-Z_]+)\";/;
my $transid=$1;

my @temp=split /\s+/,$line;

push @{$gene{$geneid}{$transid}},$line;
push @{$genepos{$geneid}},@temp[3..4];
push @{$transpos{$geneid}{$transid}},@temp[3..4];
push @{$geneinfo{$geneid}},@temp[0,1,6];
}<F>;

# Print the results in gtf file
map
{
my $temp1=$_;
my @temppos=sort {$a<=>$b} @{$genepos{$temp1}};
print F1 $geneinfo{$temp1}->[0],"\t",$geneinfo{$temp1}->[1],"\tgene\t$temppos[0]\t$temppos[$#temppos]\t\.\t$geneinfo{$temp1}->[2]\t\.\tgene_id \"$temp1\";\n";
map
{
    my $temp2=$_;
    my @temppos1=sort {$a<=>$b} @{$transpos{$temp1}{$temp2}};
    print F1 $geneinfo{$temp1}->[0],"\t",$geneinfo{$temp1}->[1],"\ttranscript\t$temppos1[0]\t$temppos1[$#temppos1]\t\.\t$geneinfo{$temp1}->[2]\t\.\tgene_id \"$temp1\"; transcript_id \"$temp2\";\n";   
        map
        {
            print F1 "$_";
        }@{$gene{$temp1}{$temp2}};
    }keys %{$gene{$temp1}}
}keys %gene;

score 1 · Answer 3 · 2017-06-26

Hi,

Thanks for your answers. I am have the same problem in DEXseq.
Firstly, since I don't have the gtf file of honey bee, I use gffread to convert the gff file to the gtf file. however, I had the same problem as shown below: https://support.bioconductor.org/p/96488/

Then, I followed your comment and I convert back gff file:

./gffread -E /Users/LQS/Desktop/Apis_4.5_QS.gtf -o- > /Users/LQS/Desktop/DEXseq_2nd_Apis45.gff3

After that, I stored your perl script in R studio and changed the directory of the input file.

The input file was the one strored in .txt file format and specified the directory of the output gtf file which I want to use for the dexseq_prepare_annotation

I did like this:

open(F,"/Users/LQS/Desktop/DEXseq_2nd_Apis45.txt"); # Input file open(F1,"》/Users/LQS/Desktop/data.gtf"); # Output gtf file

after running: perl XXX.pl
I used the data.gtf for the dexseq_annotation_prepare, the same error happened again:

Traceback (most recent call last): File "/pub40/acdguest/Qiushi/scripts/DEXseq_script/dexseq_prepare_annotation.py", line 54, in <module> f.attr['gene_id'] = f.attr['gene_id'].replace( ":", "_" ) KeyError: 'gene_id'

Could you please help me on this?? Many thanks in advance.

Qiushi Liu

score 0 · Answer 4 · 2017-06-26

I dont know if you are looking for a R based solution, but since I saw you were using R studio I will give it a shot.

1) Download from Biomart a .gz file with "Exon ID", "Transcript ID", and "Gene ID" 2) Read it on R with the read.delim() function, or read.csv() function 3) Use merge() function to merge you table with the one you DL from biomart. Code should seem something like this:

df <- merge(Cufflinkfile, biomartfile, by =  "Column.names", all.x = T)

all.x = keeps all the "Cufflinkfile" rows. all.y = same than all.x, but for "biomartfile" by = column name you want to merge

Hope it helps you, otherwise I edit.