Question: How to create 'gene' and 'transcript' entries in a GTF containing only exons
0
gravatar for elsoja
11 months ago by
elsoja20
elsoja20 wrote:

The result of Cufflinks is a GTF file that only has exons. I need to create a new GTF that includes 'gene' and 'transcript' entries. Is there a automated way to do that?

Example:

FROM:

chr1    Cufflinks       exon    4807788 4807982 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1    Cufflinks       exon    4808454 4808486 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "2";
chr1    Cufflinks       exon    4828584 4828649 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "3";
chr1    Cufflinks       exon    4830268 4830315 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "4";
chr1    Cufflinks       exon    4832311 4832381 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "5";
chr1    Cufflinks       exon    4837001 4837074 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "6";
chr1    Cufflinks       exon    4839387 4839488 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "7";
chr1    Cufflinks       exon    4840956 4842827 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "8";

TO:

chr1    Cufflinks       gene    4807788 4842827 .       +       .       gene_id "XLOC_000019";
chr1    Cufflinks       transcript    4807788 4807982 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1    Cufflinks       exon    4807788 4807982 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1    Cufflinks       exon    4808454 4808486 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "2";
chr1    Cufflinks       exon    4828584 4828649 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "3";
chr1    Cufflinks       exon    4830268 4830315 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "4";
chr1    Cufflinks       exon    4832311 4832381 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "5";
chr1    Cufflinks       exon    4837001 4837074 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "6";
chr1    Cufflinks       exon    4839387 4839488 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "7";
chr1    Cufflinks       exon    4840956 4842827 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "8";
cufflinks gtf • 755 views
ADD COMMENTlink modified 4 months ago by gandrescabrera70 • written 11 months ago by elsoja20

gffread -E merged.gtf -o- > merged.gff3

~Chirag.

ADD REPLYlink written 11 months ago by Chirag Parsania430

Looks like you have not tried anything. You could explore very simple ways of achieving it, like using bedtools groupBy

cat GeneID_transcript.gtf | sed 's/[;]/\t/g' | groupBy -g 1,9 -c 4,5 -o min,max

OutPut:

chr1 gene_id "XLOC_000019" 4807788 4842827

cat GeneID_transcript.gtf | sed 's/[;]/\t/g' | groupBy -g 1,10 -c 4,5 -o min,max

chr1 transcript_id "TCONS_00000025" 4807788 4842827

You can tweak around these commands and use pipes or whatever and achieve what you are looking for. If you don't know what a tool or codes given by others is doing, better not to use blindly.

ADD REPLYlink modified 11 months ago • written 11 months ago by geek_y8.1k
7
gravatar for Chirag Parsania
11 months ago by
University of Macau
Chirag Parsania430 wrote:

What about using one line gffread command on cufflink generated merged.gtf file ?

gffread -E merged.gtf -o- > merged.gff3

This will give you exactly what you have asked.

Thanks, Chirag.

ADD COMMENTlink written 11 months ago by Chirag Parsania430

Hi,

Thanks for your answers. I am have the same problem in DEXseq. Firstly, since I don't have the gtf file of honey bee, I use gffread to convert the gff file to the gtf file. however, I had the same problem as shown below: https://support.bioconductor.org/p/96488/

Then, I followed your comment and I convert back gff file:

./gffread -E /Users/LQS/Desktop/Apis_4.5_QS.gtf -o- > /Users/LQS/Desktop/DEXseq_2nd_Apis45.gff3

After that, I stored your perl script in R studio and changed the directory of the input file.

The input file was the one strored in .txt file format and specified the directory of the output gtf file which I want to use for the dexseq_prepare_annotation

I did like this:

open(F,"/Users/LQS/Desktop/DEXseq_2nd_Apis45.txt"); # Input file open(F1,"》/Users/LQS/Desktop/data.gtf"); # Output gtf file

after running: perl XXX.pl I used the data.gtf for the dexseq_annotation_prepare, the same error happened again:

Traceback (most recent call last): File "/pub40/acdguest/Qiushi/scripts/DEXseq_script/dexseq_prepare_annotation.py", line 54, in <module> f.attr['gene_id'] = f.attr['gene_id'].replace( ":", "_" ) KeyError: 'gene_id'

Could you please help me on this?? Many thanks in advance.

Qiushi Liu

ADD REPLYlink written 4 months ago by 5140289290
4
gravatar for Persistent LABS
11 months ago by
Pune
Persistent LABS740 wrote:

Hi

A lazy solution is provided in perl. Code can be optimized if needed. Better solution must be there. It should work. Let us know if any changes needed. Input file: data.txt Output file: data.gtf

Thanks

Priyabrata

Persistent LABS

open(F,"data.txt");  # Input file
open(F1,">data.gtf"); # Output gtf file

my %gene;  # hash to store gene->transcript->exon information
my %genepos; # hash to store gene's start and end position
my %transpos; # hash to store transcript's start and end position
my %geneinfo; # hash to store information details of gene like chr, strand etc

# Parse input file and create all the data structures in the form of hash
map
{
  my $line=$_;

      # Fetch gene id
      $line=~/gene_id \"([0-9a-zA-Z_]+)\";/;
  my $geneid=$1;

     # Fetch transcript id
$line=~/transcript_id \"([0-9a-zA-Z_]+)\";/;
my $transid=$1;

my @temp=split /\s+/,$line;

push @{$gene{$geneid}{$transid}},$line;
push @{$genepos{$geneid}},@temp[3..4];
push @{$transpos{$geneid}{$transid}},@temp[3..4];
push @{$geneinfo{$geneid}},@temp[0,1,6];
}<F>;

# Print the results in gtf file
map
{
my $temp1=$_;
my @temppos=sort {$a<=>$b} @{$genepos{$temp1}};
print F1 $geneinfo{$temp1}->[0],"\t",$geneinfo{$temp1}->[1],"\tgene\t$temppos[0]\t$temppos[$#temppos]\t\.\t$geneinfo{$temp1}->[2]\t\.\tgene_id \"$temp1\";\n";
map
{
    my $temp2=$_;
    my @temppos1=sort {$a<=>$b} @{$transpos{$temp1}{$temp2}};
    print F1 $geneinfo{$temp1}->[0],"\t",$geneinfo{$temp1}->[1],"\ttranscript\t$temppos1[0]\t$temppos1[$#temppos1]\t\.\t$geneinfo{$temp1}->[2]\t\.\tgene_id \"$temp1\"; transcript_id \"$temp2\";\n";   
        map
        {
            print F1 "$_";
        }@{$gene{$temp1}{$temp2}};
    }keys %{$gene{$temp1}}
}keys %gene;
ADD COMMENTlink modified 11 months ago • written 11 months ago by Persistent LABS740
0
gravatar for 514028929
4 months ago by
5140289290
5140289290 wrote:

Hi,

Thanks for your answers. I am have the same problem in DEXseq.
Firstly, since I don't have the gtf file of honey bee, I use gffread to convert the gff file to the gtf file. however, I had the same problem as shown below: https://support.bioconductor.org/p/96488/

Then, I followed your comment and I convert back gff file:

./gffread -E /Users/LQS/Desktop/Apis_4.5_QS.gtf -o- > /Users/LQS/Desktop/DEXseq_2nd_Apis45.gff3

After that, I stored your perl script in R studio and changed the directory of the input file.

The input file was the one strored in .txt file format and specified the directory of the output gtf file which I want to use for the dexseq_prepare_annotation

I did like this:

open(F,"/Users/LQS/Desktop/DEXseq_2nd_Apis45.txt"); # Input file open(F1,"》/Users/LQS/Desktop/data.gtf"); # Output gtf file

after running: perl XXX.pl
I used the data.gtf for the dexseq_annotation_prepare, the same error happened again:

Traceback (most recent call last): File "/pub40/acdguest/Qiushi/scripts/DEXseq_script/dexseq_prepare_annotation.py", line 54, in <module> f.attr['gene_id'] = f.attr['gene_id'].replace( ":", "_" ) KeyError: 'gene_id'

Could you please help me on this?? Many thanks in advance.

Qiushi Liu

ADD COMMENTlink modified 4 months ago • written 4 months ago by 5140289290
0
gravatar for gandrescabrera
4 months ago by
gandrescabrera70 wrote:

I dont know if you are looking for a R based solution, but since I saw you were using R studio I will give it a shot.

1) Download from Biomart a .gz file with "Exon ID", "Transcript ID", and "Gene ID" 2) Read it on R with the read.delim() function, or read.csv() function 3) Use merge() function to merge you table with the one you DL from biomart. Code should seem something like this:

df <- merge(Cufflinkfile, biomartfile, by =  "Column.names", all.x = T)

all.x = keeps all the "Cufflinkfile" rows. all.y = same than all.x, but for "biomartfile" by = column name you want to merge

Hope it helps you, otherwise I edit.

ADD COMMENTlink written 4 months ago by gandrescabrera70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1179 users visited in the last hour