Question: How to create 'gene' and 'transcript' entries in a GTF containing only exons
1
gravatar for elsoja
2.0 years ago by
elsoja30
elsoja30 wrote:

The result of Cufflinks is a GTF file that only has exons. I need to create a new GTF that includes 'gene' and 'transcript' entries. Is there a automated way to do that?

Example:

FROM:

chr1    Cufflinks       exon    4807788 4807982 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1    Cufflinks       exon    4808454 4808486 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "2";
chr1    Cufflinks       exon    4828584 4828649 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "3";
chr1    Cufflinks       exon    4830268 4830315 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "4";
chr1    Cufflinks       exon    4832311 4832381 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "5";
chr1    Cufflinks       exon    4837001 4837074 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "6";
chr1    Cufflinks       exon    4839387 4839488 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "7";
chr1    Cufflinks       exon    4840956 4842827 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "8";

TO:

chr1    Cufflinks       gene    4807788 4842827 .       +       .       gene_id "XLOC_000019";
chr1    Cufflinks       transcript    4807788 4807982 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1    Cufflinks       exon    4807788 4807982 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "1";
chr1    Cufflinks       exon    4808454 4808486 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "2";
chr1    Cufflinks       exon    4828584 4828649 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "3";
chr1    Cufflinks       exon    4830268 4830315 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "4";
chr1    Cufflinks       exon    4832311 4832381 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "5";
chr1    Cufflinks       exon    4837001 4837074 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "6";
chr1    Cufflinks       exon    4839387 4839488 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "7";
chr1    Cufflinks       exon    4840956 4842827 .       +       .       gene_id "XLOC_000019"; transcript_id "TCONS_00000025"; exon_number "8";
cufflinks gtf • 2.5k views
ADD COMMENTlink modified 17 months ago by gandrescabrera80 • written 2.0 years ago by elsoja30

gffread -E merged.gtf -o- > merged.gff3

~Chirag.

ADD REPLYlink written 2.0 years ago by Chirag Parsania1.2k

Looks like you have not tried anything. You could explore very simple ways of achieving it, like using bedtools groupBy

cat GeneID_transcript.gtf | sed 's/[;]/\t/g' | groupBy -g 1,9 -c 4,5 -o min,max

OutPut:

chr1 gene_id "XLOC_000019" 4807788 4842827

cat GeneID_transcript.gtf | sed 's/[;]/\t/g' | groupBy -g 1,10 -c 4,5 -o min,max

chr1 transcript_id "TCONS_00000025" 4807788 4842827

You can tweak around these commands and use pipes or whatever and achieve what you are looking for. If you don't know what a tool or codes given by others is doing, better not to use blindly.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by geek_y8.8k
7
gravatar for Chirag Parsania
2.0 years ago by
Chirag Parsania1.2k
University of Macau
Chirag Parsania1.2k wrote:

What about using one line gffread command on cufflink generated merged.gtf file ?

gffread -E merged.gtf -o- > merged.gff3

This will give you exactly what you have asked.

Thanks, Chirag.

ADD COMMENTlink written 2.0 years ago by Chirag Parsania1.2k

Hi,

Thanks for your answers. I am have the same problem in DEXseq. Firstly, since I don't have the gtf file of honey bee, I use gffread to convert the gff file to the gtf file. however, I had the same problem as shown below: https://support.bioconductor.org/p/96488/

Then, I followed your comment and I convert back gff file:

./gffread -E /Users/LQS/Desktop/Apis_4.5_QS.gtf -o- > /Users/LQS/Desktop/DEXseq_2nd_Apis45.gff3

After that, I stored your perl script in R studio and changed the directory of the input file.

The input file was the one strored in .txt file format and specified the directory of the output gtf file which I want to use for the dexseq_prepare_annotation

I did like this:

open(F,"/Users/LQS/Desktop/DEXseq_2nd_Apis45.txt"); # Input file open(F1,"》/Users/LQS/Desktop/data.gtf"); # Output gtf file

after running: perl XXX.pl I used the data.gtf for the dexseq_annotation_prepare, the same error happened again:

Traceback (most recent call last): File "/pub40/acdguest/Qiushi/scripts/DEXseq_script/dexseq_prepare_annotation.py", line 54, in <module> f.attr['gene_id'] = f.attr['gene_id'].replace( ":", "_" ) KeyError: 'gene_id'

Could you please help me on this?? Many thanks in advance.

Qiushi Liu

ADD REPLYlink written 17 months ago by 5140289290
4
gravatar for Persistent LABS
2.0 years ago by
Pune
Persistent LABS740 wrote:

Hi

A lazy solution is provided in perl. Code can be optimized if needed. Better solution must be there. It should work. Let us know if any changes needed. Input file: data.txt Output file: data.gtf

Thanks

Priyabrata

Persistent LABS

open(F,"data.txt");  # Input file
open(F1,">data.gtf"); # Output gtf file

my %gene;  # hash to store gene->transcript->exon information
my %genepos; # hash to store gene's start and end position
my %transpos; # hash to store transcript's start and end position
my %geneinfo; # hash to store information details of gene like chr, strand etc

# Parse input file and create all the data structures in the form of hash
map
{
  my $line=$_;

      # Fetch gene id
      $line=~/gene_id \"([0-9a-zA-Z_]+)\";/;
  my $geneid=$1;

     # Fetch transcript id
$line=~/transcript_id \"([0-9a-zA-Z_]+)\";/;
my $transid=$1;

my @temp=split /\s+/,$line;

push @{$gene{$geneid}{$transid}},$line;
push @{$genepos{$geneid}},@temp[3..4];
push @{$transpos{$geneid}{$transid}},@temp[3..4];
push @{$geneinfo{$geneid}},@temp[0,1,6];
}<F>;

# Print the results in gtf file
map
{
my $temp1=$_;
my @temppos=sort {$a<=>$b} @{$genepos{$temp1}};
print F1 $geneinfo{$temp1}->[0],"\t",$geneinfo{$temp1}->[1],"\tgene\t$temppos[0]\t$temppos[$#temppos]\t\.\t$geneinfo{$temp1}->[2]\t\.\tgene_id \"$temp1\";\n";
map
{
    my $temp2=$_;
    my @temppos1=sort {$a<=>$b} @{$transpos{$temp1}{$temp2}};
    print F1 $geneinfo{$temp1}->[0],"\t",$geneinfo{$temp1}->[1],"\ttranscript\t$temppos1[0]\t$temppos1[$#temppos1]\t\.\t$geneinfo{$temp1}->[2]\t\.\tgene_id \"$temp1\"; transcript_id \"$temp2\";\n";   
        map
        {
            print F1 "$_";
        }@{$gene{$temp1}{$temp2}};
    }keys %{$gene{$temp1}}
}keys %gene;
ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by Persistent LABS740
0
gravatar for 514028929
17 months ago by
5140289290
5140289290 wrote:

Hi,

Thanks for your answers. I am have the same problem in DEXseq.
Firstly, since I don't have the gtf file of honey bee, I use gffread to convert the gff file to the gtf file. however, I had the same problem as shown below: https://support.bioconductor.org/p/96488/

Then, I followed your comment and I convert back gff file:

./gffread -E /Users/LQS/Desktop/Apis_4.5_QS.gtf -o- > /Users/LQS/Desktop/DEXseq_2nd_Apis45.gff3

After that, I stored your perl script in R studio and changed the directory of the input file.

The input file was the one strored in .txt file format and specified the directory of the output gtf file which I want to use for the dexseq_prepare_annotation

I did like this:

open(F,"/Users/LQS/Desktop/DEXseq_2nd_Apis45.txt"); # Input file open(F1,"》/Users/LQS/Desktop/data.gtf"); # Output gtf file

after running: perl XXX.pl
I used the data.gtf for the dexseq_annotation_prepare, the same error happened again:

Traceback (most recent call last): File "/pub40/acdguest/Qiushi/scripts/DEXseq_script/dexseq_prepare_annotation.py", line 54, in <module> f.attr['gene_id'] = f.attr['gene_id'].replace( ":", "_" ) KeyError: 'gene_id'

Could you please help me on this?? Many thanks in advance.

Qiushi Liu

ADD COMMENTlink modified 17 months ago • written 17 months ago by 5140289290
0
gravatar for gandrescabrera
17 months ago by
gandrescabrera80 wrote:

I dont know if you are looking for a R based solution, but since I saw you were using R studio I will give it a shot.

1) Download from Biomart a .gz file with "Exon ID", "Transcript ID", and "Gene ID" 2) Read it on R with the read.delim() function, or read.csv() function 3) Use merge() function to merge you table with the one you DL from biomart. Code should seem something like this:

df <- merge(Cufflinkfile, biomartfile, by =  "Column.names", all.x = T)

all.x = keeps all the "Cufflinkfile" rows. all.y = same than all.x, but for "biomartfile" by = column name you want to merge

Hope it helps you, otherwise I edit.

ADD COMMENTlink written 17 months ago by gandrescabrera80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1062 users visited in the last hour