Question: Gene feature information missing in Stringtie merged assembly
Asked 3 months ago by waqaskhokhar999:

I have performed transcriptome assembly of multiple samples using StringTie and then used the StringTie merge command to generate a uniform set of transcripts. Compared with the TAIR10 reference annotation file, the merged assembly GTF file does not contain CDS, start_codon, stop_codon, five_prime_utr or three_prime_utr features. Is there any way to add this gene feature information (CDS, start_codon, stop_codon, five_prime_utr, three_prime_utr) to the new merged assembly?

TAIR10 reference annotation

1   araport11   gene    3631    5899    .   +   .   gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";

1   araport11   transcript  3631    5899    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

1   araport11   exon    3631    3913    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1";

1   araport11   CDS 3760    3913    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

1   araport11   start_codon 3760    3762    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

1   araport11   exon    3996    4276    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon2";

1   araport11   CDS 3996    4276    .   +   2   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

1   araport11   exon    4486    4605    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon3";

1   araport11   CDS 4486    4605    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

1   araport11   exon    4706    5095    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon4";

1   araport11   CDS 4706    5095    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

1   araport11   exon    5174    5326    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon5";

1   araport11   CDS 5174    5326    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

1   araport11   exon    5439    5899    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon6";

1   araport11   CDS 5439    5627    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

1   araport11   stop_codon  5628    5630    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

1   araport11   five_prime_utr  3631    3759    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

1   araport11   three_prime_utr 5631    5899    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

StringTie merged assembly (StringTie version 1.3.5)

1   StringTie   transcript  3631    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; gene_name "NAC001"; ref_gene_id "AT1G01010"; 

1   StringTie   exon    3631    3913    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; ref_gene_id "AT1G01010"; 

1   StringTie   exon    3996    4276    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; ref_gene_id "AT1G01010"; 

1   StringTie   exon    4486    4605    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; ref_gene_id "AT1G01010"; 

1   StringTie   exon    4706    5095    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; ref_gene_id "AT1G01010"; 

1   StringTie   exon    5174    5326    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; ref_gene_id "AT1G01010"; 

1   StringTie   exon    5439    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; ref_gene_id "AT1G01010";
Answered 3 months ago by Juke-34 (Sweden):

As you said, StringTie performs transcriptome assembly; it does not perform gene annotation. A transcript can be non-coding, in which case it has no CDS, start_codon, stop_codon, five_prime_utr or three_prime_utr features. When it is coding, you have to define the gene structure: where the coding region starts, where it stops, and in which reading frame. A single transcript can contain several ORFs. The easiest way to go would be to use a tool like TransDecoder; see the section of its documentation titled "Starting from a genome-based transcript structure GTF file (eg. cufflinks or stringtie)".
Otherwise, you could run an evidence-based annotation (based on your transcripts) using an annotation tool (e.g. MAKER).
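
To illustrate what "defining the gene structure" means in practice, here is a minimal Python sketch that projects an ORF, given in transcript coordinates, back onto the exons from the StringTie GTF in the question. The function names are hypothetical, it only handles plus-strand transcripts, and the ORF coordinates (130 to 1419) are back-calculated from the reference CDS in the question rather than predicted by any tool. It is not how TransDecoder or MAKER work internally; it only shows the coordinate arithmetic behind the missing features.

```python
# Exons of AT1G01010.1 as listed in the StringTie GTF in the question.
EXONS = [(3631, 3913), (3996, 4276), (4486, 4605),
         (4706, 5095), (5174, 5326), (5439, 5899)]


def to_genome(exons, pos):
    """Map a 1-based transcript position to a genomic coordinate (plus strand)."""
    offset = pos
    for start, end in exons:
        length = end - start + 1
        if offset <= length:
            return start + offset - 1
        offset -= length
    raise ValueError("position lies beyond the end of the transcript")


def clip(exons, lo, hi):
    """Return the exon segments that overlap the genomic interval [lo, hi]."""
    return [(max(s, lo), min(e, hi)) for s, e in exons if s <= hi and e >= lo]


def annotate_plus_strand(exons, orf_start, orf_end):
    """Derive GTF-style features from an ORF in transcript coordinates.

    orf_start is the first base of the start codon and orf_end the last
    base of the stop codon. As in the reference GTF in the question, the
    stop codon is reported separately and is not part of the CDS.
    """
    exons = sorted(exons)
    first_cds = to_genome(exons, orf_start)
    last_cds = to_genome(exons, orf_end - 3)
    stop_first = to_genome(exons, orf_end - 2)
    stop_last = to_genome(exons, orf_end)

    cds = clip(exons, first_cds, last_cds)

    # Phase: number of bases to skip to reach the first complete codon.
    phases, coded = [], 0
    for start, end in cds:
        phases.append((3 - coded % 3) % 3)
        coded += end - start + 1

    return {
        "five_prime_utr": clip(exons, exons[0][0], first_cds - 1),
        "start_codon": clip(exons, first_cds, to_genome(exons, orf_start + 2)),
        "CDS": cds,
        "CDS_phase": phases,
        "stop_codon": clip(exons, stop_first, stop_last),
        "three_prime_utr": clip(exons, stop_last + 1, exons[-1][1]),
    }


# Assumed ORF: transcript positions 130..1419 (stop codon included).
features = annotate_plus_strand(EXONS, 130, 1419)
for name, value in features.items():
    print(name, value)
```

With these inputs the CDS segments, phases, codons and UTRs match the reference records for AT1G01010.1 shown in the question. For a real merged assembly the ORF coordinates would have to come from an ORF predictor, and minus-strand transcripts would need the exon order and coordinates reversed.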
