Gene feature information missing in Stringtie merged assembly
4.9 years ago

I performed transcriptome assembly of multiple samples using StringTie and then used the StringTie merge command to generate a uniform transcriptome assembly. Unlike the TAIR10 reference annotation file, the merged assembly GTF file does not contain CDS, start_codon, stop_codon, five_prime_utr, or three_prime_utr features. Is there any way to add this gene feature information (CDS, start_codon, stop_codon, five_prime_utr, three_prime_utr) to the new merged assembly?

TAIR10 reference annotation

1   araport11   gene    3631    5899    .   +   .   gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";

1   araport11   transcript  3631    5899    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

1   araport11   exon    3631    3913    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1";

1   araport11   CDS 3760    3913    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

1   araport11   start_codon 3760    3762    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

1   araport11   exon    3996    4276    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon2";

1   araport11   CDS 3996    4276    .   +   2   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

1   araport11   exon    4486    4605    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon3";

1   araport11   CDS 4486    4605    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

1   araport11   exon    4706    5095    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon4";

1   araport11   CDS 4706    5095    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

1   araport11   exon    5174    5326    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon5";

1   araport11   CDS 5174    5326    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

1   araport11   exon    5439    5899    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon6";

1   araport11   CDS 5439    5627    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

1   araport11   stop_codon  5628    5630    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

1   araport11   five_prime_utr  3631    3759    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

1   araport11   three_prime_utr 5631    5899    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

StringTie version 1.3.5

1   StringTie   transcript  3631    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; gene_name "NAC001"; ref_gene_id "AT1G01010"; 

1   StringTie   exon    3631    3913    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; ref_gene_id "AT1G01010"; 

1   StringTie   exon    3996    4276    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; ref_gene_id "AT1G01010"; 

1   StringTie   exon    4486    4605    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; ref_gene_id "AT1G01010"; 

1   StringTie   exon    4706    5095    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; ref_gene_id "AT1G01010"; 

1   StringTie   exon    5174    5326    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; ref_gene_id "AT1G01010"; 

1   StringTie   exon    5439    5899    1000    +   .   gene_id "MSTRG.1"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; ref_gene_id "AT1G01010";
RNA-Seq Assembly gtf • 2.1k views
4.9 years ago
Juke34 8.6k

StringTie performs transcriptome assembly, as you said; it does not perform gene annotation. A transcript can be non-coding, in which case it has no CDS, start_codon, stop_codon, five_prime_utr, or three_prime_utr features. When it is coding, you have to define the coding structure yourself: where translation starts, where it stops, and in which reading frame. A single transcript can contain several ORFs. The easiest way forward is probably a tool like TransDecoder, which can start from a genome-based transcript structure GTF file (e.g. from Cufflinks or StringTie).
Otherwise, you could run an evidence-based annotation, using your transcripts as evidence, with an annotation tool (e.g. MAKER).
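
To make the "where it starts, where it stops, and in which reading frame" part concrete, here is a minimal Python sketch of how an ORF located on the spliced transcript can be projected back onto the exons to give genomic CDS segments and their phases. It only illustrates the idea and is not how TransDecoder or MAKER are implemented. The function name `project_orf_to_genome` is made up for this example, and the ORF coordinates (129 to 1416 on the transcript) are an assumption derived from the reference CDS in the question; in practice they would come from an ORF predictor.

```python
from typing import List, Tuple

# Exons of AT1G01010.1 as listed in the StringTie output above
# (1-based, inclusive genomic coordinates, as in GTF).
EXONS = [
    (3631, 3913), (3996, 4276), (4486, 4605),
    (4706, 5095), (5174, 5326), (5439, 5899),
]


def project_orf_to_genome(
    exons: List[Tuple[int, int]],
    orf_start: int,
    orf_end: int,
    strand: str = "+",
) -> List[Tuple[int, int, int]]:
    """Map an ORF given on the spliced transcript onto genomic coordinates.

    orf_start and orf_end are 0-based, half-open offsets on the transcript,
    counted 5'->3'. Returns (start, end, phase) tuples in transcript order,
    using 1-based inclusive genomic coordinates.
    """
    # Walk the exons in transcript (5'->3') order.
    ordered = sorted(exons, reverse=(strand == "-"))
    segments = []
    consumed = 0   # transcript bases lying before the current exon
    cds_bases = 0  # coding bases emitted so far, used for the phase
    for ex_start, ex_end in ordered:
        ex_len = ex_end - ex_start + 1
        lo = max(orf_start, consumed)
        hi = min(orf_end, consumed + ex_len)
        if lo < hi:  # the ORF overlaps this exon
            off_lo, off_hi = lo - consumed, hi - consumed
            if strand == "+":
                g_start, g_end = ex_start + off_lo, ex_start + off_hi - 1
            else:
                g_start, g_end = ex_end - off_hi + 1, ex_end - off_lo
            # GTF phase: bases to skip to reach the next full codon.
            phase = (3 - cds_bases % 3) % 3
            segments.append((g_start, g_end, phase))
            cds_bases += hi - lo
        consumed += ex_len
    return segments


# Assumed ORF: transcript offsets 129..1416 (stop codon excluded).
cds = project_orf_to_genome(EXONS, 129, 1416, "+")
for start, end, phase in cds:
    print(f"1\tprojected\tCDS\t{start}\t{end}\t.\t+\t{phase}")
```

With that assumed ORF, the projected segments should line up with the CDS rows of the reference annotation shown in the question. Anything in the exons before the first segment or after the stop codon corresponds to the UTRs.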
