Question: Do the CDS, start_codon and stop_codon records in a GTF affect transcriptome assembly by StringTie?
syrttgump30 wrote:

Hi all. I am using StringTie to assemble a transcriptome from my RNA-Seq data. If I use RefSeq as the reference annotation, downloaded from the UCSC Genome Browser website, will the CDS, start_codon and stop_codon records in that GTF file affect transcriptome assembly? For example, would StringTie treat a CDS, start_codon or stop_codon record as a new exon, even though these features are just parts of exons?

The gtf file looks like this:

chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 66999639 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
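
The premise that the CDS and start_codon records are only sub-intervals of exons can be checked on coordinates alone. The following is a minimal sketch, not anything StringTie does internally: the function name `features_outside_exons` is made up for illustration, and the five records are copied by hand from the sample above.

```python
from collections import defaultdict

# The five records from the UCSC refGene sample above,
# as (feature, start, end, transcript_id).
records = [
    ("start_codon", 67000042, 67000044, "NM_032291"),
    ("CDS", 67000042, 67000051, "NM_032291"),
    ("exon", 66999639, 67000051, "NM_032291"),
    ("CDS", 67091530, 67091593, "NM_032291"),
    ("exon", 67091530, 67091593, "NM_032291"),
]


def features_outside_exons(records):
    """Return non-exon records that are not fully contained in any exon
    of the same transcript (hypothetical helper, for illustration only)."""
    exons = defaultdict(list)
    for feature, start, end, tid in records:
        if feature == "exon":
            exons[tid].append((start, end))

    outside = []
    for feature, start, end, tid in records:
        if feature == "exon":
            continue
        contained = any(s <= start and end <= e for s, e in exons[tid])
        if not contained:
            outside.append((feature, start, end, tid))
    return outside


# An empty list means every CDS/start_codon interval lies inside an exon.
print(features_outside_exons(records))
```

For these five records the list is empty, so each CDS and start_codon interval falls inside an exon of the same transcript. This only checks coordinates; it says nothing about how StringTie itself treats these record types.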

modified 3.3 years ago by Amitm1.6k • written 3.3 years ago by syrttgump30
Amitm1.6k wrote:

Hi, please do not use this GTF file from UCSC. A GTF file records not only the individual exons of each transcript isoform but also which transcripts originate from which gene. Notice that gene_id and transcript_id are identical in the GTF above, so any transcript assembler you use (like StringTie) cannot infer the transcript <-> gene relationship.

Compare this GTF structure from Ensembl:

1   protein_coding  exon    874655  874840  .   +   .   gene_id "ENSG00000187634"; transcript_id "ENST00000455979"; exon_number "1"; gene_name "SAMD11"; gene_biotype "protein_coding"; transcript_name "SAMD11-004"; exon_id "ENSE00002715021";
1   protein_coding  CDS 874655  874840  .   +   2   gene_id "ENSG00000187634"; transcript_id "ENST00000455979"; exon_number "1"; gene_name "SAMD11"; gene_biotype "protein_coding"; transcript_name "SAMD11-004"; protein_id "ENSP00000412228";
1   protein_coding  exon    876524  876686  .   +   .   gene_id "ENSG00000187634"; transcript_id "ENST00000455979"; exon_number "2"; gene_name "SAMD11"; gene_biotype "protein_coding"; transcript_name "SAMD11-004"; exon_id "ENSE00003477353";
1   protein_coding  CDS 876524  876686  .   +   2   gene_id "ENSG00000187634"; transcript_id "ENST00000455979"; exon_number "2"; gene_name "SAMD11"; gene_biotype "protein_coding"; transcript_name "SAMD11-004"; protein_id "ENSP00000412228";

Hope this is clear. Please use a GTF from Ensembl or Gencode.
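
To see the difference between the two files programmatically, here is a minimal sketch that parses the attribute column and flags records whose gene_id equals their transcript_id. The helper names `parse_attributes` and `ids_collide` are made up for illustration, the Ensembl attributes are shortened, and the lines are rebuilt with tab separators on the assumption that the real files are tab-delimited as GTF normally is.

```python
import re

# Matches key "value" pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(attr_field):
    """Turn 'gene_id "X"; transcript_id "Y";' into a dict (hypothetical helper)."""
    return dict(ATTR_RE.findall(attr_field))


def ids_collide(gtf_line):
    """True if the record's gene_id is present and equals its transcript_id."""
    fields = gtf_line.rstrip("\n").split("\t")
    attrs = parse_attributes(fields[8])  # column 9 holds the attributes
    gene_id = attrs.get("gene_id")
    return gene_id is not None and gene_id == attrs.get("transcript_id")


# One record from each sample in this thread, rebuilt with tab separators.
ucsc_line = "\t".join([
    "chr1", "hg19_refGene", "exon", "66999639", "67000051", "0.000000", "+", ".",
    'gene_id "NM_032291"; transcript_id "NM_032291";',
])
ensembl_line = "\t".join([
    "1", "protein_coding", "exon", "874655", "874840", ".", "+", ".",
    # Attributes shortened from the Ensembl record above.
    'gene_id "ENSG00000187634"; transcript_id "ENST00000455979"; exon_number "1";',
])

print(ids_collide(ucsc_line))     # UCSC record: the IDs are identical
print(ids_collide(ensembl_line))  # Ensembl record: the IDs differ
```

Running the same check over every line of a real file (skipping `#` comment lines) would show whether the whole annotation has this problem or only some records.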

written 3.3 years ago by Amitm1.6k

Thanks for the comment. This is very important to me. Could you tell me where I can get the RefSeq and UCSC GTF files in this format? I need all of RefSeq, UCSC and Gencode.

modified 3.3 years ago • written 3.3 years ago by syrttgump30

Is this really a problem? Is it possible that each different transcript belongs to a different gene?

written 3.3 years ago by syrttgump30

As far as I understand, a GTF in which transcript IDs are not distinct from gene IDs is a problem unless every gene has only one transcript. That is clearly not the case in humans and other higher vertebrates.

As for the GTF, if you want all known and predicted gene information, consider the Gencode Comprehensive set, which has better coverage than RefSeq alone (here is a related article) and would be as good as combining RefSeq and UCSC.

In fact, if you visit the UCSC Genome Browser, Gencode is now the default gene track.

written 3.3 years ago by Amitm1.6k

Thank you! I see the problem now: if I do differential expression analysis, I would get multiple expression values (e.g. FPKM values) for a single gene, which would make the DE results very misleading. As for the GTF files, your suggestion is that the Gencode Comprehensive set already covers all known and predicted genes, so I no longer need to consider the RefSeq and UCSC annotations. Is that right?

modified 3.3 years ago • written 3.3 years ago by syrttgump30

Yes, that's right. I too wasted time using a GTF from UCSC and then ran into the same problem: the transcript isoform-level results from Cufflinks didn't make sense.

Anyway, use the Gencode Comprehensive set and you will have better coverage.

written 3.3 years ago by Amitm1.6k