Question

Cuffdiff results do not contain gene names, and the gene IDs are of an unknown identifier type

6.0 years ago

adoorprabha • 0

Hi all,

I ran the basic Tuxedo pipeline for my RNA-seq analysis. My experiment has no biological replicates for the conditions used, so I did not merge the transcripts; instead I ran Cuffdiff directly on the Cufflinks results. The Cufflinks results contained gene short names, and the gene IDs were in this format:

gene30247
gene30249
gene30253
gene30254
gene30255

The gene short names were in this format:

LOC104645354
LOC101266925
LOC101266621
ARF6
LOC104645357

I used the NCBI gene annotation file for Solanum lycopersicum in GFF3 format. Since the Cufflinks results contained the names shown above, I expected to get them in the next step, Cuffdiff, as well. However, the Cuffdiff results contained only gene IDs in the format shown above, and no gene names.

I also tried these tools with the Ensembl gene annotation file for Solanum lycopersicum, but because that annotation file gives the seqname as 1 instead of a chromosome name, the run fails with the following error:

"Fatal error: Tool execution failed

[2018-05-09 04:48:58] Beginning TopHat run (v2.0.9)
-----------------------------------------------
[2018-05-09 04:48:58] Checking for Bowtie
          Bowtie version:    2.1.0.0
[2018-05-09 04:48:58] Checking for Samtools
        Samtools version:    0.1.18.0
[2018-05-09 04:48:58] Checking for Bowtie index files (genome)..
[2018-05-09 04:48:58] Checking for reference FASTA file
[2018-05-09 04:48:58] Generating SAM header for genome
    format:      fastq
    quality scale:   phred33 (default)
[2018-05-09 04:49:00] Reading known junctions from GTF file
[2018-05-09 04:49:05] Preparing reads
     left reads: min. length=101, max. length=101, 93217002 kept reads (14581 discarded)
    right reads: min. length=101, max. length=101, 93211602 kept reads (19981 discarded)
[2018-05-09 05:50:34] Building transcriptome data files..
[2018-05-09 05:50:39] Building Bowtie index from dataset_25063518.fa
    [FAILED]
Error: Couldn't build bowtie index with err = 1 "

The GTF file from Ensembl looked like this:

1   itag    five_prime_UTR  3067813 3068083 .   +   .   Parent=transcript:Solyc01g009060.2.1
1   itag    exon    3067813 3068490 .   +   .   Parent=transcript:Solyc01g009060.2.1;Name=Solyc01g009060.2.1.1;constitutive=1;ensembl_end_phase=2;ensembl_phase=-1;exon_id=Solyc01g009060.2.1.1;rank=1
1   itag    CDS 3068084 3068490 .   +   0   ID=CDS:Solyc01g009060.2.1;Parent=transcript:Solyc01g009060.2.1;protein_id=Solyc01g009060.2.1
1   itag    exon    3071400 3071745 .   +   .   Parent=transcript:Solyc01g009060.2.1;Name=Solyc01g009060.2.1.2;constitutive=1;ensembl_end_phase=0;ensembl_phase=2;exon_id=Solyc01g009060.2.1.2;rank=2
1   itag    CDS 3071400 3071745 .   +   1   ID=CDS:Solyc01g009060.2.1;Parent=transcript:Solyc01g009060.2.1;protein_id=Solyc01g009060.2.1
1   itag    exon    3071993 3072269 .   +   .   Parent=transcript:Solyc01g009060.2.1;Name=Solyc01g009060.2.1.3;constitutive=1;ensembl_end_phase=1;ensembl_phase=0;exon_id=Solyc01g009060.2.1.3;rank=3
1   itag    CDS 3071993 3072269 .   +   0   ID=CDS:Solyc01g009060.2.1;Parent=transcript:Solyc01g009060.2.1;protein_id=Solyc01g009060.2.1
1   itag    CDS 3074289 3074476 .   +   2   ID=CDS:Solyc01g009060.2.1;Parent=transcript:Solyc01g009060.2.1;protein_id=Solyc01g009060.2.1
1   itag    exon    3074289 3074739 .   +   .   Parent=transcript:Solyc01g009060.2.1;Name=Solyc01g009060.2.1.4;constitutive=1;ensembl_end_phase=-1;ensembl_phase=1;exon_id=Solyc01g009060.2.1.4;rank=4
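The attribute column in the rows above uses `key=value` pairs (`Parent=...`, `ID=...`), which is GFF3 syntax; GTF writes attributes as `key "value";` pairs instead. A quick check of which style a file uses can be sketched as follows. The function name `classify_attrs` is hypothetical, and the check only looks at the first data row, so treat it as a rough heuristic rather than a validator.

```shell
#!/bin/sh
# classify_attrs (hypothetical helper): reads annotation rows on stdin and
# prints "gff3", "gtf", or "unknown" based on column 9 of the first data row.
classify_attrs() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }                 # skip comments and short rows
    {
      if ($9 ~ /^[A-Za-z_]+=/)        print "gff3"     # key=value
      else if ($9 ~ /^[A-Za-z_]+ "/)  print "gtf"      # key "value";
      else                            print "unknown"
      found = 1
      exit
    }
    END { if (!found) print "unknown" }'
}

# One row in the style shown in the post (tab-separated).
printf '1\titag\texon\t3067813\t3068490\t.\t+\t.\tParent=transcript:Solyc01g009060.2.1\n' | classify_attrs
```

With a real file this would be run as `classify_attrs < annotation_file`.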

The annotation file I used came from the following link: ftp://ftp.ensemblgenomes.org/pub/release-39/

For these reasons I proceeded with the annotation file obtained from NCBI Genome. I got results with gene IDs of an unknown identifier type. Can anyone help me find the valid identifier type for gene IDs such as gene30247 and gene30249? I would be grateful for any help. Also, please let me know why I am not getting gene names in my Cuffdiff results, even though I used a valid NCBI annotation file in the Cufflinks step and gave the assembled transcripts from Cufflinks to Cuffdiff as the GTF input.
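Identifiers of the form geneNNNNN look like the file-local `ID=` attribute that a GFF3 file assigns to each gene row, not a database accession. Assuming the gene rows of the NCBI file also carry `Name=` and `Dbxref=GeneID:...` attributes (worth confirming in the actual file), a lookup table from ID to name and Gene ID can be sketched like this. The sample rows and all values in them are invented for illustration.

```shell
#!/bin/sh
# Sketch: build an ID -> Name -> GeneID table from the gene rows of a GFF3 file.
# The rows below are made-up placeholders; point "gff" at a real file instead.
workdir=$(mktemp -d)
gff="$workdir/sample.gff3"
printf 'chrA\tExample\tgene\t100\t900\t.\t+\t.\tID=gene1;Dbxref=GeneID:111;Name=LOC111;gbkey=Gene\n' > "$gff"
printf 'chrA\tExample\tmRNA\t100\t900\t.\t+\t.\tID=rna1;Parent=gene1;gbkey=mRNA\n' >> "$gff"
printf 'chrA\tExample\tgene\t2000\t2900\t.\t-\t.\tID=gene2;Dbxref=GeneID:222,Other:xyz;Name=EXAMPLE2\n' >> "$gff"

gene_table=$(awk -F'\t' '
  $3 == "gene" {
    id = "NA"; name = "NA"; geneid = "NA"
    n = split($9, attrs, ";")                 # attributes are ;-separated
    for (i = 1; i <= n; i++) {
      if (attrs[i] ~ /^ID=/)   id   = substr(attrs[i], 4)
      if (attrs[i] ~ /^Name=/) name = substr(attrs[i], 6)
      if (attrs[i] ~ /^Dbxref=/) {
        # Dbxref may hold several comma-separated cross-references
        m = split(substr(attrs[i], 8), refs, ",")
        for (j = 1; j <= m; j++)
          if (refs[j] ~ /^GeneID:/) geneid = substr(refs[j], 8)
      }
    }
    print id "\t" name "\t" geneid
  }' "$gff")

printf '%s\n' "$gene_table"
```

The resulting table could then be matched against the gene ID column of the Cuffdiff output, assuming that column holds the same `ID=` values.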

Thanks,

Prabha

gene • 3.1k views

ADD COMMENT • link updated 5.9 years ago by Biostar 20 • written 6.0 years ago by adoorprabha • 0

You should know that the old 'Tuxedo' pipeline of TopHat(2) and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated or in low maintenance and should be replaced by HISAT2, StringTie and Ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using salmon.

Please stop using Tophat https://t.co/Es4ohxOEyx Cole and I developed the method in *2008*. It was greatly improved in TopHat2 then HISAT & HISAT2. There is no reason to use it anymore. I have been saying this for years yet it has more citations this year than last #methodsmatter
— Lior Pachter (@lpachter) December 2, 2017

ADD REPLY • link 6.0 years ago by WouterDeCoster 47k

I have also used the HISAT pipeline, but when I ran htseq-count with the same annotation file it reported an error. When I instead gave htseq-count the assembled transcripts file from StringTie, it ran successfully and generated three output files: a BAM file, a no-feature file, and a count data file. I used the count data file for DESeq2. Since I didn't have biological replicates, I ran DESeq2 on just the count data of the two conditions, and it reported an error. I searched for the reason for the error and also reported a bug through Galaxy, but I didn't get any response. Could you help me figure out the reason for the error in DESeq2?

ADD REPLY • link 6.0 years ago by adoorprabha • 0

I'll post a longer reply later, but to start with:

It's hard to guess what you are asking for
You need to be A LOT more precise about the errors you are getting. I can't see your screen so I can't help if you don't share information
Doing differential expression analysis without replicates is INVALID and USELESS

ADD REPLY • link 6.0 years ago by WouterDeCoster 47k

I used the genome.fa for Solanum from NCBI Genome, and the corresponding GFF3 file was downloaded from the same page.

Please visit the page: https://www.ncbi.nlm.nih.gov/genome/?term=solanum+lycopersicum

I cross-checked both files and the chromosome names match.

I used the older version of TopHat (version 0.7 on usegalaxy.org) for mapping and provided the NCBI GFF3 annotation file for Solanum as the gene annotation model.

In the next step I provided the same annotation file, and it produced results with gene short names and gene_ids of an unknown identifier type. When I provided the assembled transcripts from Cufflinks to Cuffdiff, it ran successfully, but the results didn't contain gene names or gene IDs of a valid identifier type.

ADD REPLY • link 6.0 years ago by adoorprabha • 0

Stop.

You appear to be trying to answer a question with a dataset that cannot possibly answer it (due to no replicates). Stop trying endless combinations of tools and simply throw the data away. You will never get anything that will pass peer review from this.

ADD REPLY • link 6.0 years ago by Devon Ryan 104k

This reply is not relevant here. Wouter asked you completely different questions.

You said DESeq2 gave you errors - what errors exactly?
Wouter says performing DE analysis without replicates is not the right thing to do, but you are doing it anyway. You should address his point and tell him why you chose to go ahead when you don't have data critical to the analysis, and how you're compensating for this protocol change.

If I were better at RNAseq, I'd frame my questions better, but all I know is that there is a communication gap here.

ADD REPLY • link 6.0 years ago by Ram 43k

Tuxedo protocol 1 is obsolete; Tuxedo protocol 2 (HISAT2-StringTie-Ballgown) is the protocol preferred and suggested by the authors. That being said, check the chromosome names in the reference FASTA and the reference GTF. They should match. If they don't, TopHat throws such errors (a mismatch is one of many possible causes). Could you please post the output of:

$ grep ">" reference.fa | head

If the headers contain chr1, chr2, etc. instead of 1, 2, etc., then the chromosome names in the reference and the GTF do not match.
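The comparison can also be done mechanically rather than by eye. The following is a minimal sketch that lists sequence names found in only one of the two files; the file names and the sample contents are placeholders invented for illustration, so substitute the real reference FASTA and annotation file.

```shell
#!/bin/sh
# Sketch: compare sequence names in a FASTA file with column 1 of an annotation file.
# Sample files are made up; replace "fasta" and "annot" with the real paths.
export LC_ALL=C                      # same collation for sort and comm
workdir=$(mktemp -d)
fasta="$workdir/reference.fa"
annot="$workdir/annotation.gff3"
printf '>chr1 example header\nACGT\n>chr2 example header\nACGT\n' > "$fasta"
printf '##gff-version 3\n1\titag\texon\t1\t4\t.\t+\t.\tParent=transcript:t1\n2\titag\texon\t1\t4\t.\t+\t.\tParent=transcript:t2\n' > "$annot"

# FASTA names: text after ">" up to the first whitespace
grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$workdir/fasta_names.txt"
# Annotation names: first column of every non-comment row
grep -v '^#' "$annot" | cut -f1 | sort -u > "$workdir/annot_names.txt"

only_in_fasta=$(comm -23 "$workdir/fasta_names.txt" "$workdir/annot_names.txt")
only_in_annot=$(comm -13 "$workdir/fasta_names.txt" "$workdir/annot_names.txt")
shared=$(comm -12 "$workdir/fasta_names.txt" "$workdir/annot_names.txt")

printf 'Only in FASTA:\n%s\n' "$only_in_fasta"
printf 'Only in annotation:\n%s\n' "$only_in_annot"
if [ -z "$shared" ]; then
  echo "No sequence names in common: the two files use different naming."
fi
```

If the shared list is empty or much shorter than expected, the names need to be reconciled in one of the files before the aligner can use the annotation.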

ADD REPLY • link 6.0 years ago by cpad0112 21k

I used the genome.fa for Solanum from NCBI Genome, and the corresponding GFF3 file was downloaded from the same page.

Please visit the page: https://www.ncbi.nlm.nih.gov/genome/?term=solanum+lycopersicum

I cross-checked both files and the chromosome names match.

I used the older version of TopHat (version 0.7 on usegalaxy.org) for mapping and provided the NCBI GFF3 annotation file for Solanum as the gene annotation model.

In the next step I provided the same annotation file, and it produced results with gene short names and gene_ids of an unknown identifier type. When I provided the assembled transcripts from Cufflinks to Cuffdiff, it ran successfully, but the results didn't contain gene names or gene IDs of a valid identifier type.

ADD REPLY • link 6.0 years ago by adoorprabha • 0

Don't copy-paste comment-replies to multiple users asking you different questions, that's plain lazy.

ADD REPLY • link 6.0 years ago by Ram 43k

Hi Ram,

I am sorry, that message was posted by mistake. I intended to reply to cpad0112, but due to internet problems it went wrong and was sent to WouterDeCoster. The connection was cut because of weather problems, and I only noticed the mistake now.

ADD REPLY • link 6.0 years ago by adoorprabha • 0

The program is failing here (from the OP's log): Building Bowtie index from dataset_25063518.fa. Is this the reference file? Also look at the Cufflinks output (transcripts.gtf).
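One way to look at transcripts.gtf is to count how many transcript rows actually carry a `gene_name` attribute and to list the gene_ids that lack one. This is a sketch under the assumption that the file uses GTF-style `key "value";` attributes and `transcript` feature rows; the sample rows are invented for illustration.

```shell
#!/bin/sh
# Sketch: summarize gene_name coverage in a Cufflinks-style transcripts.gtf.
# The two rows below are made-up placeholders; point "gtf" at the real file.
workdir=$(mktemp -d)
gtf="$workdir/transcripts.gtf"
printf 'chr1\tCufflinks\ttranscript\t1\t100\t1000\t+\t.\tgene_id "gene1"; transcript_id "rna1"; gene_name "EXAMPLE1";\n' > "$gtf"
printf 'chr1\tCufflinks\ttranscript\t200\t300\t1000\t+\t.\tgene_id "gene2"; transcript_id "rna2";\n' >> "$gtf"

# Count transcript rows, and how many of them have a gene_name attribute
summary=$(awk -F'\t' '
  $3 == "transcript" {
    total++
    if ($9 ~ /gene_name "[^"]+"/) named++
  }
  END { printf "%d transcripts, %d with gene_name\n", total, named }' "$gtf")

# List gene_ids of transcript rows that have no gene_name
missing=$(awk -F'\t' '
  $3 == "transcript" && $9 !~ /gene_name "/ {
    if (match($9, /gene_id "[^"]+"/))
      print substr($9, RSTART + 9, RLENGTH - 10)   # strip gene_id " and closing quote
  }' "$gtf" | sort -u)

printf '%s\n' "$summary"
printf 'gene_ids without gene_name:\n%s\n' "$missing"
```

If most rows lack `gene_name`, the names were already lost before the Cuffdiff step, which narrows down where to look.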

ADD REPLY • link 6.0 years ago by cpad0112 21k