Question

After run Cufflinks - cuffdiff there is no information about isoforms, TSS, CDS...

0

Entering edit mode

10.0 years ago

Paul ★ 1.5k

Dear all,

I am working on my RNA-seq data and when I following procotcol "Differential gene and transcription expression analysis of RNA-seq experiment with TopHat nad Cufflinks" I have problem after cuffdiff my samples:

When I wanna check information about cuffdiff in R - CummeRBund - there are zero values :

> cuff
CuffSet instance with:
     12 samples
     25834 genes
     0 isoforms
     0 TSS
     0 CDS
     0 promoters
     0 splicing
     0 relCDS

Why do I have zeros? I have trouble with plotting some graphs also.

Here is my syntax:

./cuffdiff -o all_samples_diff_prochazka \
    -L 159,160,165,185,207,35,65,75,91,93,155,156 \
    -b /home/iab/bundle/2.8/hg19/ucsc.hg19.fasta \
    -u all_samples_merged_prochazka/merged.gtf \
    -p 8 \
    --library-type fr-unstranded \
    -c 5 \
    /home/iab/rna_analysis/tophat-2.0.11.Linux_x86_64/159_tpout_hg19/159_accepted_hits.bam \
    /home/iab/rna_analysis/tophat-2.0.11.Linux_x86_64/160_tpout_hg19/160_accepted_hits.bam \
    /home/iab/rna_analysis/tophat-2.0.11.Linux_x86_64/165_tpout_hg19/165_accepted_hits.bam \
    /home/iab/rna_analysis/tophat-2.0.11.Linux_x86_64/185_tpout_hg19/185_accepted_hits.bam \
    /home/iab/rna_analysis/tophat-2.0.11.Linux_x86_64/207_tpout_hg19/207_accepted_hits.bam \
    /home/iab/rna_analysis/tophat-2.0.11.Linux_x86_64/35_tpout_hg19/35_accepted_hits.bam \
    /home/iab/rna_analysis/tophat-2.0.11.Linux_x86_64/65_tpout_hg19/65_accepted_hits.bam \
    /home/iab/rna_analysis/tophat-2.0.11.Linux_x86_64/75_tpout_hg19/75_accepted_hits.bam \
    /home/iab/rna_analysis/tophat-2.0.11.Linux_x86_64/91_tpout_hg19/91_accepted_hits.bam \
    /home/iab/rna_analysis/tophat-2.0.11.Linux_x86_64/93_tpout_hg19/93_accepted_hits.bam \
    /home/iab/rna_analysis/tophat-2.0.11.Linux_x86_64/155_tpout_hg19/155_accepted_hits.bam \
    /home/iab/rna_analysis/tophat-2.0.11.Linux_x86_64/156_S7_tpout/accepted_hits.156_S7.bam

Thank you for any correction or ideas.

Paul

RNA-Seq cummeRBund Cufflinks • 6.3k views

ADD COMMENT • link updated 2.6 years ago by Ram 43k • written 10.0 years ago by Paul ★ 1.5k

2

Entering edit mode

10.0 years ago

andrew.j.skelton73 6.5k

It's all good, it's a tough learning curve! So first and foremost it sounds as if it's your annotation that's giving you the trouble, so I'd download the igenomes Ensembl archive (linked previously), and take out the fasta file for the human genome, the bowtie2 index and the gene track (GTF file). That's all you'll need for this.

Make sure you've got the latest binaries installed for the tuxedo package (I don't think you need to run tophat2 anymore, just tophat).

Cufflinks - if you're looking for novel isoforms and transcripts then you need --GTF-guide on, -G just maps your transcripts to the reference.

Cuffmerge - The issue is probably because you've put in assembles.txt twice. That text file is the last parameter of the file and the program may get confused by any parameters afterwards. So take out the first assemblies.txt and that should work.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by andrew.j.skelton73 6.5k

0

Entering edit mode

Thank you so much Andrew for your response and help!! :-) I am very appreciate it! So I will try it again with the iGenome sources!

Paul.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by Paul ★ 1.5k

1

Entering edit mode

10.0 years ago

Martombo ★ 3.1k

from cufflinks manual:

Cuffdiff requires that transcripts in the input GTF be annotated with certain attributes in order to look for changes in primary transcript expression, splicing, coding output, and promoter use. These attributes are: tss_id, p_id.

Note: If an arbitrary GTF/GFF3 file is used as input (instead of the .combined.gtf file produced by Cuffcompare), these attributes will not be present, but Cuffcompare can still be used to obtain these attributes with a command like this:

cuffcompare -s /path/to/genome_seqs.fa -CG -r annotation.gtf annotation.gtf

so make sure that your GTF file has got the attributes fields you need. if it doesn't, download the one suggested by andrew.j.skelton73 or produce your own.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by Martombo ★ 3.1k

0

Entering edit mode

10.0 years ago

andrew.j.skelton73 6.5k

Did you use the same reference genome throughout? Which one did you use? What species is this for?

ADD COMMENT • link 10.0 years ago by andrew.j.skelton73 6.5k

0

Entering edit mode

Hi Andrew,

yeah I did use the same reference genome Hg19 from ucsc. I am also use merged all gtf reference files (you can see on my syntax above). I want compare expression level between two samples (one is healthy and one has cancer). Maybe found some novel isoforms and splicing and gene fusion. I am focusing on MLH1 gene (which is on 3 chromosome).

Paul.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by Paul ★ 1.5k

0

Entering edit mode

Paul, If I were you, I'd run it against the Ensembl reference genome, especially for human, I've found that it's got much better annotation. You can also download the igenomes pack which contains everything you need for running the tuxedo pipeline. https://support.illumina.com/sequencing/sequencing_software/igenome.ilmn

Especially if you're looking for novel transcripts I'd use Ensembl. Feel free to put your tophat / cufflinks commands up for us to take a look at. How many replicates do you have?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by andrew.j.skelton73 6.5k

0

Entering edit mode

Thank you Andrew for sharing your experience. Actually I am pretty newbie in RNA-seq analysis. I was also try use reference GRCh37, because they have better naming of gene (ucsc refseq is just XLOC_number for gene naming and I am prefer gene name - MLH1). But after run TopHat and Cufflinks I had issue with cuffmerge - here is my error log:

[Mon Apr 28 13:55:49 2014] Beginning transcriptome assembly merge
-------------------------------------------

[Mon Apr 28 13:55:49 2014] Preparing output location merged_GRCh37_155_156/
[Mon Apr 28 13:56:04 2014] Converting GTF files to SAM
[13:56:04] Loading reference annotation.
Error: duplicate GFF ID 'ENST00000361547' encountered!
    [FAILED]
Error: could not execute gtf_to_sam

This I was not able to find solution for this bug :-( So I was continue with hg19. So probably I should download from iGenomes reference gene and gtf file and make all pipeline again.. What do you mean replicates? And where can I find it?

HERE IS MY PIPELINE FOR ucsc.hg19 reference genome and ucsc refgene gtf file:

./tophat2 -o tp_out --library-type fr-unstranded -r 260 --fusion-search -G /home/refgene.gtf -p8 --b2-very-sensitive /home/ucsc.hg19.fasta /home/R1_001.fastq /home/R2_001.fastq --rg-id Sample --rg-sample sample --rg-library rna-seq --rg-platform Illumina
./cufflinks -p 8 --library-type fr-unstranded -o cufflinks_out_gtf -G /home/refgene.gtf /home/tophat-/accepted_hits.bam

This I repeat for all samples.

./cuffmerge -o all_samples_merged -s /home/ucsc.hg19.fasta assemblies.txt -g /home/refgene.gtf -p 8 assemblies.txt

And finally cuffdiff.

./cuffdiff -o all_samples_diff -L 1,2,3,4,5,6,7,8,9,10 -b /home/ucsc.hg19.fasta -u all_samples_merged/merged.gtf -p 8 --library-type fr-unstranded -c 5 PATH/TO/ALL/MY/BAMS

And finally I run R - CummeRbund and have issue explained above . Sorry maybe for stupid questions.

Paul.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by Paul ★ 1.5k

Ram · Accepted Answer · 2014-05-14

Hi all,

finally the solution was easy - thanks to Loyal A. Goff, who helped me with probably stupid issue.

The issue here is that your sample names are integers and R will not allow integers to be used as column names in a data.frame. The easiest solution is to re-run your cuffdiff analysis using labels for your samples that do not start with a numeric value (0-9).

So when you naming your samples - do not use just numbers - because some output from cummeRbund will not works!

Paul.