Question: How to annotate RNA-seq analysis output using the tuxedo suite?
1
gravatar for blubelle92
5.7 years ago by
blubelle9210
Egypt
blubelle9210 wrote:

I am working with an arodopsis dataset ( two controls and 2 treatment, each with three replicates)and i am performing rna-seq analysis on the dataset. the steps performed were as follows

-building the index using bowtie2

-followed by aligning the reads to a reference using

tophat -o C06_0_NEW -p 20 -G TAIR10.gtf TAIR10.fa At_C06_0.fastq.bz2

-cufflinks to create transcripts for each replicate

cufflinks -o C06_0_cufflinks -g TAIR10.gtf C06_0_out/accepted_hits.bam 

-cufmerge t merge the transcripts of all the replicates into one master transcript

cuffmerge -o cuffmerge_out -s Tindex.fa -g TAIR10.gtf gtf_files.txt

-cuffdiff to analyze the deferentially expressed genes 

    cuffdiff -o C06_cuffdiff_out -b TAIR10.fa -p 20 -L C06_0,C06_1,C06_2 cuffmerge_out/merged.gtf C06_0_out/accepted_hits.bam C06_1_out/accepted_hits.bam C06_2_out/accepted-hits.bam

 

The problem that i am encountering is that the output of the cuffdiff tool is one of two things

1) a table with no locus annotation for everything but contains vallues. 

or

2) a table with locus annotation but  contains no values.

I have tried encorporating the reference annotation into the commands but that seems to have no effect on the output, i have been stuck  at this point for the better part of three weeks and i can't seem to find the solution. this is not an assignment or a project for a class, were i am trying to find an easy solution to get the grades, this is my masters thesis work and i could  really use some insight, so if anyone could direct me to what i am doing wrong it would be greatly apppreciated.

ThankYou.

 

rna-seq • 3.2k views
ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by blubelle9210

Did you provide the annotated GTF with cuffmerge? 

ADD REPLYlink written 5.7 years ago by andrew.j.skelton736.1k

yes, i just forgot to add the altered command in the question.

the result of adding the annotated gtf file was annotated genes but with no values when run through cuffdiff.

ADD REPLYlink written 5.7 years ago by blubelle9210

is that your complete cuffdiff command? If it is, I'd imagine that it's having trouble, because you're telling it to test three sample types in singlet. it should be

blah.gtf condA_rep1.bam,condA_rep2.bam condB_rep1.bam,condB_rep2.bam ... if I remember rightly... 

ADD REPLYlink written 5.7 years ago by andrew.j.skelton736.1k
0
gravatar for cyril-cros
5.7 years ago by
cyril-cros910
France
cyril-cros910 wrote:

Sorry to hear your troubles, I am really not an expert on Cufflinks but I did a few trials with it. I do have one question: I see 3 samples in your command line, but you say you use two controls and two treatments- did you merge the controls, are they any different from one another?
According to http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html, your command should be along the line of:

cuffdiff <some options> sample1_rep1.bam,sample1_rep2.bam,sample1_rep3.bam ... sample4_rep1.bam,sample4_rep2.bam,sample4_rep3.bam 

Depending on the cufflinks version you use, you may wish to use Cuffquant before.

I also don't know if cuffdiff executes successfully in your case; could you tell us if you have have the various .diff files expected in the output, or post some short examples of the outputs and the name of the files you get?
Also, Cole Trapnell (author of Cufflinks) seems to be a nice guy and might be able to help you...

When you are done with cuffdiff, consider using CummeRbund. You can also try some other R packages (but you leave the Tuxedo suite).

Another simple solution is to go to https://usegalaxy.org/ and use the RNA Seq tools - the graphical interface is easy.

 

ADD COMMENTlink written 5.7 years ago by cyril-cros910
0
gravatar for blubelle92
5.7 years ago by
blubelle9210
Egypt
blubelle9210 wrote:

by merging the controls are you referring to the transcripts??  the controls and the treatment only differ in the following there is a 6 hour control and a 24 hour control this is true for the treatment both a 6 hour and a 24 hour. the  command i included is just an example of what was performed for each of the samples alone. ( so for each sample all three replicates were analyzed seperate from the others and then merged using cuffmerge) is that not correct??

cuffdiff does execute and produces outputs such as cds.diff,  cds_exp.diff,  gene_exp.diff , isoforms_exp.diff,  promoters.diff and a variety of other files including count-tracking files, FPKM_tracking files etc.

CummeRbound is the next step that will be performed but doesnt it require some sort of values in the cuffdiff output to be able to perform correctly at this rate there are no values in the files or just plain 0's. seems kinda pointless to me unless you are referring to using the generated counts and taking them to CummeRbound to find the differentially expressed genes directly without using any of the Tuxedo suite?

I appreciate your help.

 

ADD COMMENTlink written 5.7 years ago by blubelle9210

I still have some doubt about your experiment: you are comparing control vs treatment after 6 or 24 hour (so time series, not two control plus treatment A plus treatment B)?

As to the cuffdiff output: you should normally get the common gene name in your various files. If you don't cufflinks use its own identifiers ('tracking_id' like gene id, isoform id) for genes and transcripts (look at the format description on the cuffdiff page), which is also present in the annotation I think. You can try to merge these informations (using R or databases).

It is still hideously complicated to do by hand. I have never used CummeRbund BUT looking at the manual http://www.bioconductor.org/packages/3.0/bioc/vignettes/cummeRbund/inst/doc/cummeRbund-manual.pdf page 6 and especially 7:

  • "All of these output files are related to each other through their various tracking ids, but parsing through individual files to query for important result information requires both a good deal of patience and a strong grasp of commandline text manipulation. Enter cummeRbund, an R solution to aggregate, organize, and help visualize this multilayered dataset."
  • "We now also recommend that you use both the genome and gtfFile arguments to readCufflinks(). This will allow cummeRbund to archive the transcript structure information located in the .gtf file associated with your particular cuffdiff run, as well as associate these transcripts with an appropriate genome build (e.g. ’hg19’, ’mm9’, etc) so as to allow for transcript-level visualizations and future integration with other external resources."

So just create a R script, after installing cummeRbund:

library(cummeRbund)
cuff<-readCufflinks(dir="/home/blubelle92/thePlaceWhereYourCuffdiffOutputIs",gtfFile="cuffmerge_out/merged.gtf",genome="TAIR10") # I know nothing on plant genetics, use the correct genome

I have the same type of issues with my own master's thesis, and juggling between a reference annotation using Ensembl gene ids, a custom made one using custom ids and genes with various official and unofficial names is just plain tedious. Cufflinks is practical and does a lot of work for you, which is sometimes a problem. Don't trust blindly the way it counts gene expression levels or isoforms, check your data using IGV.

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by cyril-cros910
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2152 users visited in the last hour
_