How to annotate RNA-seq analysis output using the tuxedo suite?
2
1
Entering edit mode
9.0 years ago
blubelle92 ▴ 10

I am working with an Arabidopsis dataset (two controls and 2 treatment, each with three replicates) and I am performing rna-seq analysis on the dataset. the steps performed were as follows

  • building the index using bowtie2
  • followed by aligning the reads to a reference using tophat -o C06_0_NEW -p 20 -G TAIR10.gtf TAIR10.fa At_C06_0.fastq.bz2
  • cufflinks to create transcripts for each replicate (cufflinks -o C06_0_cufflinks -g TAIR10.gtf C06_0_out/accepted_hits.bam)
  • cuffmerge t merge the transcripts of all the replicates into one master transcript (cuffmerge -o cuffmerge_out -s Tindex.fa -g TAIR10.gtf gtf_files.txt)
  • cuffdiff to analyze the deferentially expressed genes (cuffdiff -o C06_cuffdiff_out -b TAIR10.fa -p 20 -L C06_0,C06_1,C06_2 cuffmerge_out/merged.gtf C06_0_out/accepted_hits.bam C06_1_out/accepted_hits.bam C06_2_out/accepted-hits.bam

The problem that I am encountering is that the output of the cuffdiff tool is one of two things

1) a table with no locus annotation for everything but contains values.

or

2) a table with locus annotation but contains no values.

I have tried incorporating the reference annotation into the commands but that seems to have no effect on the output, I have been stuck at this point for the better part of three weeks and I can't seem to find the solution. This is not an assignment or a project for a class, were I am trying to find an easy solution to get the grades, this is my masters thesis work and I could really use some insight, so if anyone could direct me to what I am doing wrong it would be greatly appreciated.

Thank You

RNA-Seq • 4.0k views
ADD COMMENT
0
Entering edit mode

Did you provide the annotated GTF with cuffmerge?

ADD REPLY
0
Entering edit mode

Yes, I just forgot to add the altered command in the question.

The result of adding the annotated gtf file was annotated genes but with no values when run through cuffdiff.

ADD REPLY
0
Entering edit mode

Is that your complete cuffdiff command? If it is, I'd imagine that it's having trouble, because you're telling it to test three sample types in singlet. it should be

blah.gtf condA_rep1.bam,condA_rep2.bam condB_rep1.bam,condB_rep2.bam ... if I remember rightly...

ADD REPLY
0
Entering edit mode
9.0 years ago
cyril-cros ▴ 950

Sorry to hear your troubles, I am really not an expert on Cufflinks but I did a few trials with it. I do have one question: I see 3 samples in your command line, but you say you use two controls and two treatments- did you merge the controls, are they any different from one another?

According to http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html, your command should be along the line of:

cuffdiff <some options> sample1_rep1.bam,sample1_rep2.bam,sample1_rep3.bam ... sample4_rep1.bam,sample4_rep2.bam,sample4_rep3.bam

Depending on the cufflinks version you use, you may wish to use Cuffquant before.

I also don't know if cuffdiff executes successfully in your case; could you tell us if you have have the various .diff files expected in the output, or post some short examples of the outputs and the name of the files you get?

Also, Cole Trapnell (author of Cufflinks) seems to be a nice guy and might be able to help you...

When you are done with cuffdiff, consider using CummeRbund. You can also try some other R packages (but you leave the Tuxedo suite).

Another simple solution is to go to https://usegalaxy.org/ and use the RNA Seq tools - the graphical interface is easy.

ADD COMMENT
0
Entering edit mode
9.0 years ago
blubelle92 ▴ 10

By merging the controls are you referring to the transcripts? The controls and the treatment only differ in the following there is a 6 hour control and a 24 hour control this is true for the treatment both a 6 hour and a 24 hour. The command I included is just an example of what was performed for each of the samples alone. ( so for each sample all three replicates were analyzed separate from the others and then merged using cuffmerge) is that not correct?

cuffdiff does execute and produces outputs such as cds.diff, cds_exp.diff, gene_exp.diff, isoforms_exp.diff, promoters.diff and a variety of other files including count-tracking files, FPKM_tracking files etc.

CummeRbund is the next step that will be performed but doesn't it require some sort of values in the cuffdiff output to be able to perform correctly at this rate there are no values in the files or just plain 0's. seems kinda pointless to me unless you are referring to using the generated counts and taking them to CummeRbund to find the differentially expressed genes directly without using any of the Tuxedo suite?

I appreciate your help.

ADD COMMENT
0
Entering edit mode

I still have some doubt about your experiment: you are comparing control vs treatment after 6 or 24 hour (so time series, not two control plus treatment A plus treatment B)?

As to the cuffdiff output: you should normally get the common gene name in your various files. If you don't cufflinks use its own identifiers ('tracking_id' like gene id, isoform id) for genes and transcripts (look at the format description on the cuffdiff page), which is also present in the annotation I think. You can try to merge these information (using R or databases).

It is still hideously complicated to do by hand. I have never used CummeRbund BUT looking at the manual page 6 and especially 7:

  • "All of these output files are related to each other through their various tracking ids, but parsing through individual files to query for important result information requires both a good deal of patience and a strong grasp of commandline text manipulation. Enter cummeRbund, an R solution to aggregate, organize, and help visualize this multilayered dataset."
  • "We now also recommend that you use both the genome and gtfFile arguments to readCufflinks(). This will allow cummeRbund to archive the transcript structure information located in the .gtf file associated with your particular cuffdiff run, as well as associate these transcripts with an appropriate genome build (e.g. 'hg19', 'mm9', etc) so as to allow for transcript-level visualizations and future integration with other external resources."

So just create a R script, after installing cummeRbund:

library(cummeRbund)
cuff<-readCufflinks(dir="/home/blubelle92/thePlaceWhereYourCuffdiffOutputIs",gtfFile="cuffmerge_out/merged.gtf",genome="TAIR10") # I know nothing on plant genetics, use the correct genome

I have the same type of issues with my own master's thesis, and juggling between a reference annotation using Ensembl gene ids, a custom made one using custom ids and genes with various official and unofficial names is just plain tedious. Cufflinks is practical and does a lot of work for you, which is sometimes a problem. Don't trust blindly the way it counts gene expression levels or isoforms, check your data using IGV.

ADD REPLY

Login before adding your answer.

Traffic: 1759 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6