Empty cufdiff output
2
0
Entering edit mode
4.7 years ago
Lindsay44 ▴ 20

I am trying to use Cuffdiff to analyze differential expression between two experimental groups. I used cufflinks and cuffmerge to produce the gtf assembly. When I run cuffdiff though I keep on getting empty output files. No error message comes up either.

This is the command I am running:

cuffdiff path_to_gtf path_to_data1 path_to_data2

How should I modify this command to get this to run properly?

I am only using one file for each sample group until I get the program to work. How do I incorporate the rest of the files as well?

The gene annotation file used in Cuffmerge has the p_id and tss_id. Do i have to include the gene annotation file in Cufflinks as well (as opposed to only cuffmerge)?

RNA-Seq Assembly • 1.6k views
0
Entering edit mode

0
Entering edit mode

cuffdiff c:/research/Cuffmerge/merged_asm/merged.gtf $T1$N1

T1=c:/research/alignment_data/TB1_alignment_sorted.sam

N1=c:/research/alignment_data/NT1_alignment_sorted.sam

0
Entering edit mode

Where do you specify the output directory? such as this...

cuffdiff -o cuffdiff_out path_to_gtf path_to_data1 path_to_data2

0
Entering edit mode

I just added cuffdiff_out and it produces empty output files within that folder. It seems like the data files aren't being picked up. How can I fix the command?

0
Entering edit mode

Can you clarify please: 1) you are getting output files but they contain no data or...2) You are getting no output files?

Also, can you copy and paste here the resulting terminal output.

Thanks.

0
Entering edit mode

I am getting output files that contain no data. The terminal output is one of the following (most often the first):

You are using Cufflinks v2.2.1, which is the most recent release.

OR

You are using Cufflinks v2.2.1, which is the most recent release. Error: cuffdiff requires at least 2 SAM files c:/research/alignment_data/TB1_alignment_sorted.sam: line 1: @HD: command not found c:/research/alignment_data/TB1_alignment_sorted.sam: line 2: @SQ: command not found c:/research/alignment_data/TB1_alignment_sorted.sam: line 3: @SQ: command not found

...

What can be going wrong?

Thank you.

0
Entering edit mode

Are you using transcript.gtf file and genome.fasta file for the same version of the genome? Is GTF file your merged.gtf? Did you use same version of the genome.fasta to generate sam files? Do you see output in cufflinks and cuffmerge?

0
Entering edit mode

Both the transcript.gtf file and genome.fa file were included in the same reference genome download (mm10) so I am assuming it is the same version of the genome. I have run the script including both the merged.gtf and the reference transcript.gtf, but I also ran it with just the prior. I used the same fasta file in Hisat2 to create the reference index.

I am able to get an output from cufflinks and cuffmerge.

Does it make a difference that I didn't include any of the reference information while making the assemblies in cufflinks and that I only included the reference index (as opposed to including the gtf file as well) in the alignment?

1
Entering edit mode
4.7 years ago
YaGalbi ★ 1.5k

Well cuff diff is running correctly so thats a start. It appears the problem is that cuffdiff can not find the first sam file you specified (T1) so exits with an error before looking for the second sam file. This is probably a very small syntax error either in the command or the path to the files.

I find that simplifying the command often leads you to where the problem is.

First: In your command, remove the labelling to see if that fixes the problem. If it does, then go figure out how to label properly.

So the command would now be:

cuffdiff c:/research/Cuffmerge/merged_asm/merged.gtf c:/research/alignment_data/TB1_alignment_sorted.sam c:/research/alignment_data/NT1_alignment_sorted.sam


Second: Place all 3 files (merged.gtf, TB1_alignment_sorted.sam, NT1_alignment_sorted.sam) into the same folder. Open a terminal in that folder. Then run the command but without the long paths. If this works, then the problem is with the long paths you entered...so go figure out how to do that properly.

So the command also without labels would now be very simple:

cuffdiff merged.gtf TB1_alignment_sorted.sam NT1_alignment_sorted.sam

0
Entering edit mode

That makes a lot of sense and it should work; however it looks like it isn't.

I ran the command without the labels The output is a set of empty files and the following command line message:

You are using Cufflinks v2.2.1, which is the most recent release.

Then I ran the script without the paths or labels and got this error message:

You are using Cufflinks v2.2.1, which is the most recent release. open: No such file or directory Error: cannot open file TB1_alignment_sorted.sam for reading. Unrecognized file type

I think the files are sorted correctly though:

To convert SAM file (output of alignment) to a sorted BAM file: samtoools view -bS filename.sam | samtools sort -o filename

To convert the sorted BAM file back to a SAM file: samtools view -h filename.bam > filename.sam

Does it matter that I am not using a reference file in cufflinks or cuffdiff?

1
Entering edit mode

Does it matter that no ref file is used? - I dont think so. That is how one would try to find novel transcripts - however it has been suggested on this forum that while possible, it isnt a very statistically reliable method.

Im wondering why you think it is necessary to convert the sorted bam file back into a sam file? Cuffdiff handles bam files just fine. maybe try without this step to see what happens. Something is clearly wrong with the ability to read those sam files.

Is your version of cufflinks running cuffcompare as a final step? I think cuffcompare adds some columns to thte output files that are needed for cuffdiff.

0
Entering edit mode

It worked with the sorted BAM files.

Novel genes can still be found with a reference file through, right?

Thanks.

0
Entering edit mode

Nice one!

Bear in mind i'm also new to this pipeline - i'm only expressing my understanding from multiple forum posts here after recently completing a RNAseq project for an MSc course - I suggest contacting the user genomax2 for the real theoretical stuff

I think finding novel genes with a reference genome depends on how complete the reference genome is. If the reference genome is already well annotated (e.g. rat, human) then it is unlikely.

I get the impression that this is somewhat like a jigsaw puzzle with parts of the guiding reference image missing. The more COMPLETE the reference image, the easier it is to place each jigsaw piece correctly, and the less likely you will have pieces left over at the end. The more INCOMPLETE the reference image, the more likely there will be new combinations of pieces (novel transcripts) and also that MANY of those will be mistakes (false discovery rate) so need to be validated externally.

My (limited) understanding of this is that cufflinks finds the most parsimonious assembly for all reads. Therefore if no reference annotation is supplied to cufflinks, then there is a chance of identifying transcripts that would usually be excluded if there was a reference annotation supplied.

The danger is that these novel transcripts are interpreted as reliable, when actually they are merely another set of possiblilites. By doing it this way, there is apparently a high false discovery rate. This is considered so unreliable that any attempt at publication will be rejected unless there is overwhelming supporting evidence from other novel transcript discovery methods, or corroborative evidence from other resources (e.g. ESTs etc).