Tophat Protocol Error Occured At Step 4
0
0
Entering edit mode
10.7 years ago
ashis.csedu ▴ 100

Hi all, I was doing differential gene and transcript expression analysis by following the Tophat protocol (published in Nature protocol in March 2012) on the same example D. melanogaster as it is mentioned in the paper. But I am confused of an error/warning message at step 4: Running Cuffmerge on all the assemblies to create a single merged transcriptome annotation using the command:

cuffmerge -g genes.gtf -s genome.fa -p 2 assemblies.txt


Here, "genes.gtf" was collected from iGenome package of Ensembl_BDGP5.25 of the species. The reference genome "genome.fa" was also from the same package. And I was using a 2-core processor, so I limitted my thread spawning to 2 at most. And the "assemblies.txt" are the list of GTF files produced at step 2 by cufflinks on the mentioned six experiments.

While executing the above command, the following error/warning message is shown and after 2/3 minutes, it completes where it supposed to be completed in almost 4 hours in my machine. That's why I was thinking that the error must be solved before going to the next step.

[Tue May 29 11:13:43 2012] Beginning transcriptome assembly merge
-------------------------------------------

[Tue May 29 11:13:43 2012] Preparing output location ./merged_asm/
[Tue May 29 11:13:46 2012] Converting GTF files to SAM
[Tue May 29 11:13:51 2012] Quantitating transcripts
Warning: Your version of Cufflinks is not up-to-date. It is recommended that you upgrade to Cufflinks v2.0.0 to benefit from the most recent features and bug fixes (http://cufflinks.cbcb.umd.edu).
Command line:
cufflinks -o ./merged_asm/ -F 0.05 -g genes.gtf -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 2 ./merged_asm/tmp/mergeSam_fileCA810u
File ./merged_asm/tmp/mergeSam_fileCA810u doesn't appear to be a valid BAM file, trying SAM...
[11:13:53] Inspecting reads and determining fragment length distribution.
Processed 11320 loci.
> Map Properties:
>    Total Map Mass: 67742.00
>    Fragment Length Distribution: Truncated Gaussian (default)
>                  Default Mean: 200
>               Default Std Dev: 80
[11:13:54] Assembling transcripts and estimating abundances.
Processed 11320 loci.
[Tue May 29 11:15:43 2012] Comparing against reference file genes.gtf
Warning: Your version of Cufflinks is not up-to-date. It is recommended that you upgrade to Cufflinks v2.0.0 to benefit from the most recent features and bug fixes (http://cufflinks.cbcb.umd.edu).
Warning: couldn't find fasta record for '2LHet'!
Warning: couldn't find fasta record for '2RHet'!
Warning: couldn't find fasta record for '3LHet'!
Warning: couldn't find fasta record for '3RHet'!
Warning: couldn't find fasta record for 'U'!
Warning: couldn't find fasta record for 'XHet'!
Warning: couldn't find fasta record for 'YHet'!
Warning: couldn't find fasta record for 'dmel_mitochondrion_genome'!
[Tue May 29 11:15:52 2012] Comparing against reference file genes.gtf  ..........


I am not sure why it could not find the fasta record for "2LHet", "2RHet", etc. What do you think would be the problem in my case? Do I have to do anything with the "genome.fa" to solve this?
Thanks.

tophat rna-seq • 6.5k views
2
Entering edit mode

Most obvious thing to first check is to see if the name of the references sequences are different between the fasta file and gtf file. For example does the fasta file have a 2LHet header?

0
Entering edit mode

Thanks for your answer. I looked for the headers in the "genome.fa" and got 7 sequences having headers - 2L,2R,3L,3R,4,M,X. And also I looked in to the "gene.gtf" and found that 46265 entries have header "2L", but only 89 "2LHet". Similarly, there are 60521 "2R" entries, and 470 "2RHet" entries. Cuffmerge did not fail entirely due to the absence of those headers, it actually skipped those entries and showed me warnings, that's all! I did not check at the first time whether it actually did produce the "merged.gtf", but now checked, it certainly did. Thanks Dk for the help. Now I'm doing step 5.

0
Entering edit mode

Glad to see this, because I was trying to run the files from that paper too. I had a question, why in procedure 1,

tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C1_R1_2.fq ＃why this is C2_R1_1.fq C1_R1_2.fq not C2_R1_1.fq C2_R1_2.fq, I thought this is a paired end dataset, so the first three commands above seems right, but I do not understand this command, and the two below. help me, please!
tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C1_R2_2.fq
tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C1_R3_2.fq

0
Entering edit mode

I think you are right. I changed the C1 into C2 for the last three lines.

0
Entering edit mode

I got the same error in tophat protocol error occurred at step 4, I knew the name of the references sequences are different between the fasta file and gtf file. Should I delete the 2LHet header and so on in the gtf file?