Question: Tophat Protocol Error Occured At Step 4
0
gravatar for ashis.csedu
6.9 years ago by
ashis.csedu80
United States
ashis.csedu80 wrote:

Hi all, I was doing differential gene and transcript expression analysis by following the Tophat protocol (published in Nature protocol in March 2012) on the same example D. melanogaster as it is mentioned in the paper. But I am confused of an error/warning message at step 4: Running Cuffmerge on all the assemblies to create a single merged transcriptome annotation using the command:

cuffmerge -g genes.gtf -s genome.fa -p 2 assemblies.txt

Here, "genes.gtf" was collected from iGenome package of Ensembl_BDGP5.25 of the species. The reference genome "genome.fa" was also from the same package. And I was using a 2-core processor, so I limitted my thread spawning to 2 at most. And the "assemblies.txt" are the list of GTF files produced at step 2 by cufflinks on the mentioned six experiments.

While executing the above command, the following error/warning message is shown and after 2/3 minutes, it completes where it supposed to be completed in almost 4 hours in my machine. That's why I was thinking that the error must be solved before going to the next step.

[Tue May 29 11:13:43 2012] Beginning transcriptome assembly merge
-------------------------------------------

[Tue May 29 11:13:43 2012] Preparing output location ./merged_asm/
[Tue May 29 11:13:46 2012] Converting GTF files to SAM
[11:13:46] Loading reference annotation.
[11:13:46] Loading reference annotation.
[11:13:47] Loading reference annotation.
[11:13:48] Loading reference annotation.
[11:13:49] Loading reference annotation.
[11:13:49] Loading reference annotation.
[Tue May 29 11:13:51 2012] Quantitating transcripts
Warning: Your version of Cufflinks is not up-to-date. It is recommended that you upgrade to Cufflinks v2.0.0 to benefit from the most recent features and bug fixes (http://cufflinks.cbcb.umd.edu).
Command line:
cufflinks -o ./merged_asm/ -F 0.05 -g genes.gtf -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 2 ./merged_asm/tmp/mergeSam_fileCA810u 
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File ./merged_asm/tmp/mergeSam_fileCA810u doesn't appear to be a valid BAM file, trying SAM...
[11:13:51] Loading reference annotation.
[11:13:53] Inspecting reads and determining fragment length distribution.
Processed 11320 loci.                       
> Map Properties:
>    Total Map Mass: 67742.00
>    Fragment Length Distribution: Truncated Gaussian (default)
>                  Default Mean: 200
>               Default Std Dev: 80
[11:13:54] Assembling transcripts and estimating abundances.
Processed 11320 loci.                       
[Tue May 29 11:15:43 2012] Comparing against reference file genes.gtf
Warning: Your version of Cufflinks is not up-to-date. It is recommended that you upgrade to Cufflinks v2.0.0 to benefit from the most recent features and bug fixes (http://cufflinks.cbcb.umd.edu).
Warning: couldn't find fasta record for '2LHet'!
Warning: couldn't find fasta record for '2RHet'!
Warning: couldn't find fasta record for '3LHet'!
Warning: couldn't find fasta record for '3RHet'!
Warning: couldn't find fasta record for 'U'!
Warning: couldn't find fasta record for 'XHet'!
Warning: couldn't find fasta record for 'YHet'!
Warning: couldn't find fasta record for 'dmel_mitochondrion_genome'!
[Tue May 29 11:15:52 2012] Comparing against reference file genes.gtf  ..........

I am not sure why it could not find the fasta record for "2LHet", "2RHet", etc. What do you think would be the problem in my case? Do I have to do anything with the "genome.fa" to solve this?
Thanks.

tophat rna-seq • 5.2k views
ADD COMMENTlink modified 4.1 years ago by jhluo21660 • written 6.9 years ago by ashis.csedu80
2

Most obvious thing to first check is to see if the name of the references sequences are different between the fasta file and gtf file. For example does the fasta file have a 2LHet header?

ADD REPLYlink written 6.9 years ago by Damian Kao15k

Thanks for your answer. I looked for the headers in the "genome.fa" and got 7 sequences having headers - 2L,2R,3L,3R,4,M,X. And also I looked in to the "gene.gtf" and found that 46265 entries have header "2L", but only 89 "2LHet". Similarly, there are 60521 "2R" entries, and 470 "2RHet" entries. Cuffmerge did not fail entirely due to the absence of those headers, it actually skipped those entries and showed me warnings, that's all! I did not check at the first time whether it actually did produce the "merged.gtf", but now checked, it certainly did. Thanks Dk for the help. Now I'm doing step 5.

ADD REPLYlink written 6.9 years ago by ashis.csedu80

glad to see this, because i was trying to run the files from that paper too. i had a question, why in procedure 1,

$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq

$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq

$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq

$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C1_R1_2.fq #why this is C2_R1_1.fq C1_R1_2.fq not C2_R1_1.fq C2_R1_2.fq, i thought this is a pairend dataset, so the first three commands above seems right, but i do not understand this command, and the two below. help me, please!

$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C1_R2_2.fq

$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C1_R3_2.fq

ADD REPLYlink written 5.9 years ago by mad.cichlids100

I think you are right. I changed the C1 into C2 for the last three lines.

ADD REPLYlink written 5.9 years ago by henryshm1220
0
gravatar for jhluo2166
4.1 years ago by
jhluo21660
jhluo21660 wrote:

I got the same error in tophat protocol error occored at step 4,I knew the name of the references sequences are different between the fasta file and gtf file,Should I delete the 2LHet header and so on in the gtf file?

 

ADD COMMENTlink written 4.1 years ago by jhluo21660
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 690 users visited in the last hour