Hello,
I need to create the TopHat reference index of the human genome. I therefore downloaded the GTF file (humRef.gtf) from Ensembl and used Bowtie to create an index of the human genome from a single FASTA file (dna.toplevel) containing all human chromosomes, which I renamed humRef.fa, as follows:
bowtie2-build -f ./humRef.fa humRef
This created the following files: humRef.1.bt2, humRef.2.bt2, humRef.3.bt2, humRef.4.bt2, humRef.rev.1.bt2 and humRef.rev.2.bt2.
I ran tophat with the following command:
tophat2 -G humRef.gtf --transcriptome-index=humRef_tr humRef
but I got the following error:
[2016-03-30 20:05:55] Building transcriptome files with TopHat v2.1.1
-----------------------------------------------
[2016-03-30 20:05:55] Checking for Bowtie
Bowtie version: 2.2.6.0
[2016-03-30 20:06:44] Checking for Bowtie index files (genome)..
[2016-03-30 20:06:44] Checking for reference FASTA file
[2016-03-30 20:06:44] Building transcriptome data files humRef_tr/humRef
[2016-03-30 20:09:02] Building Bowtie index from humRef.fa
[FAILED]
Error: Couldn't build bowtie index with err = 1
Could you please tell me what is wrong? It must be a syntax issue, but I don't know which one. I tried providing paths to the files, and also changing the names of the index files, but nothing worked. All files are in the same directory.
I also tried removing the .fa extension of the humRef.fa file, following a suggestion from the internet, but the output was:
[2016-03-30 20:30:49] Checking for reference FASTA file
Warning: Could not find FASTA file humRef.fa
[2016-03-30 20:30:49] Reconstituting reference FASTA file from Bowtie index
which I don't think is an efficient solution, since the file is available.
Thank you,
L
The syntax of your special TopHat run command is correct. This problem may be caused by the unfortunate choice of using the prefix of the FASTA file name as your genome index name. Are all these files in the current working directory? Are you running all commands from the same user account?
Actually, I started by separating the files by suffixes: humRef.fa for the reference FASTA, then _idx for the Bowtie index and _tr for the TopHat output. I got the same error, so I thought the files had to have the same names in order to be recognized by TopHat. The files are all in the same directory and the commands are run by the same user, one after the other.
I think TopHat expects the transcriptome index to be in a separate directory, so possibly you should provide a directory name as well as a prefix name for the transcriptome index files.
But I also tried
tophat2 -G ./humRef.gtf --transcriptome-index=humRef_tr ./humRef
to indicate that the files were in the working directory, without success. I also copied all the index/fa/gtf files into a humRef directory so that TopHat could find the FASTA file wherever it looked, but the error was the same.
Can you try creating the transcriptome index somewhere else? Like this
I tried with
... -index=./top/humRef_tr ...
and the directory top was created. Inside it, the usual files (humRef_th.fa, humRef_th.fa.tlst, humRef_th.gff, humRef_th.ver) were created, but then the message
Building Bowtie index from humRef_th.fa [FAILED]
appeared.
I think part of the problem is that, if everything works correctly, you should end up with a Bowtie index for the genome AND a different Bowtie index for the transcriptome (and also a FASTA file for the genome AND another FASTA for the transcriptome), and naming everything humRef just makes things very confusing.
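To make the separation concrete, here is a minimal sketch of the two steps with the transcriptome index kept in its own directory under its own prefix. The name transcriptome/known is a hypothetical choice made only for illustration; the genome prefix humRef, the file names and the flags are the ones already used in this thread. The script only prints the commands, so they can be reviewed before anything is run, and it is not a guaranteed fix for the error above.

```shell
#!/bin/sh
# Names taken from the thread.
GENOME_FA="humRef.fa"      # genome FASTA
GENOME_IDX="humRef"        # genome Bowtie2 index prefix (next to humRef.fa)
ANNOTATION="humRef.gtf"    # Ensembl annotation

# Hypothetical name (an assumption): separate directory and prefix
# for the transcriptome files, so nothing else is called humRef.
TRANSCRIPTOME_IDX="transcriptome/known"

# Print the two commands instead of running them.
print_commands() {
    echo "bowtie2-build -f $GENOME_FA $GENOME_IDX"
    echo "tophat2 -G $ANNOTATION --transcriptome-index=$TRANSCRIPTOME_IDX $GENOME_IDX"
}

print_commands
```

With this layout the genome files stay as humRef.* in the working directory, and whatever TopHat builds for the transcriptome would be expected under transcriptome/ with the prefix known.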
So is the syntax for Bowtie wrong? Shouldn't the transcriptome be provided in the special GTF format, which is also the first argument of the command? Might the problem be in the FASTA file itself? What should its structure be? Should the header line be ">humRef", or is that not relevant?
Yes, the transcriptome is provided in the gtf file. Bowtie/Tophat then makes a bowtie2 index of the transcriptome, and also creates other files including a fasta file of the transcriptome.
I'm sure that your dna.toplevel.fa file of the genome will have proper fasta header lines (lines starting with > ) for each of the chromosome sequences/contigs in the file. And each header line should have the name of one of the chromosomes/contigs.
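If you want to check this yourself, the sketch below lists the sequence names in the FASTA headers and reports any name used in the first column of the GTF that has no matching header. The function names are made up for this example, and it assumes the sequence name is the text between ">" and the first whitespace, which is how Ensembl FASTA headers are usually laid out.

```shell
#!/bin/sh
# Sequence names from FASTA headers: text after '>' up to the first whitespace.
fasta_names() {
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort -u
}

# Sequence names from column 1 of a GTF, skipping comment and empty lines.
gtf_names() {
    awk -F'\t' '!/^#/ && NF {print $1}' "$1" | sort -u
}

# Print names that occur in the GTF but not in the FASTA.
# Arguments: FASTA file, GTF file.
missing_in_fasta() {
    fa_list=$(mktemp)
    gtf_list=$(mktemp)
    fasta_names "$1" > "$fa_list"
    gtf_names "$2" > "$gtf_list"
    # comm -13 keeps only the lines unique to the second (GTF) list.
    comm -13 "$fa_list" "$gtf_list"
    rm -f "$fa_list" "$gtf_list"
}

# Example call with the file names from this thread:
#   missing_in_fasta humRef.fa humRef.gtf
```

An empty result means every chromosome/contig named in the GTF has a header in the FASTA. Any names that are printed are annotation entries with no corresponding sequence.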
I also tried to follow the question
from Biostar and applied
sed 's/^chr//g' annotation.gtf > annotation.2.gtf
but I got error 1 again.
Don't change the names of the chromosomes in any file after creating the index. This can only lead to more trouble.