Question

Tophat Building transcriptome files syntax

8.6 years ago

marongiu.luigi ▴ 710

Hello,
I need to create the TopHat reference index of the human genome. I downloaded the GTF file (humRef.gtf) from Ensembl and used Bowtie 2 to index the human genome from a single FASTA file (dna.toplevel) containing all human chromosomes, which I renamed humRef.fa, as follows:

bowtie2-build -f ./humRef.fa humRef

This created the following files: humRef.1.bt2, humRef.2.bt2, humRef.3.bt2, humRef.4.bt2, humRef.rev.1.bt2 and humRef.rev.2.bt2.
I then ran TopHat with the following command:

tophat2 -G humRef.gtf --transcriptome-index=humRef_tr humRef

but I got the following error:

[2016-03-30 20:05:55] Building transcriptome files with TopHat v2.1.1
-----------------------------------------------
[2016-03-30 20:05:55] Checking for Bowtie
          Bowtie version:    2.2.6.0
[2016-03-30 20:06:44] Checking for Bowtie index files (genome)..
[2016-03-30 20:06:44] Checking for reference FASTA file
[2016-03-30 20:06:44] Building transcriptome data files humRef_tr/humRef
[2016-03-30 20:09:02] Building Bowtie index from humRef.fa
    [FAILED]
Error: Couldn't build bowtie index with err = 1

Could you please tell me what is wrong? It must be a syntax issue, but I don't know which one. I tried providing paths to the files and changing the names of the index files, but neither helped. All files are in the same directory.
I also tried removing the .fa extension from the humRef.fa file, following a suggestion found online, but the output was:

[2016-03-30 20:30:49] Checking for reference FASTA file
    Warning: Could not find FASTA file humRef.fa
[2016-03-30 20:30:49] Reconstituting reference FASTA file from Bowtie index

which I don't think is an efficient solution, since the file is available.
Thank you,
L

rna-seq Assembly • 6.4k views

ADD COMMENT • link updated 6.5 years ago by zx8754 12k • written 8.6 years ago by marongiu.luigi ▴ 710

The syntax of your TopHat command for this transcriptome-building run is correct. The problem may have been caused by using the prefix of the FASTA file name as your genome index name. Are all these files in the current working directory? Are you running all commands from the same user account?

ADD REPLY • link 8.6 years ago by GenoMax 146k

Actually, I started by distinguishing the files by suffix: humRef.fa for the reference FASTA, then _idx for the Bowtie index and _tr for the TopHat output. I got the same error, so I thought the files had to share the same name to be recognized by TopHat. The files are all in the same directory and the commands are run by the same user, one after the other.

ADD REPLY • link 8.6 years ago by marongiu.luigi ▴ 710

I think TopHat expects the transcriptome index to be in a separate directory, so possibly you should provide a directory name as well as a prefix name for the transcriptome index files.

ADD REPLY • link 8.6 years ago by mastal511 ★ 2.1k

But I also tried tophat2 -G ./humRef.gtf --transcriptome-index=humRef_tr ./humRef to indicate that the files were in the working directory, without success. I then copied all the index/fa/gtf files into a humRef directory so that TopHat could find the FASTA file wherever it looked, but the error was the same.

ADD REPLY • link 8.6 years ago by marongiu.luigi ▴ 710

Can you try creating the transcriptome index somewhere else? Like this:

$ tophat2 -G ./humRef.gtf --transcriptome-index=/some_other_path/humRef_tr ./humRef

ADD REPLY • link 8.6 years ago by GenoMax 146k

I tried with ... -index=./top/humRef_tr ...; the directory top was created and the usual files (humRef_th.fa, humRef_th.fa.tlst, humRef_th.gff, humRef_th.ver) were created inside it, but then "Building Bowtie index from humRef_th.fa [FAILED]" appeared.

ADD REPLY • link 8.6 years ago by marongiu.luigi ▴ 710

I think part of the problem is that if everything works correctly you should end up with a bowtie index for the genome AND a different bowtie index for the transcriptome (and also a fasta file for the genome AND another fasta for the transcriptome), and your naming everything humRef just makes things very confusing.
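
One way to keep the two sets of files apart, as a rough sketch: the genome index keeps the root of the genome FASTA file (the log in the question shows TopHat looking for humRef.fa next to the humRef index), while the transcriptome files go under their own directory and prefix. The function name build_refs and the file names in the example call are made up for illustration; only options that already appear in this thread are used.

```shell
# Hypothetical helper, not part of Bowtie or TopHat.
build_refs() {
    # $1 = genome FASTA ending in .fa
    # $2 = GTF annotation
    # $3 = transcriptome index given as directory/prefix
    genome_prefix="${1%.fa}"    # genome index shares the FASTA root
    bowtie2-build -f "$1" "$genome_prefix" &&
    tophat2 -G "$2" --transcriptome-index="$3" "$genome_prefix"
}

# Example call (file names are placeholders):
# build_refs genome.fa genes.gtf transcriptome/genes
```

With this layout the genome files are genome.fa and genome.*.bt2, and everything TopHat derives from the GTF ends up under transcriptome/.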

ADD REPLY • link 8.6 years ago by mastal511 ★ 2.1k

So is the syntax for Bowtie wrong? Shouldn't the transcriptome be provided in the special GTF format, which is also the first argument of the command? Might the problem be in the FASTA file itself? What should its structure be? Should the header be ">humRef", or is that not relevant?

ADD REPLY • link 8.6 years ago by marongiu.luigi ▴ 710

Yes, the transcriptome is provided in the gtf file. Bowtie/Tophat then makes a bowtie2 index of the transcriptome, and also creates other files including a fasta file of the transcriptome.

I'm sure that your dna.toplevel.fa file of the genome will have proper fasta header lines (lines starting with > ) for each of the chromosome sequences/contigs in the file. And each header line should have the name of one of the chromosomes/contigs.
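
A quick way to check this, sketched under the assumption that the sequence name is the first word after ">" in each FASTA header and the first tab-separated column of the GTF. The helper name missing_seqnames is made up; it prints every sequence name used in the GTF that has no matching header in the FASTA file.

```shell
# Hypothetical helper: list GTF sequence names missing from the FASTA.
missing_seqnames() {
    # $1 = genome FASTA, $2 = GTF annotation
    awk -F'\t' '
        FNR == NR {                      # first file: the FASTA
            if (/^>/) {
                sub(/^>/, "")
                split($0, words, /[ \t]/)
                in_fasta[words[1]] = 1   # remember the sequence name
            }
            next
        }
        /^#/ || NF == 0 { next }         # skip GTF comments and blank lines
        !($1 in in_fasta) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}

# Example call with the file names from the question:
# missing_seqnames humRef.fa humRef.gtf
```

No output means every name in the GTF was found in the FASTA; any name that is printed (for example a "chr"-prefixed one) points to a mismatch between the two files.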

ADD REPLY • link 8.6 years ago by mastal511 ★ 2.1k

I also tried to follow the question

Question: Tophat Error : Couldn't build bowtie index with err = 1

from Biostar and applied sed 's/^chr//g' annotation.gtf > annotation.2.gtf, but I again got error 1.

ADD REPLY • link 8.6 years ago by marongiu.luigi ▴ 710

Don't change the names of the chromosomes in any file after creating the index. This can only lead to more trouble.

ADD REPLY • link 8.6 years ago by GenoMax 146k

score 1 · Answer 1 · 2016-04-02

Putting all these comments together, a solution emerged.
The main problem was not the availability of the FASTA file but its header lines. I therefore used Homo_sapiens.GRCh38.dna.toplevel.fa in order to match the GTF file I downloaded from Ensembl. A second issue was that the files need to share the same root (in the example: humRef); otherwise TopHat does not recognize the FASTA file and reconstructs a humRef.fa from the index files.
Thus, using the commands:

mv Homo_sapiens.GRCh38.dna.toplevel.fa.gz GRCh38.84.fa
mv Homo_sapiens.GRCh38.84.gtf GRCh38.84.gtf
bowtie2-build -f GRCh38.84.fa GRCh38.84
tophat2 -G GRCh38.84.gtf --transcriptome-index=GRCh38.84.tr GRCh38.84

I was able to build all the reference files.
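
A note on the first mv above: renaming a .fa.gz file to .fa does not decompress it. If the download is still gzip-compressed (an assumption based on the .gz suffix), it probably needs unpacking before indexing. A minimal sketch, assuming gzip is installed; plain_fasta is a made-up helper name:

```shell
# Hypothetical helper: write a plain-text copy of a FASTA file,
# decompressing it first when it is gzip-compressed.
plain_fasta() {
    # $1 = downloaded file, $2 = output .fa path
    if gzip -t < "$1" 2>/dev/null; then
        gzip -dc < "$1" > "$2"    # input is gzip data: decompress it
    else
        cp "$1" "$2"              # input is not gzip data: copy as is
    fi
}

# Example call with the file names used above:
# plain_fasta Homo_sapiens.GRCh38.dna.toplevel.fa.gz GRCh38.84.fa
```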
So case closed.
Thank you all.
L.