Question: Tophat Building transcriptome files syntax
3.1 years ago by
Germany, Mannheim, UMM
marongiu.luigi380 wrote:

Hello,
I need to create the TopHat reference index of the human genome. I therefore downloaded the GTF file (humRef.gtf) from Ensembl and used Bowtie 2 to build an index of the human genome from a single FASTA file (dna.toplevel) containing all human chromosomes, which I renamed humRef.fa, as follows:

bowtie2-build -f ./humRef.fa humRef

This created the following files: humRef.1.bt2, humRef.2.bt2, humRef.3.bt2, humRef.4.bt2, humRef.rev.1.bt2 and humRef.rev.2.bt2.
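
Before running TopHat, it can help to confirm that all six index files listed above are present under the chosen prefix. The following is a minimal sketch in Python, not part of the original post; the function name `missing_index_files` is hypothetical, and it only covers the small-index `.bt2` suffixes named above (Bowtie 2 writes `.bt2l` files instead for very large references, which this sketch ignores).

```python
from pathlib import Path

# Suffixes of the six files listed in the post (small-index format).
BT2_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")


def missing_index_files(prefix, directory="."):
    """Return the names of expected Bowtie 2 index files that are absent."""
    base = Path(directory)
    return [prefix + suffix
            for suffix in BT2_SUFFIXES
            if not (base / (prefix + suffix)).is_file()]
```

Calling `missing_index_files("humRef")` in the working directory returns an empty list when the index is complete.
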
I ran tophat with the following command:

tophat2 -G humRef.gtf --transcriptome-index=humRef_tr humRef

but I got the following error:

[2016-03-30 20:05:55] Building transcriptome files with TopHat v2.1.1
-----------------------------------------------
[2016-03-30 20:05:55] Checking for Bowtie
          Bowtie version:    2.2.6.0
[2016-03-30 20:06:44] Checking for Bowtie index files (genome)..
[2016-03-30 20:06:44] Checking for reference FASTA file
[2016-03-30 20:06:44] Building transcriptome data files humRef_tr/humRef
[2016-03-30 20:09:02] Building Bowtie index from humRef.fa
    [FAILED]
Error: Couldn't build bowtie index with err = 1

Could you please tell me what is wrong? It must be a syntax issue, but I don't know which one. I tried providing paths to the files and changing the names of the index files, but neither helped. All files are in the same directory.
I also tried removing the .fa extension from the humRef.fa file, following a suggestion found online, but the output was:

[2016-03-30 20:30:49] Checking for reference FASTA file
    Warning: Could not find FASTA file humRef.fa
[2016-03-30 20:30:49] Reconstituting reference FASTA file from Bowtie index

This is a workaround that I don't think is efficient, since the FASTA file is already available.
Thank you,
L

rna-seq assembly • 2.6k views
modified 11 months ago by zx87547.1k • written 3.1 years ago by marongiu.luigi380

The syntax of your special TopHat run command is correct. The problem may have been caused by the unfortunate choice of using the prefix of the FASTA file name as your genome index name. Are all these files in the current working directory? Are you running all commands from the same user account?

written 3.1 years ago by genomax65k

Actually, I started by distinguishing the files by suffix: humRef.fa for the reference FASTA, then _idx for the Bowtie index and _tr for the TopHat output. I got the same error, so I thought the files had to share the same name in order to be recognized by TopHat. The files are all in the same directory and the commands were run by the same user, one after the other.

written 3.1 years ago by marongiu.luigi380

I think TopHat expects the transcriptome index to be in a separate directory, so possibly you should provide a directory name as well as a prefix name for the transcriptome index files.

written 3.1 years ago by mastal5112.0k

But I also tried tophat2 -G ./humRef.gtf --transcriptome-index=humRef_tr ./humRef to indicate that the files were in the working directory, without success. I also copied all the index/fa/gtf files into a humRef directory so that TopHat could find the FASTA file wherever it looked, but the error was the same.

written 3.1 years ago by marongiu.luigi380

Can you try creating the transcriptome index somewhere else? Like this:

$ tophat2 -G ./humRef.gtf --transcriptome-index=/some_other_path/humRef_tr ./humRef

written 3.1 years ago by genomax65k

I tried with ... -index=./top/humRef_tr ... and the directory top was created. Inside it, the usual humRef_tr.fa, humRef_tr.fa.tlst, humRef_tr.gff and humRef_tr.ver files were created, but then "Building Bowtie index from humRef_tr.fa [FAILED]" appeared.

written 3.1 years ago by marongiu.luigi380

I think part of the problem is that if everything works correctly you should end up with a bowtie index for the genome AND a different bowtie index for the transcriptome (and also a fasta file for the genome AND another fasta for the transcriptome), and your naming everything humRef just makes things very confusing.

written 3.1 years ago by mastal5112.0k
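
The naming advice in the replies above (a separate directory for the transcriptome index, and a name distinct from the genome index) can be sketched as a small helper that assembles the command. This is an illustration in Python rather than anything from the thread: the function name `transcriptome_index_command` and the directory name in the usage note are hypothetical, and the check on matching names reflects the replies' suggestion, not a documented TopHat requirement. Only the `-G` and `--transcriptome-index` options already used in the thread appear.

```python
from pathlib import PurePosixPath


def transcriptome_index_command(gtf, genome_prefix, tr_dir, tr_prefix):
    """Build the tophat2 argument list for a transcriptome-index-only run.

    Refuses identical base names for the genome index and the transcriptome
    index, following the naming advice in the thread (a convention, not a
    rule enforced by TopHat).
    """
    if PurePosixPath(tr_prefix).name == PurePosixPath(genome_prefix).name:
        raise ValueError("use different names for the genome and transcriptome indexes")
    tr_index = str(PurePosixPath(tr_dir) / tr_prefix)
    return ["tophat2", "-G", gtf, f"--transcriptome-index={tr_index}", genome_prefix]
```

For example, `transcriptome_index_command("humRef.gtf", "humRef", "transcriptome", "humRef_tr")` yields an argument list that could be passed to `subprocess.run(..., check=True)`.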

So is the syntax for Bowtie wrong? Shouldn't the transcriptome be provided in the special gtf format, which is also the first argument of the function? Might the problem be present in the fasta file itself? What should be the structure? Should the heading have ">humRef" or is that not relevant?

written 3.1 years ago by marongiu.luigi380

Yes, the transcriptome is provided in the gtf file. Bowtie/Tophat then makes a bowtie2 index of the transcriptome, and also creates other files including a fasta file of the transcriptome.

I'm sure that your dna.toplevel.fa file of the genome will have proper fasta header lines (lines starting with > ) for each of the chromosome sequences/contigs in the file. And each header line should have the name of one of the chromosomes/contigs.

written 3.1 years ago by mastal5112.0k
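
Whether the FASTA headers and the GTF agree on chromosome/contig names can be checked directly. The sketch below, in Python, is not from the thread and its function names are hypothetical. It assumes uncompressed input files, that the sequence name is the first whitespace-delimited token after `>`, and that the GTF is tab-separated with the sequence name in the first column and comment lines starting with `#`.

```python
def fasta_names(path):
    """Collect sequence names: the first token after '>' on each header line."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_seqnames(path):
    """Collect the first-column sequence names, skipping comments and blanks."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def names_missing_from_fasta(fasta_path, gtf_path):
    """Return GTF sequence names that have no matching FASTA header, sorted."""
    return sorted(gtf_seqnames(gtf_path) - fasta_names(fasta_path))
```

An empty result from `names_missing_from_fasta("humRef.fa", "humRef.gtf")` means every sequence name used in the annotation also appears in the genome file; names such as `chr1` versus `1` would show up as mismatches.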

I also tried to follow the question

Question: Tophat Error : Couldn't build bowtie index with err = 1

from Biostar and applied sed 's/^chr//g' annotation.gtf > annotation.2.gtf, but I got error 1 again.

written 3.1 years ago by marongiu.luigi380

Don't change the names of the chromosomes in any file after creating the index. This can only lead to more trouble.

written 3.1 years ago by genomax65k

3.1 years ago by
Germany, Mannheim, UMM
marongiu.luigi380 wrote:

So, putting all these comments together, a solution emerged.
The main problem was not the availability of the FASTA file but its headers. I therefore used Homo_sapiens.GRCh38.dna.toplevel.fa so that the sequence names match the GTF file I downloaded from Ensembl. A second issue was that the files need to share the same root (in the example: humRef); otherwise TopHat does not recognize the FASTA file and reconstructs a humRef.fa from the index files.
Thus, using the commands:

mv Homo_sapiens.GRCh38.dna.toplevel.fa.gz GRCh38.84.fa
mv Homo_sapiens.GRCh38.84.gtf GRCh38.84.gtf
bowtie2-build -f GRCh38.84.fa GRCh38.84
tophat2 -G GRCh38.84.gtf --transcriptome-index=GRCh38.84.tr GRCh38.84

I was able to build all the reference files.
So case closed.
Thank you all.
L.

written 3.1 years ago by marongiu.luigi380
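
One detail in the commands above: the first `mv` gives a `.fa.gz` download a `.fa` name, and renaming alone does not decompress a file. The post does not say whether the file had already been unpacked (for instance with `gunzip`), so the following Python sketch simply tests for the two-byte gzip signature; the function name `looks_gzipped` is hypothetical.

```python
def looks_gzipped(path):
    """Return True if the file starts with the gzip magic number (0x1f 0x8b)."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"
```

If `looks_gzipped("GRCh38.84.fa")` returns `True`, the file is still compressed despite its name and would need decompressing before use as plain FASTA.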

If you are happy with this solution, go ahead and "accept" your own answer (check mark).

modified 3.1 years ago • written 3.1 years ago by genomax65k