Question: Tophat syntax to align query sequences to reference index
3.5 years ago, marongiu.luigi (Germany, Mannheim, UMM) wrote:

Dear all,
I need to align some fastq files to the human genome. I prepared the reference index files and transcriptomes with bowtie and tophat as follows:

bowtie2-build -f GRCh38.84.fa GRCh38.84
tophat2 -p 32 -G GRCh38.84.gtf --transcriptome-index=GRCh38.84.tr GRCh38.84

This created the files GRCh38.84.1.bt2l, GRCh38.84.2.bt2l, GRCh38.84.3.bt2l, GRCh38.84.4.bt2l, GRCh38.84.rev.1.bt2l and GRCh38.84.rev.2.bt2l, plus the GRCh38.84.tr folder with TopHat's files.
I removed the Illumina adapters with trimmomatic from the input files:

java -jar /usr/bin/trimmomatic.jar PE -threads 16 -phred33 input1.fastq input2.fastq i1_paired.fastq  i2_paired.fastq   i1_unpaired.fastq  i2_unpaired.fastq  ILLUMINACLIP:./IlluminaTags/TruSeq_RNA.fa:2:30:10:1:true

then I ran Tophat:

tophat2 -o outputFolder -G GRCh38.84.gtf -p 32 --transcriptome-index=GRCh38.84.tr -p 32  i1_paired.fastq i2_paired.fastq

but the output was:

[2016-04-03 11:19:42] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2016-04-03 11:19:42] Checking for Bowtie
          Bowtie version:    2.2.6.0
[2016-04-03 11:19:43] Checking for Bowtie index files (transcriptome)..
[2016-04-03 11:19:43] Checking for Bowtie index files (genome)..
Error: Could not find Bowtie 2 index files (i1_paired.fastq.*.bt2l)

I also provided the unpaired files with

tophat2 -o outputFolder -G GRCh38.84.gtf -p 32 --transcriptome-index=GRCh38.84.tr -p 32  i1_paired.fastq, i1_unpaired.fastq i2_paired.fastq, i2_unpaired.fastq

and the untrimmed files:

tophat2 -o outputFolder -G GRCh38.84.gtf -p 32 --transcriptome-index=GRCh38.84.tr -p 32   input1.fastq input2.fastq

but the result was the same.
What am I getting wrong? Do I really need to index the query files with bowtie as well? But in that case, what would be the use of tophat? And against what should I index the files? The human genome?
Thank you
L

rna-seq alignment
modified 3.5 years ago by mastal511 • written 3.5 years ago by marongiu.luigi

Are the GRCh38.84.tr index files located in the current folder? If not, you will need to provide the full (or relative) path to the folder containing those files.
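
A quick way to check is something like the sketch below. The helper name has_bt2_index is made up for illustration, and the GRCh38.84 prefix and GRCh38.84.tr path are taken from the question. It only looks for the first index file (.1.bt2, or .1.bt2l for a large index), so treat it as a rough test rather than a full validation.

```shell
# Hypothetical helper: succeeds if the first Bowtie 2 index file exists
# for the given prefix (small index: .1.bt2, large index: .1.bt2l).
has_bt2_index() {
  [ -e "$1.1.bt2" ] || [ -e "$1.1.bt2l" ]
}

# Genome index prefix from the question.
if has_bt2_index GRCh38.84; then
  echo "genome index found for prefix GRCh38.84"
else
  echo "no GRCh38.84.1.bt2 or GRCh38.84.1.bt2l in $(pwd): use the full path to the prefix"
fi

# Transcriptome index location from the question.
if [ -e GRCh38.84.tr ]; then
  echo "GRCh38.84.tr is present in $(pwd)"
else
  echo "GRCh38.84.tr is not in $(pwd): use the full path in --transcriptome-index"
fi
```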

written 3.5 years ago by genomax
3.5 years ago, mastal511 wrote:

First of all, the order of output files you have given in the Trimmomatic command is incorrect if you actually ran the command as shown above.

Trimmomatic syntax is ' ... paired1.fq unpaired1.fq paired2.fq unpaired2.fq ...'
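
Applied to the command in the question, the reordering would look roughly like this. It is only a sketch: the jar path, thread count, file names and ILLUMINACLIP settings are copied unchanged from the question, and the function name trim_pe is made up for illustration. The leading echo makes it a dry run that prints the command instead of launching Trimmomatic.

```shell
# Dry run: "echo" only prints the command; delete it to actually run Trimmomatic.
# Paths, thread count and ILLUMINACLIP settings are copied from the question.
trim_pe() {
  # $1, $2 = input mate files
  # Outputs are ordered: paired1 unpaired1 paired2 unpaired2
  echo java -jar /usr/bin/trimmomatic.jar PE -threads 16 -phred33 \
    "$1" "$2" \
    i1_paired.fastq i1_unpaired.fastq \
    i2_paired.fastq i2_unpaired.fastq \
    ILLUMINACLIP:./IlluminaTags/TruSeq_RNA.fa:2:30:10:1:true
}

trim_pe input1.fastq input2.fastq
```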

In answer to your question, I think the syntax of your tophat2 command is incorrect: tophat expects the prefix of the bowtie2 (genome) index in the position where you gave the names of the input fastq files. If you add the path and prefix of the bowtie2 (genome) index before the list of input files, I think your command should work.
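
As a sketch of what I mean, assuming the index prefix is GRCh38.84 (the name given to bowtie2-build in the question) and that everything sits in the current folder, the call would be arranged like this. The function name tophat_paired is made up for illustration, and the leading echo makes it a dry run that only prints the command.

```shell
# Dry run: "echo" only prints the command; delete it to launch the alignment.
tophat_paired() {
  # $1 = Bowtie 2 genome index prefix, $2 = mate-1 reads, $3 = mate-2 reads
  # Options come first, then the index prefix, then the read files.
  echo tophat2 -o outputFolder -G GRCh38.84.gtf \
    --transcriptome-index=GRCh38.84.tr -p 32 \
    "$1" "$2" "$3"
}

tophat_paired GRCh38.84 i1_paired.fastq i2_paired.fastq
```

If the index lives somewhere else, the first argument would be the path plus prefix rather than the bare prefix.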


But all the files are in the same folder. And why is it asking for i1_paired.fastq.*.bt2l, since I did not index it? (Thanks also for the Trimmomatic tip...)

written 3.5 years ago by marongiu.luigi

It doesn't want you to index the fastq files. It is asking for 'i1_paired.fastq.*.bt2l' because that is the name it finds in the part of the command where it expects the name of the bowtie genome index to be. It then gives an error because, of course, it doesn't find an index with that name.

If you look at the tophat manual,

https://ccb.jhu.edu/software/tophat/manual.shtml

In the section 'Using Tophat' , it gives the 'usage' or syntax for the tophat command as:

Usage: tophat [options]* <genome_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]

Note that the parameters/arguments -G and --transcriptome-index are considered as 'options' (see the heading 'Supplying your own transcript annotation data:').
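
The bracketed parts of the usage line mean that several files for the same mate go in as one comma-separated argument with no spaces, which is also why a command with a space after the comma gets split into the wrong positions. A small sketch of building such lists is below. The lane file names and the join_commas helper are hypothetical, GRCh38.84 is the index prefix from the question, and the echo makes the last line a dry run.

```shell
# Hypothetical helper: join its arguments with commas and no spaces.
join_commas() (
  IFS=,
  printf '%s\n' "$*"
)

# Hypothetical file names: two lanes of the same paired-end library,
# listed in the same order for mate 1 and mate 2.
reads_1=$(join_commas lane1_1.fastq lane2_1.fastq)
reads_2=$(join_commas lane1_2.fastq lane2_2.fastq)

# Dry run: "echo" only prints the command; delete it to launch the alignment.
echo tophat2 -o outputFolder -p 32 GRCh38.84 "$reads_1" "$reads_2"
```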

written 3.5 years ago by mastal511

Thank you mastal511, I tried with tophat2 -o outputFolder -G GRCh38.84.gtf --transcriptome-index=GRCh38.84.tr -p 32 GRCh38.84 i1_paired.fastq i2_paired.fastq and it seems to go through all right (although it is taking ages).

written 3.5 years ago by marongiu.luigi

Alignments are not like creating indexes. Depending on the hardware you are using and size of your dataset it may be several hours before this job will complete.

written 3.5 years ago by genomax