Question

Tophat syntax to align query sequences to reference index

8.0 years ago

marongiu.luigi ▴ 710

Dear all,
I need to align some fastq files to the human genome. I prepared the reference index files and transcriptomes with bowtie and tophat as follows:

bowtie2-build -f GRCh38.84.fa GRCh38.84
tophat2 -p 32 -G GRCh38.84.gtf --transcriptome-index=GRCh38.84.tr GRCh38.84

This created the files GRCh38.84.1.bt2l, GRCh38.84.2.bt2l, GRCh38.84.3.bt2l, GRCh38.84.4.bt2l, GRCh38.84.rev.1.bt2l and GRCh38.84.rev.2.bt2l, plus the GRCh38.84.tr folder with tophat's files.
I removed the Illumina adapters with trimmomatic from the input files:

java -jar /usr/bin/trimmomatic.jar PE -threads 16 -phred33 input1.fastq input2.fastq i1_paired.fastq  i2_paired.fastq   i1_unpaired.fastq  i2_unpaired.fastq  ILLUMINACLIP:./IlluminaTags/TruSeq_RNA.fa:2:30:10:1:true

then I ran Tophat:

tophat2 -o outputFolder -G GRCh38.84.gtf -p 32 --transcriptome-index=GRCh38.84.tr -p 32  i1_paired.fastq i2_paired.fastq

but the output was:

[2016-04-03 11:19:42] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2016-04-03 11:19:42] Checking for Bowtie
          Bowtie version:    2.2.6.0
[2016-04-03 11:19:43] Checking for Bowtie index files (transcriptome)..
[2016-04-03 11:19:43] Checking for Bowtie index files (genome)..
Error: Could not find Bowtie 2 index files (i1_paired.fastq.*.bt2l)

I also provided the unpaired files with

tophat2 -o outputFolder -G GRCh38.84.gtf -p 32 --transcriptome-index=GRCh38.84.tr -p 32  i1_paired.fastq, i1_unpaired.fastq i2_paired.fastq, i2_unpaired.fastq

and the untrimmed files:

tophat2 -o outputFolder -G GRCh38.84.gtf -p 32 --transcriptome-index=GRCh38.84.tr -p 32   input1.fastq input2.fastq

but the result was the same.
What am I getting wrong? Do I really need to index the query files with bowtie as well? In that case, what would be the use of tophat? And against what should I index the files? The human genome?
Thank you
L

rna-seq alignment • 3.1k views

ADD COMMENT • link updated 8.0 years ago by mastal511 ★ 2.1k • written 8.0 years ago by marongiu.luigi ▴ 710

Are the GRCh38.84.tr index files located in the current folder? If not, you will need to provide the full or relative path to the folder containing those files.
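
One quick way to check this from the shell is sketched below. The helper name `has_bt2_index` is made up for illustration; it only tests whether the first Bowtie 2 index file for a given prefix exists, in either the small (`.bt2`) or large (`.bt2l`) form. The example uses the genome prefix from the question.

```shell
# Hypothetical helper (not part of bowtie2 or tophat): succeeds if the first
# Bowtie 2 index file for the given prefix can be found. The prefix may
# include a full or relative path.
has_bt2_index() {
    prefix=$1
    # bowtie2-build writes <prefix>.1.bt2 (small index) or <prefix>.1.bt2l (large index).
    [ -e "${prefix}.1.bt2" ] || [ -e "${prefix}.1.bt2l" ]
}

if has_bt2_index GRCh38.84; then
    echo "index found"
else
    echo "index not found here; give the full or relative path to the prefix"
fi
```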

ADD REPLY • link 8.0 years ago by GenoMax 141k

score 0 · Answer 1 · 2016-04-03

8.0 years ago

mastal511 ★ 2.1k

First of all, the order of output files you have given in the Trimmomatic command is incorrect if you actually ran the command as shown above.

Trimmomatic syntax is ' ... paired1.fq unpaired1.fq paired2.fq unpaired2.fq ...'
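
Applied to the command in the question, the reordering would look roughly like this. It is a dry-run sketch that only prints the command; the variable names `outputs` and `trim_cmd` are just for illustration, and every path and setting is copied from the original post.

```shell
# Dry-run sketch: the command is printed, not executed.
# PE mode expects the outputs as: paired1 unpaired1 paired2 unpaired2.
outputs="i1_paired.fastq i1_unpaired.fastq i2_paired.fastq i2_unpaired.fastq"
trim_cmd="java -jar /usr/bin/trimmomatic.jar PE -threads 16 -phred33 input1.fastq input2.fastq $outputs ILLUMINACLIP:./IlluminaTags/TruSeq_RNA.fa:2:30:10:1:true"
echo "$trim_cmd"
```

With the order used in the question, the file named i2_paired.fastq would actually receive the unpaired reads from the first input, and i1_unpaired.fastq the paired reads from the second.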

In answer to your question, I think the syntax of your tophat2 command is incorrect: tophat expects the prefix of the bowtie2 (genome) index in the position where you gave the names of the input fastq files. If you add the path and prefix of the bowtie2 (genome) index before the list of input files, I think your command should then work.

ADD COMMENT • link 8.0 years ago by mastal511 ★ 2.1k

But all the files are in the same folder. And why is it asking for i1_paired.fastq.*.bt2l, since I did not index it? (Thanks also for the Trimmomatic tip...)

ADD REPLY • link 8.0 years ago by marongiu.luigi ▴ 710

It doesn't want you to index the fastq files. It is asking for 'i1_paired.fastq.*.bt2l' because that is the name it finds in the part of the command where it expects the name of the bowtie genome index to be. Then it gives an error because, of course, it doesn't find an index with that name.

If you look at the tophat manual,

https://ccb.jhu.edu/software/tophat/manual.shtml

In the section 'Using Tophat', it gives the 'usage' or syntax for the tophat command as:

Usage: tophat [options]* <genome_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]

Note that the parameters/arguments -G and --transcriptome-index are considered as 'options' (see the heading 'Supplying your own transcript annotation data:').
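
Mapped onto the file names from the question, that usage line would give something like the sketch below. It only prints the command; the variable names are for illustration, and GRCh38.84 is assumed to be the prefix that was passed to bowtie2-build.

```shell
# Dry-run sketch: prints the command instead of running it.
options="-o outputFolder -G GRCh38.84.gtf --transcriptome-index=GRCh38.84.tr -p 32"
# Prefix of the Bowtie 2 genome index, not a file name; add a path if it is elsewhere.
index_base="GRCh38.84"
reads="i1_paired.fastq i2_paired.fastq"
# Order follows the usage line: options, then the index base, then the reads.
tophat_cmd="tophat2 $options $index_base $reads"
echo "$tophat_cmd"
```

Without the index base, the first fastq name lands in the `<genome_index_base>` position, which is what produces the 'Could not find Bowtie 2 index files' error.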

ADD REPLY • link 8.0 years ago by mastal511 ★ 2.1k

Thank you mastal511, I tried with tophat2 -o outputFolder -G GRCh38.84.gtf --transcriptome-index=GRCh38.84.tr -p 32 GRCh38.84 i1_paired.fastq i2_paired.fastq and it seems to go through all right (although it is taking ages).

ADD REPLY • link 8.0 years ago by marongiu.luigi ▴ 710

Alignments are not like creating indexes. Depending on the hardware you are using and the size of your dataset, it may be several hours before this job completes.

ADD REPLY • link 8.0 years ago by GenoMax 141k