Question: Building Bowtie index failure
0
gravatar for Cricket
3.5 years ago by
Cricket10
Cricket10 wrote:

Greetings: I am using Tophat2 (command line) to analyze RNA-seq data and I am encountering some errors.

Here is the call:

tophat2 -o tophat2_results/ -G ref_data/BA000007.2.gtf --transcriptome-index=transcriptome_data/RNA_LBG01b_241_filteredQ indices/BA000007.2 data_files/RNA_LBG01b_241_filteredQ.fastq

Here is the error:

[2015-12-29 12:58:33] Checking for Bowtie
          Bowtie version:     2.2.4.0
[2015-12-29 12:58:33] Checking for Bowtie index files (genome)..
[2015-12-29 12:58:33] Checking for reference FASTA file
[2015-12-29 12:58:33] Generating SAM header for indices/BA000007.2
[2015-12-29 12:58:33] Reading known junctions from GTF file
    Warning: TopHat did not find any junctions in GTF file
[2015-12-29 12:58:33] Preparing reads
     left reads: min. length=12, max. length=342, 202732 kept reads (1315 discarded)
Warning: short reads (<20bp) will make TopHat quite slow and take large amount of memory because they are likely to be mapped in too many places
[2015-12-29 12:58:39] Building transcriptome data files transcriptome_data/RNA_LBG01b_241_filteredQ
[2015-12-29 12:58:40] Building Bowtie index from RNA_LBG01b_241_filteredQ.fa
    [FAILED]
Error: Couldn't build bowtie index with err = 1

Version Information:

TopHat v2.1.0        Bowtie2 version 2.2.4              Python 2.7.10 :: Anaconda 2.4.0 (64-bit)

System Information:

CentOS Release 6.7

How I got here and what have I tried:

I am using E. coli (Accession: BA000007.2) for my reference genome which can be found here: http://www.ncbi.nlm.nih.gov/nuccore/BA000007.2

I obtained my GTF file from Ensemble (ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//gtf/bacteria_9_collection/escherichia_coli_o157_h7_str_sakai/)

I made my indices using bowtie2-build (before tophat2 call)

bowtie2-build -f ref_data/BA000007.2.fasta indices/BA000007.2

I am aware that the error I am receiving is affiliated with different names appearing in the first column in the *.gtf file and the name of the reference fasta file. If I understand this correctly, every entry in the 1st column should be BA000007.2 where most of the names in the 1st column where "Chromosome". To fix this, I did the following:

awk '{FS=OFS="\t"}{print "BA000007.2", $2, $3, $4, $5, $6, $7, $8, $9}' pathToGTF/BA000007.2_ensemble.gtf > pathToGTF/BA000007.2.gtf 

#please note the commented build information (e.g., #!genome-build ASM80120v1) at the beginning of ensemble gtf file would create undesirable output from the awk command has been addressed 

I also changed the termination of the fasta file from *.fasta to *.fa

And to make sure that bowtie can access the fasta reference file, the reference file is not only in the directory with the gtf file, but ALL immediate subdirectories!

Questions:

1. Did I properly put the kibosh on any problems arising from differences in naming between the 1st column of the gtf file and the name of the fasta file (BA000007.2, BA000007.2.fa)?

2. When I peruse output in the logs directory, there are several errors (g2f.err & similar errors in ftf_juncs.log) with lines beginning with:

Warning: invalid start coordinate at line: 
BA000007.2    ena    gene    -194    2502    .    +    .    gene_id "BAA31757"; gene_version "1"; gene_name "tagA"; gene_source "ena"; gene_biotype "protein_coding";

There are indeed negative numbers in the gtf files, but not in the genbank file (quick search in vim). Could this be the source of the error? I commented out the specific lines and then deleted them from the file -- both approaches still result in the error.

3. Is there anything readily seen that could be causing the "Couldn't build bowtie index with err = 1" error?

 

I have been stuck on this for a couple of days so any help is greatly appreciated.

rna-seq tophat2 bowtie2 • 3.4k views
ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Cricket10
Hi;
have you tried    "bowtie2-build -f ref_data/BA000007.2.fasta ref_data/BA000007.2"    instead of your previous command line "bowtie2-build -f ref_data/BA000007.2.fasta indices/BA000007.2"
ADD REPLYlink written 3.5 years ago by Mehmet470

Since the files would be the same, I copied the index files from indices directory to ref_data directory. I re-ran it with the same error.
 

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Cricket10
1
gravatar for Cricket
3.5 years ago by
Cricket10
Cricket10 wrote:

Found the source of the problem...it was my referential fasta file. The header needed to be ">BA000007". Changing that solved everything.

ADD COMMENTlink written 3.5 years ago by Cricket10
0
gravatar for biocyberman
3.5 years ago by
biocyberman770
Denmark
biocyberman770 wrote:

Make sure you have chromosome name the same in the GTF and the sequence ID in the reference fasta file. They look different to me.

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by biocyberman770

Thank you biocyberman for your response.  I believe I addressed your suggestion already with the awk command -- that was the entire purpose of the code. After running the command, all entries in the first column will be BA000007.2. The name of my reference fasta file is BA000007.2.fa.

If this is not the correct approach please let me know.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Cricket10

The awk command may not give what you expect. Check the output. I tried the command on the gtf file and it replaces the comment lines as well. Try this one: awk 'BEGIN{OFS="\t"}($0 !~/^#/){$1 = "BA000007"} {print}'  (I removed the .2 vesion number, just in case other tools do not like the dot). After that rerun the tophat command that tries to create the transcriptome index. Or, run tophat without specify --transcriptome-index option.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by biocyberman770

Biocyberman -- thank you for your attention and assistance.

 I am aware that the first lines of the file will not allow me to do what I intended with the awk command. I dealt with it already and mentioned it within the text of my question (#please note the commented build information (e.g., #!genome-build ASM80120v1) at the beginning of ensemble gtf file would create undesirable output from the awk command has been addressed).

As for your other suggestions...

I removed the ".2" in vim (:%s/^BA000007.2/BA000007/g) from the 1st column in the gtf file. I changed the name of the fasta file to reflect this and re-ran it and received an identical error.

I removed the --transcriptome-index from the code, ran it again, and received an identical error.

Do you have any other suggestions?

 

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Cricket10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 951 users visited in the last hour