Question

Tophat 2.1.1 failing to recognize fasta file

0

Entering edit mode

7.0 years ago

kallen83 ▴ 10

I'm trying to align a set of reads in fasta format to a reference genome (indexed with bowtie2-build), but tophat is failing with the following error:

tophat thc_genomic 5369.fa

[2017-04-20 10:14:53] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2017-04-20 10:14:53] Checking for Bowtie
          Bowtie version:    2.3.1.0
[2017-04-20 10:14:53] Checking for Bowtie index files (genome)..
[2017-04-20 10:14:53] Checking for reference FASTA file
[2017-04-20 10:14:53] Generating SAM header for thc_genomic
[2017-04-20 10:14:53] Preparing reads
     left reads: min. length=95, max. length=101, 4488 kept reads (0 discarded)
[2017-04-20 10:14:54] Mapping left_kept_reads to genome thc_genomic with Bowtie2 
    [FAILED]
Error running bowtie:
Error: reads file does not look like a FASTQ file
Error: Encountered exception: 'Unidentified exception'
Command: /usr/local/bin/../Cellar/bowtie2/2.3.0/bin/bowtie2-align-s --wrapper basic-0 -k 20 -D 15 -R 2 -N 0 -L 20 -i S,1,1.25 --gbar 4 --mp 6,2 --np 1 --rdg 5,3 --rfg 5,3 --score-min C,-14,0 -p 1 --sam-no-hd -x thc_genomic - 
(ERR): bowtie2-align exited with value 1

(sorry the cut and paste is making that a little hard to read).

I agree that the input file doesn't look like fastq because actually, it's not. It's a fasta file, and the docs say that tophat now auto-detects fasta and behaves accordingly. I don't see a command line flag to force it to treat the input file as fasta -- what am I missing?

My input read file looks like this:

>gnl|SRA|SRR352208.104462060.1 HWI-ST600:7:68:16692:76873
TAAGTTTAATAAATCTAAGTCAAAGTTCAATATCGGTCCCTTCCGATGCATACTCCATGCATCCAACCTG
AGCTTTACTTTAACCAATGCTCTGGAAAGAA

>gnl|SRA|SRR352208.103232838.2 HWI-ST600:7:67:16843:150317
AACTAGGCGATATTGACTTGAATAGATATTACGGTAAGTTTAATAAATCTAAGTCAAAGTTCAATATCGG
TCCCTTCCGATGCATACTCCATGCATCCAAC

Surely I'm missing a command line flag, or does tophat need the fasta id line to be formatted a specific way?

Thanks

RNA-Seq alignment • 2.8k views

ADD COMMENT • link updated 7.0 years ago by GenoMax 141k • written 7.0 years ago by kallen83 ▴ 10

0

Entering edit mode

I have formatted your code correctly. In future use the icon shown below (after highlighting the text you want to format as code) when editing (Screenshot courtsey of @Wouter).

ScreenCap

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Entering edit mode

I know by personal experience that Tophat2 is picky about read names. Try with a small subset of reads: rename them like ">read1" to ">readN" and see if it works! Perhaps it doesn't like the illumina read name format in a FASTA but only in a FASTQ.

ADD REPLY • link 7.0 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Yeah, I tried that, with the id lines changed just the way you suggested, but I get the same result.

ADD REPLY • link 7.0 years ago by kallen83 ▴ 10

0

Entering edit mode

Are those actual white lines between your fasta records?

You should also know that Tophat is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. (If you can't get access to that publication, let me know and I'll -cough- help you.) There are also other alternatives, including alignment with STAR and bbmap,...

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k

1

Entering edit mode

the blank line between the two fasta records is an artifact of cutting and pasting from a terminal window. It's not present in the actual fasta file.

But your second point is very interesting, I didn't know the field moved on, I'm going to read that paper right now (and I have -cough- ways -cough- of getting around most access restrictions, but I appreciate the offer).

Thanks for the advice.

ADD REPLY • link 7.0 years ago by kallen83 ▴ 10

0

Entering edit mode

Any splice-aware aligner (BBMap, STAR are good options) along with featureCounts from Subread package would be a great option to look into.

ADD REPLY • link 7.0 years ago by GenoMax 141k