TopHat mapping error
1
0
2.7 years ago
gokberk ▴ 70

Hi all,

I've been trying to analyze old SOLiD-seq data with TopHat-1.4.1/Bowtie-1.0.1/samtools-0.1.19. However, when I run ./tophat bowtie_index/macaca_fascicularis_5.0_genome fastq/E08/SRR2930200.fastq, I get the following output:

[Fri Apr  5 17:48:47 2019] Beginning TopHat run (v1.4.1)
-----------------------------------------------
[Fri Apr  5 17:48:47 2019] Preparing output location ./tophat_out/
[Fri Apr  5 17:48:47 2019] Checking for Bowtie index files
[Fri Apr  5 17:48:47 2019] Checking for reference FASTA file
[Fri Apr  5 17:48:47 2019] Checking for Bowtie
Bowtie version:          1.0.1.0
[Fri Apr  5 17:48:47 2019] Checking for Samtools
Samtools Version: 0.1.19
[Fri Apr  5 17:48:47 2019] Generating SAM header for ../bowtie_index/macaca_fascicularis_5.0_genome
format:      fastq
quality scale:   phred33 (default)
[Fri Apr  5 17:48:49 2019] Preparing reads
[Fri Apr  5 17:49:05 2019] Mapping left_kept_reads against macaca_fascicularis_5.0_genome with Bowtie

gzip: stdout: Broken pipe
[Fri Apr  5 17:49:07 2019] Processing bowtie hits
Warning: junction database is empty!
[Fri Apr  5 17:50:55 2019] Processing bowtie hits
[FAILED]


Here is what the log files say:

Reads file contained a pattern with more than 1024 quality values. Please truncate reads and quality values and and re-run Bowtie
terminate called after throwing an instance of 'int'


long_spanning_reads v1.4.1 (exported)
--------------------------------------------


prep_reads v1.4.1 (exported)
---------------------------
0 out of 756267 reads have been filtered out


sam_merge.log:

Warning: no input BAM records found.
GList error (GList.hh:970):Invalid list index: 0


Since I'm not familiar with RNAseq data analysis, I'm not sure how to fix this issue. Any help is appreciated.

Cheers!

bowtie tophat rna-seq • 978 views
3

Are you really sure you need to use TopHat? And such an old version?

Did you look at all the quality score characters used in your fastq? It looks like TopHat might not be handling fastqs based on colorspace correctly.
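One way to follow this suggestion is to scan the quality lines directly. The sketch below is only an illustration: it assumes a plain four-line-per-record FASTQ, and the function name `summarize_fastq` is made up for this example. It reports the number of records, the longest quality string, and the set of quality characters seen, which can be compared against the 1024-value limit mentioned in the Bowtie log.

```python
from typing import Iterable, Set, Tuple


def summarize_fastq(lines: Iterable[str]) -> Tuple[int, int, Set[str]]:
    """Summarize the quality lines of a FASTQ stream.

    Assumes exactly four lines per record (header, sequence, '+', quality).
    Returns (record count, longest quality string, quality characters seen).
    """
    n_records = 0
    max_qual_len = 0
    qual_chars: Set[str] = set()
    for i, line in enumerate(lines):
        if i % 4 == 3:  # every fourth line is the quality string
            qual = line.rstrip("\r\n")
            n_records += 1
            max_qual_len = max(max_qual_len, len(qual))
            qual_chars.update(qual)
    return n_records, max_qual_len, qual_chars


# Tiny in-memory example; for a real file, pass an open file handle instead:
#     with open("fastq/E08/SRR2930200.fastq") as fh:
#         print(summarize_fastq(fh))
example = ["@r1\n", "T0120\n", "+\n", "!!5I\n", "@r2\n", "T31\n", "+\n", "#;\n"]
print(summarize_fastq(example))
```

If the longest quality string is far larger than the expected read length, or the character set includes spaces or digits separated by whitespace, the file is probably not in the layout the aligner expects.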

1

Go for HISAT2; it is faster and much better than TopHat v1.

0

Yeah, you might use HISAT2 or STAR instead of TopHat.

5
2.7 years ago
h.mon 33k

As you have SOLiD reads, you need a colorspace aligner. You should probably use Subread: as far as I know, it is the only currently maintained aligner that supports colorspace mapping. Converting colorspace to basespace is a bad idea; see "Convert colorspace fastq to basespace fastq" and references therein.
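To illustrate why direct conversion is risky, here is a minimal sketch of naive colorspace decoding. It assumes the standard SOLiD two-base encoding, in which each color describes the transition between adjacent bases (0 = same base, 1 = A<->C or G<->T, 2 = A<->G or C<->T, 3 = A<->T or C<->G), and a read written as a primer base followed by color digits. The function name `decode_colorspace` is hypothetical.

```python
BASES = "ACGT"


def decode_colorspace(read: str) -> str:
    """Naively decode a colorspace read such as 'T0120' into bases.

    The first character is the known primer base; each following digit is
    the color of the transition to the next base. With A, C, G, T indexed
    0..3, the next base index is (current index XOR color).
    The primer base itself is not part of the returned sequence.
    """
    current = BASES.index(read[0])
    decoded = []
    for color in read[1:]:
        current ^= int(color)
        decoded.append(BASES[current])
    return "".join(decoded)


print(decode_colorspace("T0120"))  # TGAA
# A single miscalled color (first digit 0 -> 1) changes every later base:
print(decode_colorspace("T1120"))  # GTCC
```

Because each base depends on all colors before it, one color error corrupts the rest of the decoded read, whereas a colorspace-aware aligner can treat it as a single mismatch.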

0

Hi h.mon, thanks for your response. I saw this last time, but when I checked the GEO page of the data I'm trying to analyze, I saw that people had previously used these old versions of TopHat and Bowtie to analyze this dataset. So I thought I should go for those versions to be safe (I should also mention that I skipped adapter trimming and went directly to mapping). In any case, I'll try Subread as well, thanks!

0

So, I downloaded Subread-1.6.4, compiled it on my server, and now have another question. I've been trying to build a genome index with the command ./subread-buildindex -c -F -o macaca_fascicularis_5.0_index ../../bowtie_index/macaca_fascicularis_5.0_genome.fa and received the fancy output below:

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
        v1.6.4

//================================= setting ==================================\\
||                                                                            ||
||                Index name : macaca_fascicularis_5.0_index                  ||
||               Index space : color space                                    ||
||                    Memory : 8000 Mbytes                                    ||
||          Repeat threshold : 100 repeats                                    ||
||              Gapped index : no                                             ||
||                                                                            ||
||               Input files : 1 file in total                                ||
||                             o macaca_fascicularis_5.0_genome.fa            ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||


However, it's been stuck at this point for about two hours now, so I was wondering if something is wrong. Roughly how long does it take to build a genome index with Subread? The genome assembly I'm indexing is 3 GB.

1

As long as the process is consuming CPU cycles you need to be patient. It can take a while for the index creation on big genomes.
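A quick way to check this on Linux, besides watching top, is to read the process's accumulated CPU time twice and see whether it has grown. The sketch below assumes a Linux /proc filesystem; the function names and the example PID are made up for illustration.

```python
import time


def cpu_ticks_from_stat(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' before tokenizing the rest.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def is_consuming_cpu(pid: int, interval: float = 5.0) -> bool:
    """Report whether the process used any CPU time during the interval."""

    def read_ticks() -> int:
        with open(f"/proc/{pid}/stat") as fh:
            return cpu_ticks_from_stat(fh.read())

    before = read_ticks()
    time.sleep(interval)
    return read_ticks() > before


# Parsing demo on a hand-written stat line (values are illustrative only):
sample = "4242 (subread-buildin) R 1 4242 4242 0 -1 4194304 100 0 0 0 250 50 0 0 20 0 1 0"
print(cpu_ticks_from_stat(sample))  # 300
```

On a live system, `is_consuming_cpu(pid)` with the PID of the indexing job would return False if the job accumulated no CPU time during the interval, which suggests it is stalled or waiting rather than working.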

0

Thanks a lot, genomax. I've just checked: it wasn't using any CPU, so I stopped and restarted it.

This is how it looks now. Do you think it looks okayish, or is it using an abnormal amount of memory? I'm asking because the VIRT and RES numbers looked pretty scary to me. Could this be why the job dies after a while?

0

Okay, it is proceeding now; hopefully it will manage to generate the index.

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
        v1.6.4

//================================= setting ==================================\\
||                                                                            ||
||                Index name : macaca_fascicularis_5.0_index                  ||
||               Index space : color space                                    ||
||                    Memory : 8000 Mbytes                                    ||
||          Repeat threshold : 100 repeats                                    ||
||              Gapped index : no                                             ||
||                                                                            ||
||               Input files : 1 file in total                                ||
||                             o macaca_fascicularis_5.0_genome.fa            ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||
|| 601617 uninformative subreads were found.                                  ||
|| These subreads were excluded from index building.                          ||
|| Build the index...                                                         ||


Thanks!

0

How much memory do you have on this machine? It is not unusual to need ~30G of RAM for genomes the size of human. If your job runs out of memory you should see some indication of that.

0

I guess the total memory is 528361056 K on the server I'm working with.

1

Then you should be all set. Allow time for the indexing to complete. Check the logs to make sure there were no errors.
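For reference, the memory figure quoted above can be converted to GiB as sketched below. This assumes the number came from the MemTotal line of /proc/meminfo, where "kB" means units of 1024 bytes; that origin is a guess, and the function name is made up for this example.

```python
import re


def mem_total_gib(meminfo_text: str) -> float:
    """Extract MemTotal from /proc/meminfo-style text and convert to GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    # /proc/meminfo reports units of 1024 bytes; 1024**2 of them make one GiB.
    return int(match.group(1)) / 1024 ** 2


# Assumed input, built from the figure quoted in the thread:
sample = "MemTotal:       528361056 kB\n"
print(f"{mem_total_gib(sample):.1f} GiB")  # 503.9 GiB
```

That is roughly 500 GiB, comfortably above the ~30G mentioned earlier for human-sized genomes.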