Question: TopHat mapping error
0
gravatar for gokberk
7 weeks ago by
gokberk20
gokberk20 wrote:

Hi all,

I've been trying to analyze old SOLiD-seq data with TopHat-1.4.1/Bowtie-1.0.1/samtools-0.1.19. Hoever, when I run ./tophat bowtie_index/macaca_fascicularis_5.0_genome fastq/E08/SRR2930200.fastq, I receive the following output:

[Fri Apr  5 17:48:47 2019] Beginning TopHat run (v1.4.1)
-----------------------------------------------
[Fri Apr  5 17:48:47 2019] Preparing output location ./tophat_out/
[Fri Apr  5 17:48:47 2019] Checking for Bowtie index files
[Fri Apr  5 17:48:47 2019] Checking for reference FASTA file
[Fri Apr  5 17:48:47 2019] Checking for Bowtie
    Bowtie version:          1.0.1.0
[Fri Apr  5 17:48:47 2019] Checking for Samtools
    Samtools Version: 0.1.19
[Fri Apr  5 17:48:47 2019] Generating SAM header for ../bowtie_index/macaca_fascicularis_5.0_genome
    format:      fastq
    quality scale:   phred33 (default)
[Fri Apr  5 17:48:49 2019] Preparing reads
    left reads: min. length=51, count=756267
[Fri Apr  5 17:49:05 2019] Mapping left_kept_reads against macaca_fascicularis_5.0_genome with Bowtie 

gzip: stdout: Broken pipe
[Fri Apr  5 17:49:07 2019] Processing bowtie hits
Warning: junction database is empty!
[Fri Apr  5 17:50:55 2019] Processing bowtie hits
    [FAILED]
Error executing: /home/goekberk/tophat-1.4.1.Linux_x86_64/bam_merge ./tophat_out/tmp/left_kept_reads.candidates_and_unspl.bam ./tophat_out/tmp/left_kept_reads.candidates.bam ./tophat_out/tmp/left_kept_reads.unspl.bam

Here are what log files say:

bowtie.left_kept_reads.fixmap.log:

Reads file contained a pattern with more than 1024 quality values. Please truncate reads and quality values and and re-run Bowtie terminate called after throwing an instance of 'int'

long_spanning_read.log:

long_spanning_reads v1.4.1 (exported)
--------------------------------------------
Opening ./tophat_out/tmp/left_kept_reads.bwtout.z for reading
Loading reference sequences...
        reference sequences loaded.
Loading spliced hits...done
Loading junctions...done
Loading deletions...done

prep_reads.log:

prep_reads v1.4.1 (exported)
---------------------------
0 out of 756267 reads have been filtered out

sam_merge.log:

Warning: no input BAM records found.
GList error (GList.hh:970):Invalid list index: 0

Since I'm not familiar with RNAseq data analysis, I'm not sure how to fix this issue. Any help is appreciated.

Cheers!

rna-seq bowtie tophat • 247 views
ADD COMMENTlink modified 7 weeks ago by genomax67k • written 7 weeks ago by gokberk20
3

Are you really sure you need to use TopHat? And such an old version?

Did you look at all the quality score characters used in your fastq? It looks like TopHat might not be handling fastqs based on colorspace correctly.

ADD REPLYlink written 7 weeks ago by swbarnes25.6k
1

Go for HISAT2 faster and much better than Tophat V1.

ADD REPLYlink written 7 weeks ago by arup1.3k

ADD REPLYlink written 7 weeks ago by Pierre Lindenbaum120k

Yeah you might use HISAT2 or STAR instead of Tophat.

ADD REPLYlink written 7 weeks ago by chris86250
5
gravatar for h.mon
7 weeks ago by
h.mon25k
Brazil
h.mon25k wrote:

I will try again:

As you have SOLiD reads, you need a colorspace aligner, you should probably use Subread - it is the only currently maintained aligner that supperts colorspace mapping, as far as I know. It is a bad idea converting colorspace to basespace, see Convert colorspace fastq to basespace fastq and references therein.

ADD COMMENTlink modified 7 weeks ago • written 7 weeks ago by h.mon25k

Hi h.mon, thanks for your response, I saw this last time, but when I checked the GEO page of the data I'm trying to analyze, I saw that people used these old versions of TopHat and Bowtie to anaylze this dataset previously. So, I thought that I should go for those versions to be safe (I should also mention that I skipped adapter trimming and directly went for mapping). In anycase, I'll try Subread as well, thanks!

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by gokberk20

So, I downloaded Subread-1.6.4 and compiled it on my server and have another question. I've been trying to generate an index genome using ./subread-buildindex -c -F -o macaca_fascicularis_5.0_index ../../bowtie_index/macaca_fascicularis_5.0_genome.fa command and received the fancy output below:

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      v1.6.4

//================================= setting ==================================\\
||                                                                            ||
||                Index name : macaca_fascicularis_5.0_index                  ||
||               Index space : color space                                    ||
||                    Memory : 8000 Mbytes                                    ||
||          Repeat threshold : 100 repeats                                    ||
||              Gapped index : no                                             ||
||                                                                            ||
||               Input files : 1 file in total                                ||
||                             o macaca_fascicularis_5.0_genome.fa            ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||

However, it's been stuck at this point for about two hours now, so I was wondering if something is wrong. What is the approximate time for generating an index genome with Subread? The genome assembly I'm indexing is 3GB.

ADD REPLYlink written 6 weeks ago by gokberk20
1

As long as the process is consuming CPU cycles you need to be patient. It can take a while for the index creation on big genomes.

ADD REPLYlink written 6 weeks ago by genomax67k

Thanks a lot genomax, I've just checked it, it wasn't using any CPU so I stopped and restarted it.

This is how it looks now. Do you think it looks okayish or does it use an abnormal amount of memory? I'm asking because VIRT and RES numbers looked pretty scary to me, is this the reason why the job dies after a while?

ADD REPLYlink written 6 weeks ago by gokberk20

Okay it proceeds now, hopefully will manage generating the index.

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      v1.6.4

//================================= setting ==================================\\
||                                                                            ||
||                Index name : macaca_fascicularis_5.0_index                  ||
||               Index space : color space                                    ||
||                    Memory : 8000 Mbytes                                    ||
||          Repeat threshold : 100 repeats                                    ||
||              Gapped index : no                                             ||
||                                                                            ||
||               Input files : 1 file in total                                ||
||                             o macaca_fascicularis_5.0_genome.fa            ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||
|| 601617 uninformative subreads were found.                                  ||
|| These subreads were excluded from index building.                          ||
|| Build the index...                                                         ||

Thanks!

ADD REPLYlink written 6 weeks ago by gokberk20

How much memory do you have on this machine? It is not unusual to need ~30G of RAM for genomes the size of human. If your job runs out of memory you should see some indication of that.

ADD REPLYlink written 6 weeks ago by genomax67k

I guess the total memory is 528361056 K on the server I'm working with.

ADD REPLYlink written 6 weeks ago by gokberk20
1

Then you should be all set. Allow time for the indexing to complete. Check the logs to make sure there were no errors.

ADD REPLYlink written 6 weeks ago by genomax67k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1445 users visited in the last hour