tophat did not align all input reads
0
0
Entering edit mode
7.2 years ago
elainzp • 0

I'm having some bizarre experiences with tophat lately. I'm trying to align RNASeq reads after Trimmomatic with the command like this:

tophat -N 2 -p 24 -i 20 -G <gff3 file> --transcriptome-index=transcriptome_data/transcript_index <genomic index> Sample_trimmed_1P.gz Sample_trimmed_2P.gz

The prep_reads.info says:

left_min_read_len=50
left_max_read_len=101
left_reads_in =53082396
left_reads_out=53077267
right_min_read_len=50
right_max_read_len=101
right_reads_in =53082396
right_reads_out=53080471

Yes, this is the number of reads in my trimmed files.

However, only a small number of reads are aligned, and the alignment_summary is like this:

Left reads:
          Input     :    968581
           Mapped   :    913290 (94.3% of input)
            of these:     33332 ( 3.6%) have multiple alignments (420 have >20)
Right reads:
          Input     :    968629
           Mapped   :    915709 (94.5% of input)
            of these:     33417 ( 3.6%) have multiple alignments (344 have >20)
94.4% overall read mapping rate.

Aligned pairs:    895160
     of these:     31113 ( 3.5%) have multiple alignments
                   21492 ( 2.4%) are discordant alignments
90.2% concordant pair alignment rate.

Why is it not aligning all the reads that were input? Has this got something to do with the transcriptome index?

RNA-Seq • 1.9k views
ADD COMMENT
1
Entering edit mode

Unless I completely missed something, your alignment rate is good (94%) and your concordant pair alignment rate is also good (90%).

Your problem is the discrepancy between your Trimmomatic output (around 53 million reads) and your TopHat input (around 970 thousand reads). As Trimmomatic apparently has different output numbers for left and right reads, I would not trust this number. Did you run FastQC after Trimmomatic? What is the number of reads on the file after Trimmomatic? You may try:

wc -l cleaned_READ1.fastq

or

zcat cleaned_READ1.fastq.gz | wc -l

Divide by 4 and that is the number of reads you fed TopHat.

ADD REPLY
0
Entering edit mode

Thank you! You understood the question correctly. I have ~59 million pair-end reads in my raw file, after trimmomatic, there were ~53 million (I did fastqc). I ran tophat on my raw files, still the same problem. Alright, so I did more googling thinking it might be topcoat's problem (i.e. bugs) It turned out that the problem occur in the version of tophat that I use (v2.0.13) on biolinux. The losing reads issue arises when using -p>1. I'm updating my tophat to the latest version and hopefully the problem should go away.

ADD REPLY

Login before adding your answer.

Traffic: 2084 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6