reads diffreence in fastqc and Tophat input
0
1
Entering edit mode
7.3 years ago
zz105 ▴ 20

I am doing paired-end RNA-seq analysis, and I noticed that there are 2 files have different # of reads showing in fastqc and tophat summary. There two files also had problems running htseq-count. Does anyone know what is wrong?

Here is align_summary from tophat:

Left reads:
          Input     :   6833369
           Mapped   :   6205653 (90.8% of input)
            of these:    373370 ( 6.0%) have multiple alignments (2442 have >20)
Right reads:
          Input     :   6833369
           Mapped   :   6199565 (90.7% of input)
            of these:    373213 ( 6.0%) have multiple alignments (2441 have >20)

and the summary from fastqc:

 ##FastQC   0.10.1
>>Basic Statistics  pass
#Measure    Value   
Filename    SRR1559100_1.fastq.gz   
File type   Conventional base calls 
Encoding    Sanger / Illumina 1.9   
Total Sequences 88383287    
Filtered Sequences  0   
Sequence length 60-101  
%GC 48
RNA-Seq alignment sequence software error • 1.5k views
ADD COMMENT
2
Entering edit mode

Were the reads trimmed at some point? Tophat filters out reads that it considers to be too short, so they won't end up in its final output.

ADD REPLY
0
Entering edit mode

They were trimmed using trimmomatics before running both fastqc and tophat, and reads were dropped if their length were shorter than 60% of the read length

ADD REPLY
1
Entering edit mode

Also, if the reads contain Ns, they might not be counted (not sure, but is a suggestion).

ADD REPLY
0
Entering edit mode

I counted by grep -c 'N' file and the number does not seem to be close to the difference, but thanks for the suggesiton

ADD REPLY
1
Entering edit mode

Not sure this is the right way to do it. You might grep read names or quality lines as well this way. You should use for example bioawk to print only $seq lines and then grep -c 'N' from that output.

ADD REPLY

Login before adding your answer.

Traffic: 2754 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6