I am running kallisto quant with the following command:
kallisto quant -i gencodeV27 -o sample1 --bias --rf-stranded --genomebam --gtf gencode.v27lift37.annotation.gtf --chromosomes chromSize.txt sample1_1.fq sample1_2.fq
One summary line caught my attention. If I read it correctly (correct me if I am wrong), it says there is a total of 71,414,042 paired reads in my FASTQ files:
[quant] processed 71,414,042 reads, 65,645,163 reads pseudoaligned
The number of processed reads does not match the read count I get from trim_galore. I am using trimmed FASTQ files for this test, and the summary reported by trim_galore after processing is shown below. By that summary, I should have only 65,853,559 (= 66,089,021 - 235,462) reads in my input FASTQ for kallisto.
RUN STATISTICS FOR INPUT FILE: sample1.fastq.gz
=============================================
66089021 sequences processed in total
Total number of sequences analysed for the sequence pair length validation: 66089021
Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 235462 (0.36%)
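As a quick sanity check on the arithmetic, here is a small shell sketch. The variable names are made up for illustration, and the numbers are copied from the kallisto and trim_galore summaries quoted above.

```shell
# Counts copied from the two log excerpts in this post.
trimgalore_total=66089021      # sequences processed by trim_galore
trimgalore_removed=235462      # pairs removed by the length cutoff
kallisto_processed=71414042    # "processed" figure from kallisto

# What trim_galore's summary implies should be left after trimming.
expected=$((trimgalore_total - trimgalore_removed))
# How far kallisto's figure is above that.
surplus=$((kallisto_processed - expected))

echo "expected after trimming: $expected"
echo "kallisto minus expected: $surplus"
```

So kallisto reports about 5.56 million more than the trim_galore summary suggests.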
Any comment on my interpretation above would be much appreciated. Or, TL;DR: why is kallisto processing more reads than there are in the input FASTQ?
The nomenclature is often imprecise.
One of your tools seems to report pairs, the other reads. A pair contains two reads.
Try checking the read counts in your file(s) more directly.
(prone to error)
grep -e "^@" sample1_1.fq | wc -l
or for the zipped:
In all my runs, kallisto gives the read count in single-end mode and the pair count in paired-end mode.
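One way to check the counts more directly is to count lines and divide by four rather than matching on "@". This is only a sketch: `count_fastq_reads` is a hypothetical helper name, and it assumes standard four-line FASTQ records with no wrapped sequence or quality lines. The tiny demo file is made up to show why matching "^@" can overcount.

```shell
# Count FASTQ records as (number of lines) / 4.
# Assumption: every record is exactly four lines (no line wrapping).
count_fastq_reads() {
    case "$1" in
        *.gz) lines=$(gzip -dc "$1" | wc -l) ;;   # gzipped input
        *)    lines=$(wc -l < "$1") ;;            # plain-text input
    esac
    echo $((lines / 4))
}

# Demo file: two records; the second quality line starts with "@".
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n@III\n' > "$demo"

naive=$(grep -c '^@' "$demo")         # matches headers AND that quality line
robust=$(count_fastq_reads "$demo")   # counts records

echo "grep '^@': $naive, lines/4: $robust"
rm -f "$demo"
```

On the demo file, the grep count is 3 while the line-based count is 2. For the files in the question, the calls would be `count_fastq_reads sample1_1.fq` and `count_fastq_reads sample1_2.fq`; for properly paired files the two numbers should be equal, and that number is the pair count.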
Isn't it dangerous to check the read counts in a FASTQ file with "^@", given that quality lines may also start with "@"?
Noted. I fixed it to recommend a less error-prone way. The main point, though, is that the questioner needs to verify the files more directly, because I can't replicate the discrepancy.