Question: Kallisto read count processed does not tally
0
gravatar for solyris83
19 months ago by
solyris8310
solyris8310 wrote:

Hi,

I am running Kallisto quant with the following command

kallisto quant -i gencodeV27 -o sample1 --bias --rf-stranded --genomebam --gtf gencode.v27lift37.annotation.gtf --chromosomes chromSize.txt sample1_1.fq sample1_2.fq

And one particular summary line below caught my attention which says (correct me if I am wrong) that there is a total of 71,414,042 paired reads in my fastq files

[quant] processed 71,414,042 reads, 65,645,163 reads pseudoaligned

The processed reads number does not tally against the reads count I get from trim_galore. I am using trimmed fastq files for this testing and the summary reported number from trim_galore after processing is found below, which means that I have only 65,853,559 (= 66089021 - 235462) reads in my input fastq file for kallisto.

RUN STATISTICS FOR INPUT FILE: sample1.fastq.gz
=============================================
66089021 sequences processed in total

Total number of sequences analysed for the sequence pair length validation: 66089021

Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 235462 (0.36%)

Any comment on my interpretation above would be much appreciated. Or TLDR, why is Kallisto processing more reads than what is in the input fastq?

trim_galore kallisto • 1.2k views
ADD COMMENTlink modified 19 months ago • written 19 months ago by solyris8310

The nomenclature is often imprecise.

One of your tools seems to report pairs, the other reads. A pair contains two reads.

ADD REPLYlink written 19 months ago by Istvan Albert ♦♦ 81k

Try checking the read counts in your file(s) more directly.

grep -e "^@" sample1_1.fq | wc -l (prone to error)

 awk '{s++}END{print s/4}' sample1_1.fq

or for the zipped:

zcat sample1.fastq.gz | awk '{s++}END{print s/4}'

In all my runs, kallisto gives the read count in single-end mode and the pair count in paired-end mode.

ADD REPLYlink modified 19 months ago • written 19 months ago by mmfansler320
1

Isn't it dangerous to check the read counts in FASTQ with "^@", given than quality lines may also start with "@" ?

ADD REPLYlink written 19 months ago by erwan.scaon720

Noted. I fixed it to recommend a less error-prone way. Main point is, though, that the questioner needs to verify the files more directly, because I can't replicate his discrepancy.

ADD REPLYlink modified 19 months ago • written 19 months ago by mmfansler320
1
gravatar for solyris83
19 months ago by
solyris8310
solyris8310 wrote:

Hi, as recommended by mmfansler, I did a check on the raw fastq files with the shell statement, and as erwan rightly pointed out a lot more "reads" are counted which are basically sequence quality with first base being @.

The read counting from awk '{s++}END{print s/4}' sample1_1.fq tallies to the result from kallisto now.

This is quite surprising as the sequencing service provider actually provided a reads number generated for each of my sample and this read count tallies exactly with the trim galore number before trimming. And this reported number (~60 million reads pair) from service provider and trim galore is actually smaller than the counted number of lines from the fastq file (~70million reads for both fastq in the pair).

Maybe I need to dig a bit deeper into what filter has been applied before the number is reported from the service provider and maybe trim galore.

Thanks a lot for the help!

ADD COMMENTlink modified 19 months ago • written 19 months ago by solyris8310

I'm wondering if maybe some of your reads have line breaks in them. Technically, the FASTQ spec includes this possibility (like FASTA), and this could cause simple line counting to also be an inaccurate measure. However, it's conventional (and recommended) to have no line breaks in the seqs and quals in FASTQ.

How long are your reads? Have you also tried counting lines in the original file, pre-trimming?

I'm also curious to know how far off the "^@" approach was. Was it counting even higher than the 71M?

ADD REPLYlink written 19 months ago by mmfansler320
0
gravatar for solyris83
19 months ago by
solyris8310
solyris8310 wrote:

Hi mmfansler,

Reads are 100bp, I tried counting on the original reads below

Read 1 match on "^@" : 75,681,793 Read 1 count lines divide by 4 : 71,552,293

Read 2 match on "^@" : 77,581,579 Read 2 count lines divide by 4 : 71,552,293

ADD COMMENTlink modified 19 months ago • written 19 months ago by solyris8310
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 910 users visited in the last hour