Question

fastq file sub-sampling during per position quality score graphing


4.0 years ago

Anand Rao ▴ 630

QUESTION While generating per position quality score graphs, is it possible to suppress sub-sampling and report for an entire FASTQ input file?

BACKGROUND INFO I ask this because of my observation that both FastQC and FASTX_Toolkit sub-sample the input file, for reasons I assume are related to run-time?!

My conclusion is based on combining these 3 observations:

Observation 1. I ran EA-Utils' fastq-stats on the input, which reports a lowest Q-score of 5, as shown in its STDOUT below

fastq-stats SRR_BBsplit_Sm1021_Scrubbed.fq.gz 
reads   11269161
len 100
len mean    94.7008
len stdev   10.8264
len min 50
phred   33
window-size 2000000
cycle-max   35
dups    1999313
%dup    17.7415
unique-dup seq  205300
min dup count   2
dup seq     1   2660    CTTTTTTGCACACTGAGATCATTAAAGGACCTCAT
dup seq     2   1897    CTTAAATTAGGTGTTATAAATTTGAAGTTAAGGTG
dup seq     3   1049    CACAAGTCTACATACTTAAATTAGGTGTTATAAAT
dup seq     4   1007    CTTGGTTCTCCTCCACAACAACAGCCTTGTTGGGT
dup seq     5   835 CTACAAGTCACCTCCTCCTCCAACACCAGTTTACA
dup seq     6   756 CTTGTATACAGGTGATGGTGGAGGAGGTGACTTGT
dup seq     7   750 CTACAATTCACCACCTCCTCCAACACCAGTTTACA
dup seq     8   723 CTCATCTCAATGAACATAACATAACATAACAAAGA
dup seq     9   717 CTTGTACACGTAAGTTGGTGATGGTGGAGGTGGTG
dup seq     10  691 CTCTGCTTCAAGAGGCATATGATGCACTTCATTTG
dup mean    10.7385
dup stddev  18.9257
qual min    5
qual max    41
qual mean   37.5495
qual stdev  3.7967
%A  29.1245
%C  22.2476
%G  19.4687
%T  29.1593
%N  0.0000
total bases 1067198823

Observation 2. I ran FASTX-Toolkit on the input using instructions at http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

zcat SRR.fastq.gz | fastx_quality_stats -o SRR.fastx_stats
fastq_quality_boxplot_graph.sh -i SRR.fastx_stats -o SRR.fastx_stats.png -t "Test"

FASTX-Toolkit results image shown below

Observation 3. Finally I ran FastQC on the input (pretty standard)

fastqc SRR_BBsplit_Sm1021_Scrubbed.fq.gz

FASTQC results image shown below

Thank you! Stay sane! Stay safe!

PS. I do understand that sampling a large subset of reads (a few million) from a much larger file (tens or hundreds of millions of reads) is statistically very acceptable. I am not arguing against this established and valid convention. I'd simply like to know how my per position quality score graph would look if I were to not sub-sample at all, but look at the entire file. Is this currently possible with any off-the-shelf bioinformatics tool?




I'd like to know how my per position quality score graph would look if I were to not sample at all, but look at the entire file.

You used three different methods above, so they likely sampled your data in different ways. Results look about the same. So what do you think will happen if no sampling occurred? Assuming the original files have not been sorted/deduplicated/trimmed/otherwise changed.

ADD REPLY • link 4.0 years ago by GenoMax 141k


You used three different methods above, so they likely sampled your data in different ways. Results look about the same. So what do you think will happen if no sampling occurred? Assuming the original files have not been sorted/deduplicated/trimmed/otherwise changed.

GenoMax, thanks for replying. Yes, I suspect sampling might be performed differently by each tool, or even that random sub-sampling by the same tool might return slightly different results depending on the "seed" for randomness etc.

In any case, the text data from EA-Utils fastq-stats gives only aggregated min, max, stdev and mean - across all positions, not per position.

Furthermore, it does not return the interquartile range, quantiles or quartiles needed for the sort of graph FastQC or FASTX-Toolkit returns. So it's hard to predict how exactly the plot will change on a per position basis.

In the box and whiskers plot, if all data are included, I suppose the whiskers will extend out farther. The question is how much farther for EACH position! I am assuming it will NOT be Q=5 for each position. Maybe just at the 3' end?! Rather than assume, I want to see.

I predict :

Relatively speaking the median should be the most stable
1st and 3rd quartile will change, but probably not visibly,
and whiskers should change for some positions - they are edge cases... not sure WHICH positions - which is my curiosity...

So it'd be nice to visualize a density plot of Q scores for each position. Is this do-able by parsing output of some pre-existing tool?
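
One way to get at this without relying on any tool's plotting defaults is to tally the scores directly. The following is a minimal sketch, not part of any of the tools discussed here: the function name `per_position_quality_counts` is hypothetical, and it assumes a four-line-per-record FASTQ (no wrapped sequence lines) with Phred+33 encoding, which matches the `phred 33` line in the fastq-stats output above. It reads every record, so nothing is sub-sampled.

```python
import gzip
from collections import Counter, defaultdict


def per_position_quality_counts(path, phred_offset=33):
    """Tally quality scores at each read position over every record in a FASTQ file.

    Assumes four lines per record and Phred+33 encoding (both are assumptions).
    Returns a dict mapping 0-based position -> Counter of {quality score: count}.
    """
    opener = gzip.open if path.endswith(".gz") else open
    counts = defaultdict(Counter)
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # The fourth line of each record holds the quality string.
            if line_number % 4 == 3:
                for position, char in enumerate(line.rstrip("\n")):
                    counts[position][ord(char) - phred_offset] += 1
    return counts


if __name__ == "__main__":
    import sys

    histogram = per_position_quality_counts(sys.argv[1])
    for position in sorted(histogram):
        scores = histogram[position]
        # Print 1-based position with the true minimum and maximum at that position.
        print(position + 1, min(scores), max(scores), sep="\t")
```

The per-position counters are the raw material for a density plot or for any quantile you care about. Pure Python will be slow on a file of this size (about 1.07 billion bases according to fastq-stats), so treat this as an illustration of the idea rather than a tuned implementation.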

ADD REPLY • link 4.0 years ago by Anand Rao ▴ 630

score 0 · Answer 1 · 2020-04-30

ANSWER (based on email exchange with Simon Andrews @ Babraham Institute, UK - source of the FastQC software suite)

1. FastQC per position quality score graph and composition plots are based on the entire input file, without any sub-sampling.

2. The whiskers in the FastQC plot do NOT include the extremes, but exclude the lowest and highest deciles. Quoting Simon here -

The whiskers on the graph don’t represent the lowest/highest values, they’re the 10th and 90th percentiles of the full data, so it takes a reasonable proportion of the library to have rubbish scores at a particular position for the whisker to be skewed.

3. Should I want to include ALL the quality data per read position, I quote Simon again -

the only way to do that would be to change the values in the PerBaseQualityScores.java file and recompile.

4. This means EA-Utils' fastq-stats is correct and so is FastQC.

5. Though I have not heard back from the author of FASTX-Toolkit, I suspect the same thing is going on there as well.

Bottomline 1: The EA-Utils fastq-stats aggregate data summary is not inconsistent with the FastQC chart (and likely the same holds for FASTX_Toolkit as well)

Bottomline 2: Always be aware of the definition of whisker in a box plot; there are variations, see the examples listed on its wiki page!
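
To make the whisker point concrete, here is a small sketch of how a 10th percentile whisker can sit far above the true minimum at a position. The function names are hypothetical, the numbers are made up for illustration (they are not from the file above), and the percentile rule used here is a simple cumulative-count definition that may differ in detail from what FastQC computes internally.

```python
from collections import Counter


def percentile_from_counts(score_counts, fraction):
    """Return the smallest score whose cumulative count reaches `fraction` of the total."""
    total = sum(score_counts.values())
    if total == 0:
        raise ValueError("no observations")
    threshold = fraction * total
    running = 0
    for score in sorted(score_counts):
        running += score_counts[score]
        if running >= threshold:
            return score
    return max(score_counts)


def whisker_summary(score_counts):
    """Summarise one read position: true extremes next to decile whiskers and quartiles."""
    return {
        "min": min(score_counts),
        "p10": percentile_from_counts(score_counts, 0.10),
        "q1": percentile_from_counts(score_counts, 0.25),
        "median": percentile_from_counts(score_counts, 0.50),
        "q3": percentile_from_counts(score_counts, 0.75),
        "p90": percentile_from_counts(score_counts, 0.90),
        "max": max(score_counts),
    }


# Illustrative position: 5 bases at Q5 and 995 bases at Q38 (made-up counts).
example = Counter({5: 5, 38: 995})
print(whisker_summary(example))
```

In this made-up example the minimum is Q5, yet the lower whisker (10th percentile) is Q38, because only 0.5% of the bases at that position are poor. That is the same reason an aggregate `qual min 5` and a clean-looking box plot can both be correct.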