How to visualize sequence quality control distribution
Entering edit mode
8.3 years ago
faizlotfy ▴ 30


This is my first post to Biostars, so I apologize in advance if I make any mistake. I am also a n00b to bioinformatics, so please bear with me if I can't express my question(s) correctly.

Well, I have a fastq file and I want to write an R function to get quality scores for each sequence and summarize them using a boxplot, such that the distribution of sequencing quality scores can be visualized.

My problem is that I don't know how to do that? I know about Fastq format and that the 2nd line contains the sequence and that the 4th line contains the quality ASCII format. I also have the conversion table from an ascii character to the quality score, but my problem as I said is that I don't understand the question (at least the first part) and how to plot the summarizing.

What I could do till now is that I could format the file so I kept every pther fourth line, and them used awk (or grep) to find the frequency of each character in the file; but this would produce a histogram distribution not a boxplot.

I will appreciate your help.

sequence Fastq R • 5.3k views
Entering edit mode

If you have 50bp in a read (lets say), you have 50 quality scores. Do you want 1 quality metric for each read, or 50 quality scores independent of their position (your grep), or 50 quality scores linked to their position in the read, or 50 quality scores linked to the base that was sequenced?

Also the data that goes into a histogram is the same as what goes into a box plot :)

Entering edit mode

Thank you for your reply.

Since I do not know exactly what do all these choices mean, I would choose "one quality metric for each read." and if possible, the last choice, quality scores linked to the base that was sequenced. I know it sounds odd, but I promise to do my homework.

As a side question, for a boxplot, one would need some numeric data for each ASCII character, right? I mean, for example, the "H" character is repeated 234988 times, I understand this can be represented on a histogram (with other characters represented, which are 38 in my case) where on the y-axis, their frequencies will be represented. How would I get a boxplot for "H" knowing that it was repeated 234988 times?

Entering edit mode

You're going to do well at Bioinformatics - you're asking all the right sorts of questions and you're clearly thinking things through and not just asking for others to do it for you :) Trés cool.

But there are two things you should learn right away:

  1. Before you try and reinvent the wheel, particularly when you are just starting out, it pays to check if a solution already exists. The most popular tool in this space is FastQC, but it only takes BAM/SAM files I think. There are many others though, such as or Shell utilities for QC of FASTQ/FASTA files (based on perl one-liners) which are designed for FASTA files.
  2. All of the tools I can see for FASTA files link the quality to the base position or DNA & position in a graph like this:

(this shows the quality dropping as you go from the first base to the last [40th] base)

When I asked if you wanted a single quality score for the whole read, I must confess that was sort of a trick, because you'd only want that if you didnt really understand how the format of the read quality works - each ASCII letter in the quality row is paired with the sequenced base in the sequence row. Your options to deal with poor quality reads is drop them, or trim them from the ends. For this reason a single quality stat for each read is not so useful. You want something like the above so you can say - "ok, quality becomes unacceptable around bp 25, so i will trim all my reads to be 25bp in length". Having said that, for a very high-level overview of the data, a single 1D distribution of the data might be useful. The problem here is how do you take X quality scores in a single read and condense them into a single quality score? The ASCII value represent numbers, so thats OK - we can take the mean, median, mode, min, or max of these values as our single stat for the read - but which do we take? Then with that stat for each read, plot a distribution (as a histogram, or box plot, or whatever).

So point 2) is really - 'know your data' - because until you know what you have in the file, it is very difficult to ask questions that make sense.

Still, I think you'll do well in bioinformatics - not that my opinion means anything :P

Entering edit mode

Thank you very much for your encouraging words and sorry for late reply. I have been reading in genetics like crazy and also was reading more about some bioconductor packages that I can use in the future.

Thank you again for your time and advice.


Login before adding your answer.

Traffic: 3051 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6