Quality control and filtering of sequencing reads is one of the most important steps in the pre-processing of sequencing reads. However, it is not always trivial to figure out which reads needs adjustment and which can be left untouched. In this tutorial, we explain the basics of the Phred score concept and introduce important quality metrics used in a majority of quality control bioinformatics tools such as FASTQC. We also demonstrate how to understand and interpret these quality metrics. We use an open dataset from a recently published paper with an exciting theme: a study of a multi-drug resistance in Mycobacterium tuberculosis. This is a first tutorial from a set of tutorials on read quality improvement.
Researchers access information encoded in organisms’ DNA by sequencing their genomes. Special sequencing platforms take chemically processed DNA as an input and output digital files with genetic information. Importantly, sequencing platforms do not output entire organismal DNA as one long stretch, but instead produce a huge set of short fragments, each containing only a small fraction of genetic information. These fragments are called ‘sequencing reads’ and they are stored in so-called ‘FASTQ’ files.
Sequencing reads can be assembled de novo into a full genome or mapped to an already-assembled reference genome of a related organism. However, no sequencing technology is perfect and raw reads inevitably contain mistakes: sequencing errors. The probability of an error for each nucleotide of each read is always written in a FASTQ file. Therefore, the very first step of fragment analysis is quality control and filtering. This step aims to remove low quality reads.
Before moving on to practical bioinformatics work, let’s explore the structure of the FASTQ file and understand how information about read quality is encoded. FASTQ files always have 4 lines per sequence. The first line shows the sequence ID and an optional description. The second line contains a sequence of nucleotides. The third line generally holds only a/the “+” symbol and occasionally, the same ID and sequence description as the first line. The fourth line displays the quality score of each nucleotide shown on the second line.
The quality scores on the fourth line represent the probability of a sequencing error at each nucleotide position. If we know the probability of such an error, then it is easy to obtain a quality score using the following equation:
For example, if the probability of an error (p) equals 0.01, then the corresponding quality score will be 20; if p = 0.001, then Q=30. But, as we can see in the fourth line of the example, the FASTQ file does not contain any double-digit numbers and instead has only some odd characters. These are special ASCII characters that are used to encode quality values with a single symbol, rather than a double or triple digit. Such single character encoding is necessary because we want to have one-to-one correspondence between each nucleotide on line 2 and each score on line 4. This is only achievable with the use of a single character on both lines.
The first 32 items in the ASCII table are non-printing characters. Therefore, to assign a printing symbol to a quality value, we should start from the 33rd item. It means that the quality value of 1 will correspond to the symbol “!”, a quality value of 10 to “*”, and so on. This type of quality encoding is called Phred33 and is the most common way of storing sequence quality information. For example, Phred33 is used in the output files of Illumina sequencing platforms.
Now that we are familiar with the structure of FASTQ files and the concept of a Phred score, we can learn how to (1) assess the quality of DNA sequencing data, and (2) filter out low quality sequences in order to achieve better results in the downstream analysis.
As a matter of common courtesy, researchers publish source data to their papers, in order that a reader can check the results or do an additional analysis. We chose a recent paper with published data and an exciting theme: a study of a multi-drug resistance in Mycobacterium tuberculosis. We will play the role of bioinformatician working in the field of epidemiology and attempt to repeat the quality filtering performed in the paper. The InsideDNA platform will help us with the computations and give us hands-on experience of a large-scale production cluster.
Download raw reads Source data for this research can be found here. Our analysis won’t be as thorough as the original one, and we will work only on a single bacterial genome, rather than all 154 genomes. To download the source FASTQ files for the bacterial strain of interest, choose submitted FASTQ (ftp) files with experiment accession ERX509713. Note that you need a pair of files for each sample – the one with the forward reads, and the one with the reverse. Alternatively, you can obtain the necessary files from the ready-to-use archive by clicking on the ‘Download data’ button at the top of this tutorial.
Create new project and upload source data Log in to the InsideDNA application and navigate to the Files tab. Create a new folder called, “Tuberculosis”. Upload the files with raw reads into this folder. Now you are ready for genome processing.
Assess the quality of raw read data Now we need to assess the quality of the source data. The most convenient tool for this task is the FASTQC tool. This tool takes FASTQ files as input and returns an HTML report containing well-structured and graphically illustrated information about different aspects of the read quality. We will run the FASTQC tool using Console (i.e. web shell). This is the most common way of working for bioinformaticians, but you can also run most of the tools via a web UI. To launch Console, navigate to the Files tab and click on the Console button.
To run an analysis on the InsideDNA cloud-based cluster, we need to use a special wrapper called idna_submit.py. This wrapper takes care of computational resources and helps to distribute the workload and computations between compute nodes in the cloud. To correctly run idna_submit.py, you need to specify special parameters in the following format:
idna_submit.py [-t TASKNAME] [-c CPU] [-r RAM] [-e “COMMAND”]
The -e “COMMAND” should call the tool with all necessary parameters just like you would do from a normal console. Please, remember that all tools in InsideDNA must start from idna_ prefix (e.g. idna_bowtie2, idna_fastqc).To check FASTQC parameters, type the following into the console:
idna_submit.py -th idna_fastqc
FASTQC has many parameters, but in a simple case we only need to specify -o (path to output directory) and provide a list of FASTQ files on which we want to run quality control. In our example we should type the following:
idna_submit.py -t tub_fastqc -c 8 -r 7.2 -e "idna_fastqc /data/userXXX/Tuberculosis/2009-08_lib324_miseq_r0030_251bp_R1.fastq.gz -o /data/userXXX/Tuberculosis".
Here we have assigned the task name “tub_fastqc” and ran it using 8 cores CPU and 7.2 Gb RAM. Our FASTQ file is called 2009-08_lib324_miseq_r0030_251bp_R1.fastq.gz and is placed in the folder /data/userXXX/Tuberculosis. Please note that you should enter your own user ID instead of userXXX. This user ID is written in the header of the Console. Finally, the output files will be placed in the folder /data/userXXX/Tuberculosis.
Submit the task by pressing the Enter key.
You can monitor the progress of your task in the Tasks tab.
When the task is completed, navigate to the Files tab, and then to the folder you specified as the output location. The graphic report, which is the most interesting output for us, is the .html file. Download it, and open it in any browser.
The FASTQC report is divided into several sections. For example, you can find general information about number and length of reads in the first section, called, “Basic statistics”. The list of potentially undeleted adapters will be in the section, titled, “overrepresented sequences”. Detailed explanation related to the meaning of, and potential problems with, each section of the FASTQC report can be found on this page. We will discuss the quality of our tuberculosis dataset below.
You can navigate to the corresponding parts of the report by clicking on the items in the table of contents placed on the left. Each item in the table of contents is marked by a green tick, an orange exclamation mark, or a red cross. A green tick means that this aspect of the data is of satisfactory quality, an orange exclamation mark means that something is slightly abnormal, and a red cross suggests that the corresponding parameter of the source data is very unusual and needs correction. As you can guess, the red crosses are of the greatest interest for us. To improve the quality of our sequencing data, we need to fix the parameters that are marked by red crosses.
Let’s explore what is wrong with the data in our example. We have the first red cross in the section titled, “Per base sequence quality”. We can see that the quality of the reads is very low near the end of reads. After 100 nucleotides, the average quality of reads drops to a Phred score of 20 and even lower. This means — if we recall the definition of quality — that the probability of an error in at the end of reads is 1% or higher, which is considered to be unacceptable. So, we should cut the end of reads, starting from the 100th nucleotide, in order to discard data of questionable quality.
The second problem is low per-tile sequence quality. This means that reads originating from some tiles of sequencer have much worse quality than the rest. The reason for this could be the bubbles formed in these tiles, or other issues with the sequencing platform. We should discard reads originating from such problem tiles.
The next problem is unusual per-base sequence content. Much higher occurrence of certain nucleotides at the beginning of the reads might result from the remnants of adapters and other auxiliary sequences introduced by researchers. To deal with this problem we should screen the data for these kind of sequences.
Our data also contain questionable length distribution and k-mer content, and corresponding sections are marked with an orange exclamation mark. FASTQC raises a warning if not all read sequences have the same length, although for some sequencing platforms this is entirely normal — in which case a warning about read length could be ignored. The last warning about k-mers shows us a picture similar to the per-base sequence count, which means that in the beginnings of reads some 7-nucleotide sequences are much more common than others.
Don’t forget to assess the quality of both files of paired reads for each sample. The quality is not necessarily the same for the two paired files.
In our next tutorial, we will demonstrate how to fix all issues highlighted above. Stay tuned!