Reads cleaning and filtering is an important pre-processing step of the raw sequencing data. In our previous tutorial we explored the quality of the raw sequencing data and demonstrated how to correctly interpret results from the FastQC quality reports. Based on the results of the quality assessment, we will now do the cleaning and filtering of the sequencing reads. In this new tutorial we are going to filter and trim sequences using Trimmomatic and RemoveBadTiles tools. We explain how different parameters affect quality of filtering and show tricks to improve your data quality. We continue to use an open dataset from a recently published paper with an exciting theme: a study of a multi-drug resistance in Mycobacterium tuberculosis. This is a second tutorial from a set of tutorials on read quality improvement.
In our previous tutorial on FastQC, we explored the quality of the raw sequencing data. Based on the results of the quality assessment, we will now do the cleaning and filtering of the sequencing reads.
- Removing reads associated with bad tiles Our first step for improving the quality of the data is to remove reads originating from the problematic tiles. We will use a special custom tool called ‘remove bad tiles’ for this purpose. This tool takes unprocessed FASTQ files as an input, and it outputs FASTQ files in which any reads associated with bad tiles have been removed. Because remove bad tiles only accepts ungzipped input files, we should start by un-archiving our raw sequencing data.
We will first learn how to ungzip files in the command line. Navigate to the ‘Files’ tab and click on the ‘Console’ button. Navigate to the ‘Tuberculosis’ folder using the command ‘cd’. The cd command is a UNIX command that allows a user to change a directory. Into the Console, type:
and press enter.
To ungzip the file we should use the zcat command. The zcat command has the following syntax:
zcat [INPUT] > [PATH/NAME OF THE OUTPUT FILE].
Using our example data, we should type:
zcat 2009-08_lib324_miseq_r0030_251bp_R1.fastq.gz > /data/userXXX/Tuberculosis/2009-08_lib324_miseq_r0030_251bp_R1.fastq
Don’t forget to replace XXX with your own user ID. Your user ID is found in the header of the Console.
You can check that you have the FASTQ files ungzipped in the Tuberculosis folder, either in the File Manager or in the Console. The command that lists all the files in the current directory is ls. So, type into the Console:
Don’t forget to ungzip the second file of the paired-end reads.
Now we are ready to remove the reads associated with the bad tiles. In order to submit remove bad tiles to the InsideDNA cloud cluster, we will use the same wrapper that we previously used to run FastQC:
idna_submit.py [-t TASKNAME] [-c CPU] [-r RAM] [-e “COMMAND”]
The remove bad tiles tool has the following parameters:
idna_remove_bad_tiles –in [PATH/NAME OF THE INPUT FILE] –out [PATH/NAME OF THE OUTPUT FILE]
Therefore, in order to submit the analysis, you should type into the Console:
idna_submit.py -t tub_remove_bad_tiles -c 2 -r 7.5 -e "idna_remove_bad_tiles --in /data/userXXX/Tuberculosis/2009-08_lib324_miseq_r0030_251bp_R1.fastq --out /data/userXXX/Tuberculosis/Remove_bad_tiles_R1.fastq"
As usual, replace XXX with your own user ID.
Once you press enter, the task is submitted to the cloud-based cluster of InsideDNA.
When the task is completed (you can monitor task progress in the Tasks section), create a new FastQC report, to verify that the tile-associated filtering has worked. You will now see that the red cross in the section “Per tile quality” has turned into a green tick, suggesting that remove bad tiles has successfully fixed the tile issue.
- Fixing per base sequence quality and sequence content To deal with the per base sequence quality and any per base sequence content issues, we will use a popular tool called Trimmomatic. Trimmomatic can fix both issues in a single run. Again, we will use the InsideDNA submission script to run the analysis in a cloud-based cluster:
idna_submit.py [-t TASKNAME] [-c CPU] [-r RAM] [-e “COMMAND”]
For Trimmomatic tool the syntax is:
idna_trimmomatic [PE/SE] -encoding [PATH/NAME OF THE INPUT FILE OR FILES] [PATH/NAME OF THE OUTPUT FILES] [OPTIONS OF READS PROCESSING]
You can find detailed information about Trimmomatic here.
We will process both FASTQ files with the paired-end reads simultaneously. To turn the paired-end option on, we need to specify the parameter ‘PE’. Don’t forget that you need to input files that have been processed with the remove bad tiles tool and not the original, unprocessed files.
We will trim nucleotides that have bad quality at the end of the reads and will use 100th nucleotide position as the cropping point (parameter CROP:100). In addition, we will trim any so-called ‘overrepresented sequences’ from the ends of the reads. To clip them, we should specify parameter HEADCROP up to 15th nucleotide position (HEADCROP:15). We also need to screen our data for the remnants of auxiliary sequences (e.g. adaptors, contaminants, barcodes). We will use a default file called TruSeq3-PE.fa, which contains standard Illumina auxiliary sequence. You can use any additional file with your own sequences that you think may have contaminated the data. Clearly, you should first upload the file with such sequences to the project folder and enter the ‘ILLUMINACLIP:’ followed by the full file path (starting from data/userXXX/).
The next useful step is to trim both ends of the reads; this is because the ends of the reads generally have lower quality than the middle of the sequence. We won’t allow any long stretches of reads with quality below a certain threshold (in our case – 20 Phred score). To check this, we will use the option SLIDINGWINDOW. SLIDINGWINDOW means that for each consequent n nucleotide, where n is the size of the window, Trimmomatic will calculate an average Phred quality and cut that part of the read if the average quality in the window is lower than the threshold value. The command SLIDINGWINDOW:4:20 sets the size of the window to 4 nucleotides, and the threshold quality value to a minimum of 20. The last option we will use is called MINLEN, and it will discard any reads that became too short (shorter than 50 nucleotides) during the pre-processing.
Altogether, you should type in the Console:
idna_submit.py -t tub_trimmomatic -c 8 -r 7.2 -e "idna_trimmomatic PE -phred33 /data/userXXX/Tuberculosis/Remove_bad_tiles_R1.fastq /data/userXXX/Tuberculosis/Remove_bad_tiles_R2.fastq /data/userXXX/Tuberculosis/R1_paired.fq /data/userXXX/Tuberculosis/R1_unpaired.fq /data/userXXX/Tuberculosis/R2_paired.fq /data/userXXX/Tuberculosis/R2_unpaired.fq ILLUMINACLIP:/data/userXXX/Tuberculosis/TruSeq3-PE.fa:2:30:10 HEADCROP:15 CROP:100 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50"
XXX should be replaced by your user id.
When the task is completed, you will see four new files in the Tuberculosis folder. Files with the suffix “unpaired” contain reads that lost their pairs during Trimmomatic processing. For our downstream processing, we will use the reads for which we have both read pairs preserved.
To assess the quality of processed data you should run FastQC for the files with the index “paired”. You will see that there are no longer any red crosses in the FastQC report for these files, which suggests that our processing was successful.
Now we can use these reads for any downstream analysis. Congrats on finishing read cleaning and stay tuned!