Working at the Centre for Virus Research where researchers often do re-sequencing of viruses, I often get asked how to do reference assemblies and generate a consensus. This is a guide for those wanting to learn a few freely available command-line tools. I have used this tutorial to introduce several non-bioinformaticians to the command-line and have then gone on from there to teach them how to create bash scripts to automate their analyses for large numbers of datasets.
Here, I will assume that you are starting from paired-end reads in two fastq files (readset_R1.fastq readset_R2.fastq) and a reference in fasta format (ref.fa). I will show you how to use two different aligner (BWA and Stampy). This is not intended to be an extensive demonstration of all the different programs for cleaning reads and aligning them to a reference but an illustration of the tools that I have found most useful.
One of the first things you will want to do when you get a set of reads is check the quality of those reads. FastQC is a very useful tool for this with an intuitive graphical user interface. The following point are important to note when looking at the different tabs in FastQC as they may have implication for the downstream analyses:
In Basic Statistics check the Encoding. This is important as some software works with different versions of Illumina Encoding and require conversion.
The per base sequence quality is probably the most important tab to check as it tells you how good your reads are overall. It will provide you with information on how many bases to trim from the 5'- or 3'-end and will give you an idea of the overall quality and quality threshold to use for trimming.
The per base sequence content tells you whether you might still have primers or adaptors based on the non-random nucleotide content (zigzag at the beginning).
The latter in combination with the overrepresented sequences tab, will help you to determine whether you need any particular trimming. In this table, you usually find the sequence of your adaptors.
Using trim_galore will help clean the sequence but you may need to change some of the default settings depending on the particular library prep you have used. You can check in FastQC afterwards to determine whether the patterns have improved. trim_galore produces sequences of heterogenous lengths. It keeps by default reads above 20bp, I usually find this a bit to short so increase it to 80. It also uses a default quality Phred score of 20.
trim_galore --paired --length 80 readset_R1.fastq readset_R2.fastq
The output names will have changed with _val_1.fq and _val_2.fq
I rarely find that the sequences need any further trimming but sometimes it might be useful to use PRINSEQ to trim the 5' end. PRINSEQ deals with the pairs separately, so there is a risk that the order of the reads may go out of sync and this could cause trouble downstream for some aligners that expect the reads in the two files to be identical and in the same order.
prinseq -fastq readset_R1_val_1.fq -trim_left 20 prinseq -fastq readset_R2_val_2.fq -trim_left 20
The output files will have a _prinseq_good_xxxx.fastq added.
The names of files start getting very long so sometimes it is useful to rename files to shorten and simplify them:
mv readset_R1_val_1_prinseq_good_yW3o.fastq readset_clean_R1.fq mv readset_R2_val_2_prinseq_good_BkEF.fastq readset_clean_R2.fq
Now that you have good quality reads, you can prepare the reference files for indexing with BWA. At this stage, you might want to check that the names of your sequences in the fasta reference file make sense, as this will be the name given to your consensus sequences later. Creating an index/database of your reference (this speeds up the mapping).
bwa index -a is ref.fa
Paired-end reads are aligned separately and then combined:
bwa aln ref readset_clean_R1.fq > readset_R1.sai bwa aln ref readset_clean_R2.fq > readset_R2.sai
Then you need to combine the information from the two files:
bwa sampe ref readset_R1.sai readset_R2.sai readset_clean_R1.fq readset_clean_R2.fq > readset_ref_bwa.sam
samtools view -S readset_ref_bwa.sam -b -o readset_ref_bwa.bam
or to remove unmapped reads at this stage
samtools view -F 4 -Sbh readset_ref_bwa.sam > readset_ref_bwa.bam
samtools sort readset_ref_bwa.bam readset_ref_bwa samtools index readset_ref_bwa.bam
bam2fastq --no-aligned --force --strict -o readset_ref_unmapped#.fq readset_ref_bwa.bam
Create the consensus. BEWARE samtools has changed a lot and these commands no longer work. The "bcftools call" command that replaces it produces unexpected results.
samtools mpileup -uf ref.fa readset_ref_bwa.bam | bcftools view -cg - | vcfutils.pl vcf2fq > readset_ref_bwa_cons.fq
Increase sensitivity of samtools using -E especially when you have multiple nucleotide polymorphisms, for example if you have virus intra-host diversity
samtools mpileup -E -uf ref.fa readset_ref_bwa.bam | bcftools view -cg - | vcfutils.pl vcf2fq > readset_ref_bwa_E_cons.fq
Convert the fastq format to fasta using seqret from emboss
seqret -osformat fasta readset_ref_bwa_E_cons.fq -out2 readset_ref_bwa_E_cons.fa
Stampy is another alignment tool which allows variation (particularly insertion/deletions) and thus can be used if you do not have a closely related reference. Create stampy reference and index:
stampy -G ref ref.fa
stampy -g ref -H ref
Stampy does mapping of paired-end reads together and you can set the divergence for mapping to a foreign reference with --substitutionrate option (default 0.001):
stampy -g ref -h ref -M readset_clean_R1.fq,readset_clean_R2.fq -o readset_stampy.sam -f sam --substitutionrate=0.05
Create bam file and consensus as before:
samtools view -S readset_stampy.sam -b -o readset_stampy.bam
samtools sort readset_stampy.bam readset_stampy
samtools index readset_stampy.bam
bam2fastq --no-aligned --force --strict -o readset_stampy_unmapped#.fq readset_stampy.bam
Be careful with the following commands as samtools and bcftools has changed a lot since I wrote this tutorial:
samtools mpileup -E -uf ref.fa readset_stampy.bam | bcftools view -cg - | vcfutils.pl vcf2fq > readset_ref_stampy_E_cons.fq
seqret -osformat fasta readset_ref_stampy_E_cons.fq -out2 readset_ref_stampy_E_cons.fa
Once my students are familiar with using these command lines, I start teaching them how to put the commands together in a bash script to create an automated pipeline. Examples of these will come in a later tutorial.