Tutorial:Reference Assembly - Mapping Reads To A Reference Genome
Entering edit mode
7.8 years ago
Joseph Hughes ★ 2.9k

Working at the Centre for Virus Research where researchers often do re-sequencing of viruses, I often get asked how to do reference assemblies and generate a consensus. This is a guide for those wanting to learn a few freely available command-line tools. I have used this tutorial to introduce several non-bioinformaticians to the command-line and have then gone on from there to teach them how to create bash scripts to automate their analyses for large numbers of datasets.

Here, I will assume that you are starting from paired-end reads in two fastq files (readset_R1.fastq readset_R2.fastq) and a reference in fasta format (ref.fa). I will show you how to use two different aligner (BWA and Stampy). This is not intended to be an extensive demonstration of all the different programs for cleaning reads and aligning them to a reference but an illustration of the tools that I have found most useful.

Quality Control

One of the first things you will want to do when you get a set of reads is check the quality of those reads. FastQC is a very useful tool for this with an intuitive graphical user interface. The following point are important to note when looking at the different tabs in FastQC as they may have implication for the downstream analyses:

  1. In Basic Statistics check the Encoding. This is important as some software works with different versions of Illumina Encoding and require conversion.

  2. The per base sequence quality is probably the most important tab to check as it tells you how good your reads are overall. It will provide you with information on how many bases to trim from the 5'- or 3'-end and will give you an idea of the overall quality and quality threshold to use for trimming.

  3. The per base sequence content tells you whether you might still have primers or adaptors based on the non-random nucleotide content (zigzag at the beginning).

  4. The latter in combination with the overrepresented sequences tab, will help you to determine whether you need any particular trimming. In this table, you usually find the sequence of your adaptors.


Using trim_galore will help clean the sequence but you may need to change some of the default settings depending on the particular library prep you have used. You can check in FastQC afterwards to determine whether the patterns have improved. trim_galore produces sequences of heterogenous lengths. It keeps by default reads above 20bp, I usually find this a bit to short so increase it to 80. It also uses a default quality Phred score of 20.

trim_galore --paired --length 80 readset_R1.fastq readset_R2.fastq

The output names will have changed with _val_1.fq and _val_2.fq

I rarely find that the sequences need any further trimming but sometimes it might be useful to use PRINSEQ to trim the 5' end. PRINSEQ deals with the pairs separately, so there is a risk that the order of the reads may go out of sync and this could cause trouble downstream for some aligners that expect the reads in the two files to be identical and in the same order.

prinseq -fastq readset_R1_val_1.fq -trim_left 20

prinseq -fastq readset_R2_val_2.fq -trim_left 20

The output files will have a _prinseq_good_xxxx.fastq added.

The names of files start getting very long so sometimes it is useful to rename files to shorten and simplify them:

mv readset_R1_val_1_prinseq_good_yW3o.fastq readset_clean_R1.fq

mv readset_R2_val_2_prinseq_good_BkEF.fastq readset_clean_R2.fq

BWA assembly

Now that you have good quality reads, you can prepare the reference files for indexing with BWA. At this stage, you might want to check that the names of your sequences in the fasta reference file make sense, as this will be the name given to your consensus sequences later. Creating an index/database of your reference (this speeds up the mapping).

bwa index -a is ref.fa

Paired-end reads are aligned separately and then combined:

bwa aln ref readset_clean_R1.fq > readset_R1.sai

bwa aln ref readset_clean_R2.fq > readset_R2.sai

Then you need to combine the information from the two files:

bwa sampe ref readset_R1.sai readset_R2.sai readset_clean_R1.fq readset_clean_R2.fq > readset_ref_bwa.sam

The next steps involve creating a bam files which is a binary sam file (takes less space). Samtools is a critical tool in the arsenal for dealing with the alignment file (sam file).

samtools view -S readset_ref_bwa.sam -b -o readset_ref_bwa.bam

or to remove unmapped reads at this stage

samtools view -F 4 -Sbh readset_ref_bwa.sam > readset_ref_bwa.bam

You can open up the bam file mapped onto the reference in tablet and visualize the alignment but first you need to sort it and create and index of the bam file

samtools sort readset_ref_bwa.bam readset_ref_bwa

samtools index readset_ref_bwa.bam

If you want to export the unmapped reads, I use bam2fastq. The # is needed when dealing with paired-end to export the unmapped reads from the bam assembly

bam2fastq --no-aligned --force --strict -o readset_ref_unmapped#.fq readset_ref_bwa.bam

Create the consensus. BEWARE samtools has changed a lot and these commands no longer work. The "bcftools call" command that replaces it produces unexpected results.

samtools mpileup -uf ref.fa readset_ref_bwa.bam | bcftools view -cg - | vcfutils.pl vcf2fq > readset_ref_bwa_cons.fq

Increase sensitivity of samtools using -E especially when you have multiple nucleotide polymorphisms, for example if you have virus intra-host diversity

samtools mpileup -E -uf ref.fa readset_ref_bwa.bam | bcftools view -cg - | vcfutils.pl vcf2fq > readset_ref_bwa_E_cons.fq

Convert the fastq format to fasta using seqret from emboss

seqret -osformat fasta readset_ref_bwa_E_cons.fq -out2 readset_ref_bwa_E_cons.fa

Stampy assembly

Stampy is another alignment tool which allows variation (particularly insertion/deletions) and thus can be used if you do not have a closely related reference. Create stampy reference and index:

stampy -G ref ref.fa

Hash table:

stampy -g ref -H ref

Stampy does mapping of paired-end reads together and you can set the divergence for mapping to a foreign reference with --substitutionrate option (default 0.001):

stampy -g ref -h ref -M readset_clean_R1.fq,readset_clean_R2.fq -o readset_stampy.sam -f sam --substitutionrate=0.05

Create bam file and consensus as before:

samtools view -S readset_stampy.sam -b -o readset_stampy.bam

samtools sort readset_stampy.bam readset_stampy

samtools index readset_stampy.bam

bam2fastq --no-aligned --force --strict -o readset_stampy_unmapped#.fq readset_stampy.bam

Be careful with the following commands as samtools and bcftools has changed a lot since I wrote this tutorial:

samtools mpileup -E -uf ref.fa readset_stampy.bam | bcftools view -cg - | vcfutils.pl vcf2fq > readset_ref_stampy_E_cons.fq 

seqret -osformat fasta readset_ref_stampy_E_cons.fq -out2 readset_ref_stampy_E_cons.fa

Once my students are familiar with using these command lines, I start teaching them how to put the commands together in a bash script to create an automated pipeline. Examples of these will come in a later tutorial.

mapping bwa samtools reference Tutorial • 44k views
Entering edit mode

Thank you for the useful information. I would like to ask how do we do a quality control analysis of the mapped reads to the reference genome. Once the reads are mapped I would like to know if we can check the number of reads that mapped to the reference genome both the individual bam file that got mapped also the combined bam file that got mapped to the ref genome. Is there any specific tool to do that? I am aware of FastQC but I will be working in my cluster so any suggestions which I can use to find the number of mapped reads with command line script that can be used on the mapped bam files to assess the number of reads that mapped. Please let me know if you can suggest anything.

Entering edit mode

Thanks. Very helpful for me. I am now going from SRA to variant calling, I have to do quality control for each short reads data file from SRA. If you have more of your expertise, please share here. +1

Entering edit mode

Thank you for the tutorial. However, I am unable to follow after samtools sort possibly due to different version. I am relatively new to bioinformatics, previously done de novo genome assembly but now would like to assemble based on a reference to compare, is it possible for you to update your tutorial? Or maybe point me to a better one for reference-based assemblies? My data is paired-end reads of a bacteria genome. Thanks in advance!

Entering edit mode
5.9 years ago
eden.maloney ▴ 10

Thank you for the tutorial! What code should replace:

samtools mpileup -uf ref.fa readset_ref_bwa.bam | bcftools view -cg - | vcfutils.pl vcf2fq > readset_ref_bwa_cons.fq

for samtools-1.2?


Entering edit mode

This is the answer for updated samtools and a new macbook pro

bwa index -a bwtsw 1000.fasta

  bwa aln 1000.fasta ERR244000_1.fastq > aln_sa1b.sai

  bwa aln 1000.fasta ERR244000_2.fastq > aln_sa2b.sai

  bwa sampe 1000.fasta aln_sa1b.sai aln_sa2b.sai ERR244000_1.fastq ERR244000_2.fastq > readset_refb_bwa.sam

  samtools view -bS -o readset_refb_bwa.bam readset_refb_bwa.sam

  samtools sort readset_refb_bwa.bam readset_refb_bwa.sorted

  samtools index readset_refb_bwa.sorted.bam

  samtools mpileup -d8000 -uf 1000.fasta readset_refb_bwa.sorted.bam | bcftools call -c | vcfutils.pl vcf2fq > cnsnew.fq

Entering edit mode

Unfortunately that command doesn't produce the majority consensus in the same way that it use to when I wrote this tutorial. I have been unable so far to find the command that reproduces the consensus calling from this tutorial. Also keep in mind that INDELs are not handled in any of these consensus calling commands.

Entering edit mode

Dear Joseph, I'm aware of the date of your original post, but anyways: did you manage to find a replacement for the original pipeline (the one using bcftools view -cg -)? I'm now struggling with similar task and following pipeline using bcftools call:

samtools mpileup -ABuf reference.fasta mapped.bam | bcftools call -c | vcfutils.pl vcf2fq > mapped.fasta

gives me the "best" consensus I can get (i.e. looking as I would expect the consensus to look like when visually expecting the bam file in e.g. IGV). But still I need to tune the criteria for the consensus more - e.g. the above command gives consensus which shows reference nucleotides for regions with read coverage 1 while I'd like to include nucleotides coming from even a single read into the consensus. I would greatly appreciate any suggestions.


Login before adding your answer.

Traffic: 2195 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6