Tutorial:Tutorial (How to analyze) on Whole Exome sequencing. Common Errors. Best Practices.
Entering edit mode
9.1 years ago
Chirag Nepal ★ 2.4k

I have tried to document general workflow and potential errors you might come at each step. All these steps are pre-requisite if you want to run GATK/muTect, while not necessary for other SNV tools (Varscan2, SomaticSnipper, Shimmer). After you perform these steps, one can input BAM files into various SNV tools such as Varscan2, muTect, SomaticSnipper, Shimmer. There are many other tools, but these are the tools I used.

Step 1

Map Pair-end FASTQ files using BWA. While mapping the reads, it might be possible the Read Group (RG) information may not be provided. If you want to use GATK, this is compulsory. RG can be inserted in bwa command with -r '@RG\tID:foo\tSM:bar', or you could insert during pre-processing of BAM files using PicardTools. "I had already mapped BAM files, hence I inserted RG info using PicardTools. I used dummy values in RG files, so as to bypass the error."

Step 2

If you have multiple files, loop all the file names.

for name in Sample_ST25_sorted.bam Sample_ST26_sorted.bam


Step 2.1

Picard does not accept the assigned flag (from BWA) for unmapped reads SAM/BAM from BWA. I fixed simply by removing the unmapped reads, as I am focused only on unique mapping reads. However, you could also change the flag and proceed.

For detail discussions see this post on Biostars-: Final Solution For "Mapq Should Be 0 For Unmapped Read."

samtools view -bF 4 -q 1 $name > $RealignedBamDir/$output1

Step 2.2:

Fix the reads and sort the bam files based on their coordinates. Even though I had sorted file from samtools, it gave me errors while running on Picard. So it is recommended to use Picard to sort BAM files. It is good to provide the option TMP_DIR, instead of using the system temp_dir, since you might an error for large files. Rule of thumb: Always use VALIDATION_STRINGENCY=STRICT.

java -Xmx20g -XX:ParallelGCThreads=8 -Djava.io.tmpdir=`pwd`/tmpA25 -jar $picard/FixMateInformation.jar \
I=$RealignedBamDir/$output1 \
O=$RealignedBamDir/$output2 \
TMP_DIR=`pwd`/tmpA25       \

Step 2.3

Reorder bam files in the order of fasta

java -Xmx8g -XX:ParallelGCThreads=8 -Djava.io.tmpdir=`pwd`/tmpA25A -jar $picard/ReorderSam.jar \
I=$RealignedBamDir/$output2 \
O=$RealignedBamDir/$output3 \
R=$Reference \
TMP_DIR=`pwd`/tmpA25A       \


You might encounter this error "CIGAR M operator maps off end of reference". If such error appears, Clean BAM. These errors are generally found when certain reads map to the end of chromosomes and beyond, or if the reference assembly used is different. For eg, see this post Should all the HG19 reference be the same? to see how the UCSC assembly and Ensembl assembly differ in chrM.

java -Xmx8g -XX:ParallelGCThreads=8  -jar $picard/CleanSam.jar \
I=$RealignedBamDir/$output3 \

Step 2.5

Mark duplicates

 java -Xmx8g -XX:ParallelGCThreads=8  -jar $picard/MarkDuplicates.jar \
I=$RealignedBamDir/$output4 \
O=$RealignedBamDir/$output5 \
M=$RealignedBamDir/File$name \

Step 2.6

Add ReadGroup (RG). If RG info are provided during mapping, this step can be excluded. Here, I have stored dummy variables.

java -Xmx8g -XX:ParallelGCThreads=8  -jar $picard/AddOrReplaceReadGroups.jar \
I=$RealignedBamDir/$output5 \
O=$RealignedBamDir/$output6 \
RGID=group1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1

Step 2.7

Create BAM index

java -Xmx8g -XX:ParallelGCThreads=8  -jar $picard/BuildBamIndex.jar I=$RealignedBamDir/$output6

Step 2.8

Create List of all INDELs

java -Xmx8g -XX:ParallelGCThreads=8  -jar $GATK/GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R $Reference \
-I $RealignedBamDir/$output6 \
-known $Indel1000G \
-o $RealignedBamDir/$intervalList

Step 2.9

Make new realigned BAM files

   java -Xmx8g -XX:ParallelGCThreads=8 -jar $GATK/GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R $Reference \
    -I $RealignedBamDir/$output6 \
    -targetIntervals $RealignedBamDir/$intervalList \
    -known $Indel1000G \
    -o $RealignedBamDir/$output7

Step 2.10

Create BAM index of realigned BAM files

java -Xmx8g -XX:ParallelGCThreads=8  -jar $picard/BuildBamIndex.jar I=$RealignedBamDir/$output7


Step 3

Now the BAM files can be feed to any SNV tools, listed above or other existing tools of your choice. If you further want to re-process your data, you could add recalibration step from GATK and have the final recalibrated BAM. If you do the recalibration step, it is always nice to compare the SNV list, before and after recalibration.</h3>

Step 4

Run SNV tools on each BAM files.

The overlap (somatic mutations) between various SNV tools are notoriously low (as it seems from my experience and few published articles), I would recommend using at least two tools.

Step 4.1

Run muTect. You might encounter an error with newer version of Java. For some unknown reasons, I could run muTect only with older version, i.e, Java6.

/usr/lib/jvm/java-6-openjdk-amd64/bin/java -jar $muTect \
  --analysis_type MuTect \
  --reference_sequence $sequence \
  --cosmic $cosmicData \
  --dbsnp $dbsnp138 \
  --input_file:normal $P1N \
  --input_file:tumor $P1T \
  --out P1Out \
  --coverage_file P1.wig

Now, you can filter the output by using awk command:

cat P1Out  | grep -v REJECT > P1Out.filtered

Step 4.2 Run Varscan2

Filter high confidence prediction

Step 4.3 Run Somatic Sniper

Filter high-confidence prediction


Merge the somatic calls for all tools and see how much overlap between these tools. Do the downstream analysis.

muTect GATK Picard Exome • 13k views
Entering edit mode

it's rather de-moralizing that it requires this many steps to get a usable bam. that's one reason to prefer samtools or freebayes (or platypus) variant callers.

Entering edit mode

I don't think using alternative variant callers really has much impact on pre-processing steps. And some of them can be easily combined or skipped except cases where they fail. But marking duplicate reads, ensuring rads are sorted in the right order, etc are all necessary (or at least highly recommended) regardless of what variant caller you ultimately want to use. I've seen some suggestions where FreeBayes and Platypus weren't as sensitive to needing to do local realignment as UnfiedGenotyper or HaplotypeCaller but it was still recommended.

Entering edit mode

Don't forget to add -M argument to bwa-mem, if you are using picard as downstream tool.

Entering edit mode
9.1 years ago
rbagnall ★ 1.8k

Nice summary, although recent of Picard has change how tools are invoked


Entering edit mode
9.1 years ago
Sam ★ 4.7k

From our experience, it is usually better to perform piping between commands as the IO burden to the server can be huge when you try to perform multiple of these analysis. An example will be:

$bwa mem -M -t $thread -R "@RG\tID:$id\tPL:$platform\tPU:"$sampleName"_$processDay\tSM:$sampleName" $refPre $read1 $read2 | $samtools sort -o - - |  $samtools view -bSh - > "$sampleName".sorted.bam

This command can perform the alignment and sorting and creating the bam required for Markduplication (might need double check as I haven't use this update command yet). By saving up the IO burden, you can avoid things like crashing the server etc. I think Brad is trying to do something along this line with his bcbio

Entering edit mode

I wrote a guide (Piping With Samtools, Bwa And Bedtools) a while back on how to pipe and save on I/O. Keep in mind though that some tasks (esp those with slow steps) might be better done in parallel rather than pipe

Entering edit mode
9.1 years ago

May I suggest you to include this information in GitHub so we can fork into it?

Entering edit mode

The new Biostars (Biostars 3.0) will have the option of storing larger posts (such as the above) in the github repository so that one can contribute to them

Entering edit mode
9.0 years ago
iraun 6.1k

Hi all,

Thanks Chirag for the post, it very useful. I was wondering, as Dan Gaston has pointed out, how can some of those step can be skipped. I want to put all the steps in a script, but I want to skip as much as possible to increase execution speed. So, for example, for step 2.6 (add read group), it is possible to check if the bam contains RG with a simple command:

if samtools view -H FILE.bam | grep --quiet '^@RG'; then echo "RG included. Go to 2.7; else; Step 2.6; fi

There is anyway to avoid for example step 2.4 (CIGAR M), without executing it? or any other step?

Entering edit mode

In this tutorial i have tried to document all possible steps where one might get an error, and it is not necessary that everybody will have all the errors.

For eg: If one provide RG info while mapping, that step can be skipped. Similarly, if one has created BAM files themselves, it is almost certain that you won't have "CIGAR M "error. So, If you want to exclude the step 2.4,. you could simply pass the output from step 2.3 to 2.5.

Entering edit mode

Yes, I know that you've documented all possible steps with errors. It's nice :).
One thing, what do you mean with "f one has created BAM files themselves"?

Entering edit mode

Sometimes, we get processed BAM files from others. And it is possible that different version of reference assembly, such hg19.fa, UCSChg39.fa, could probably lead to this error.

Entering edit mode

Ok, understood. Thank you Chirag.


Login before adding your answer.

Traffic: 1417 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6