Tutorial: Tutorial (How to analyze) on Whole Exome sequencing. Common Errors. Best Practices.
39
gravatar for Chirag Nepal
4.6 years ago by
Chirag Nepal2.2k
Copenhagen
Chirag Nepal2.2k wrote:

I have tried to document general workflow and potential errors you might come at each step. All these steps are pre-requisite if you want to run GATK/muTect, while not necessary for other SNV tools (Varscan2, SomaticSnipper, Shimmer). After you perform these steps, one can input BAM files into various SNV tools such as Varscan2, muTect, SomaticSnipper, Shimmer. There are many other tools, but these are the tools i used.

Step 1

Map Pair-end FASTQ files using BWA. While mapping the reads, it might be possible the Read Group (RG) information may not be provided. If you want to use GATK, this is compulsory. RG can be inserted in bwa command with -r '@RG\tID:foo\tSM:bar', or you could insert during pre-processing of BAM files using PicardTools." I had already mapped BAM files, hence i inserted RG info using PicardTools. I used dummy values in RG files, so as to bypass the error."

Step 2

If you have multiple files, loop all the file names.

for name in Sample_ST25_sorted.bam Sample_ST26_sorted.bam
do
    output1=${name%_sorted.bam}_Filtered.bam
    output2=${name%_sorted.bam}_Filtered_SortedFixed.bam
    output3=${name%_sorted.bam}_Filtered_SortedFixed_Reorder.bam
    output4=${name%_sorted.bam}_Filtered_SortedFixed_Reorder_Clean.bam
    output5=${name%_sorted.bam}_Filtered_SortedFixed_Reorder_Clean_MarkDup.bam
    output6=${name%_sorted.bam}_Filtered_SortedFixed_Reorder_Clean_MarkDup_AddRG.bam
    output7=${name%_sorted.bam}_IndelRealigned.bam

    intervalList=${name%_sorted.bam}_Interval.list

Step 2.1

Picard does not accept the assigned flag (from BWA) for unmapped reads SAM/BAM from BWA. I fixed simply by removing the unmapped reads, as i am focused only on unique mapping reads. However, you could also change the flag and proceed.

For detail discussions see this post on Biostars-: Final Solution For "Mapq Should Be 0 For Unmapped Read."

 samtools view -bF 4 -q 1 $name > $RealignedBamDir/$output1

Step 2.2:

Fix the reads and sort the bam files based on their coordinates. Even though i had sorted file from samtools, it gave me errors while running on Picard. So it is recommended to use Picard to sort BAM files. It is good to provide the option "TMP_DIR", instead of using the system temp_dir, since you might an error for large files. Rule of thumb: Always use VALIDATION_STRINGENCY=STRICT".

java -Xmx20g -XX:ParallelGCThreads=8 -Djava.io.tmpdir=`pwd`/tmpA25 -jar $picard/FixMateInformation.jar \
I=$RealignedBamDir/$output1 \
O=$RealignedBamDir/$output2 \
MAX_RECORDS_IN_RAM=2000000 \
TMP_DIR=`pwd`/tmpA25       \
 SO=coordinate VALIDATION_STRINGENCY=STRICT

Step 2.3

Reorder bam files in the order of fasta

java -Xmx8g -XX:ParallelGCThreads=8 -Djava.io.tmpdir=`pwd`/tmpA25A -jar $picard/ReorderSam.jar \
I=$RealignedBamDir/$output2 \
O=$RealignedBamDir/$output3 \
R=$Reference \
TMP_DIR=`pwd`/tmpA25A       \
VALIDATION_STRINGENCY=STRICT

Step-2.4

You might encounter this error "CIGAR M operator maps off end of reference". If such error appears, Clean BAM. These errors are generally found when certain reads map to the end of chromosomes and beyond, or if the reference assembly used is different. For eg, see this post Should all the HG19 reference be the same? to see how the UCSC assembly and Ensembl assembly differ in chrM.

java -Xmx8g -XX:ParallelGCThreads=8  -jar $picard/CleanSam.jar \
I=$RealignedBamDir/$output3 \
O=$RealignedBamDir/$output4

Step 2.5

Mark duplicates

 java -Xmx8g -XX:ParallelGCThreads=8  -jar $picard/MarkDuplicates.jar \
I=$RealignedBamDir/$output4 \
O=$RealignedBamDir/$output5 \
M=$RealignedBamDir/File$name \
VALIDATION_STRINGENCY=STRICT

Step 2.6

Add ReadGroup (RG). If RG info are provided during mapping, this step can be excluded. Here, i have stored dummy variables.

java -Xmx8g -XX:ParallelGCThreads=8  -jar $picard/AddOrReplaceReadGroups.jar \
I=$RealignedBamDir/$output5 \
O=$RealignedBamDir/$output6 \
RGID=group1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1

Step 2.7

Create BAM index

 java -Xmx8g -XX:ParallelGCThreads=8  -jar $picard/BuildBamIndex.jar I=$RealignedBamDir/$output6

Step 2.8

Create List of all INDELs"

java -Xmx8g -XX:ParallelGCThreads=8  -jar $GATK/GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R $Reference \
-I $RealignedBamDir/$output6 \
-known $Indel1000G \
-o $RealignedBamDir/$intervalList

Step 2.9

Make new realigned BAM files

   java -Xmx8g -XX:ParallelGCThreads=8 -jar $GATK/GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R $Reference \
    -I $RealignedBamDir/$output6 \
    -targetIntervals $RealignedBamDir/$intervalList \
    -known $Indel1000G \
    -o $RealignedBamDir/$output7

Step 2.10

Create BAM index of realigned BAM files

 java -Xmx8g -XX:ParallelGCThreads=8  -jar $picard/BuildBamIndex.jar I=$RealignedBamDir/$output7

done

 

Step 3: Now the BAM files can be feed to any SNV tools, listed above or other existing tools of your choice. If you further want to re-process your data, you could add recalibration step from GATK and have the final recalibrated BAM. If you do the recalibration step, it is always nice to compare the SNV list, before and after recalibration.

 

Step 4: Run SNV tools on each BAM files.

The overlap (somatic mutations) between various SNV tools are notoriously low (as it seems from my experience and few published articles), i would recommend using at least two tools.

 

# Step4.1 Run muTect.  You might encounter an error with newer version of Java. For some unknown reasons, i could run muTect only with older version, i.e, Java6.

/usr/lib/jvm/java-6-openjdk-amd64/bin/java -jar $muTect --analysis_type MuTect --reference_sequence $sequence --cosmic $cosmicData --dbsnp $dbsnp138 --input_file:normal $P1N --input_file:tumor $P1T --out P1Out --coverage_file P1.wig

Now, you can filter the output by using awk command:

cat P1Out  | grep -v REJECT > P1Out.filtered

 

# Step4.2 Run Varscan2

Filter high confidence prediction

 

# Step4.3 Run Somatic Snipper

Filter high-confidence prediction

 

# Step-5

Merge the somatic calls for all tools and see how much overlap between these tools. Do the downstream analysis.

 

 

 

mutect picard tutorial gatk exome • 9.7k views
ADD COMMENTlink modified 3.7 years ago • written 4.6 years ago by Chirag Nepal2.2k
2

it's rather de-moralizing that it requires this many steps to get a usable bam. that's one reason to prefer samtools or freebayes (or platypus) variant callers.

ADD REPLYlink written 4.6 years ago by brentp23k

I don't think using alternative variant callers really has much impact on pre-processing steps. And some of them can be easily combined or skipped except cases where they fail. But marking duplicate reads, ensuring rads are sorted in the right order, etc are all necessary (or at least highly recommended) regardless of what variant caller you ultimately want to use. I've seen some suggestions where FreeBayes and Platypus weren't as sensitive to needing to do local realignment as UnfiedGenotyper or HaplotypeCaller but it was still recommended.

ADD REPLYlink written 4.6 years ago by Dan Gaston7.1k

Don't forget to add -M argument to bwa-mem, if you are using picard as downstream tool.

ADD REPLYlink written 4.6 years ago by poisonAlien2.8k
4
gravatar for rbagnall
4.6 years ago by
rbagnall1.4k
Australia
rbagnall1.4k wrote:

Nice summary, although recent of Picard has change how tools are invoked

https://www.broadinstitute.org/gatk/blog?id=4786

 

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by rbagnall1.4k
4
gravatar for Sam
4.6 years ago by
Sam2.3k
New York
Sam2.3k wrote:

From our experience, it is usually better to perform piping between commands as the IO burden to the server can be huge when you try to perform multiple of these analysis. An example will be:

$bwa mem -M -t $thread -R "@RG\tID:$id\tPL:$platform\tPU:"$sampleName"_$processDay\tSM:$sampleName" $refPre $read1 $read2 | $samtools sort -o - - |  $samtools view -bSh - > "$sampleName".sorted.bam

This command can perform the alignment and sorting and creating the bam required for Markduplication (might need double check as I haven't use this update command yet). By saving up the IO burden, you can avoid things like crashing the server etc. I think Brad is trying to do something along this line with his bcbio

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Sam2.3k
2

I wrote a guide (Piping With Samtools, Bwa And Bedtools) a while back on how to pipe and save on I/O. Keep in mind though that some tasks (esp those with slow steps) might be better done in parallel rather than pipe

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Ying W3.9k
3
gravatar for Antonio R. Franco
4.6 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

May I suggest you to include this information in  GitHub so we can fork into it. ?

ADD COMMENTlink written 4.6 years ago by Antonio R. Franco4.0k
7

the new Biostars (Biostars 3.0)  will have the option of storing larger posts (such as the above) in the github repository so that one can contribute to them

ADD REPLYlink written 4.6 years ago by Istvan Albert ♦♦ 80k
0
gravatar for iraun
4.6 years ago by
iraun3.6k
Norway
iraun3.6k wrote:

Hi all,


Thanks Chirag for the post, it very useful. I was wondering, as Dan Gaston has pointed out, how can some of those step can be skipped. I want to put all the steps in a script, but I want to skip as much as possible to increase execution speed. So, for example, for step 2.6 (add read group), it is possible to check if the bam contains RG with a simple command:

if samtools view -H FILE.bam | grep --quiet '^@RG'; then echo "RG included. Go to 2.7; else; Step 2.6; fi

There is anyway to avoid for example step 2.4 (CIGAR M), without executing it? or any other step?

 

ADD COMMENTlink written 4.6 years ago by iraun3.6k

In this tutorial i have tried to document all possible steps where one might get an error, and it is not necessary that everybody will have all the errors.

For eg: If one provide RG info while mapping, that step can be skipped. Similarly, if one has created BAM files themselves, it is almost certain that you won't have "CIGAR M "error. So, If you want to exclude the step 2.4,. you could simply pass the output from step 2.3 to 2.5.

ADD REPLYlink written 4.6 years ago by Chirag Nepal2.2k

Yes, I know that you've documented all possible steps with errors. It's nice :).
One thing, what do you mean with "f one has created BAM files themselves"?

ADD REPLYlink written 4.6 years ago by iraun3.6k

Sometimes, we get processed BAM files from others. And it is possible that different version of reference assembly, such hg19.fa, UCSChg39.fa, could probably lead to this error.

ADD REPLYlink written 4.6 years ago by Chirag Nepal2.2k

Ok, understood. Thank you Chirag.

ADD REPLYlink written 4.6 years ago by iraun3.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2172 users visited in the last hour