Question: Can Anyone Suggest Me A Script Based Pipeline For Exome Sequencing With Paired End Reads Generated By Illumina For Tumor Samples.
gravatar for vchris_ngs
3.3 years ago by
Milan, Italy
vchris_ngs2.7k wrote:

Can anyone suggest me a pipeline with scripts for exome sequencing starting from the raw reads(paired end) till calling SNP, INDELS and then viewing the aligned file in IGV , it would be good if anyone has worked with GATK(genome analysis toolkit). I am new to exome sequencing data analysis , I have a proposed pipeline which I have made but I cannot understand it properly , so if anyone can help me out who has already in depth knowledge and experience in this area it would be of great help. My pipeline is below. I want to know how to use it with the scripts and call the variants and also find out the list of genes causing the mutations.

Align samples to reference genome (BWA), generates SAI files.

Steps pipeline:

  1. Convert SAI to SAM (BWA)
  2. Convert SAM to BAM binary format (SAM Tools)
  3. Sort BAM(SAM Tools)
  4. Index BAM (SAM Tools)
  5. Identify target regions for realignment (Genome Analysis Toolkit)
  6. Realign BAM to get better Indel calling (Genome Analysis Toolkit)
  7. Reindex the realigned BAM (SAM Tools)
  8. Call Indels (Genome Analysis Toolkit)
  9. Call SNPs (Genome Analysis Toolkit)
  10. View aligned reads in BAM/BAI (Integrated Genome Viewer)

Does anyone have a script to perform this analysis for understanding. I also have a basic script which I am attaching below.

Standard exome sequencing pipeline

Preparation of the input files

tar -xzf chromFa.tar.gz

Then concatenate the single-chromosome files to a single genome reference file (make sure they are in the exact same order as stated below, GATK won't work otherwise): cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa \ chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa \ chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa > hg19.fa

aligning the sequences to the human genome I use BWA ( also why we are doing this step if this is not the actual alignment step)

bwa index -a bwtsw -p hg19 hg19.fa

Actual Alignment

If this is done you could start aligning you fastq files to that by invoking BWA like this:

bwa aln -t 4 -f input.sai -I hg19 input.fastq

For a sample called Exome1 that would in my case look like „@RG\tID:Exome1\tLB:Exome1\tSM:Exome1\tPL:ILLUMINA“ (not being able to understand the below line)

bwa sampe -f out.sam -r "@RQ\tID:<ID>\tLB:<LIBRARY_NAME>\tSM:<SAMPLE_NAME>\tPL:ILLUMINA"\
hg19 input1.sai input2.sai input1.fq input2.fq

SAM to BAM conversion

java -Xmx4g \
    -jar picard/SortSam.jar \
    SO=coordinate \
    INPUT=input.sam \
    OUTPUT=output.bam \

Marking PCR Duplicates

java -Xmx4g \
    -jar picard/MarkDuplicates.jar \
    INPUT=input.bam \
    OUTPUT=input.marked.bam \
    METRICS_FILE=metrics \
    CREATE_INDEX=true \

Local realignment around indels


java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R hg19.fa \
    -o input.bam.list \
    -I input.marked.bam</p>

This step puts the table in the file in input.bam.list. When this is finished we can start the realigning step using the statements below:

java -Xmx4g \
    -jar GenomeAnalysisTK.jar \
    -I input.marked.bam \
    -R hg19.fa \
    -T IndelRealigner \
    -targetIntervals input.bam.list \
    -o input.marked.realigned.bam

When using paired end data, the mate information must be fixed, as alignments may change during the realignment process. Picard offers a utility to do that for us:

java \
    -jar picard/FixMateInformation.jar \
    INPUT=input.marked.realigned.bam \
    OUTPUT=input_bam.marked.realigned.fixed.bam \
    SO=coordinate \

Quality score recalibration

That's still not all. Quality data generated from the sequencer isn't always very accurate and for obtaining good SNP calls (which rely on base quality scores), recalibration of these scores is necessary (See as well). Again this is done in two steps: the CountCovariates step and the TableRecalibration steps. Both can be run from the GATK package:

  1. Count covariates:

    java -Xmx4g -jar GenomeAnalysisTK.jar \ -l INFO \ -R hg19.fa \ --DBSNP dbsnp132.txt \ -I input.marked.realigned.fixed.bam \ -T CountCovariates \ -cov ReadGroupCovariate \ -cov QualityScoreCovariate \ -cov CycleCovariate \ -cov DinucCovariate \ -recalFile input.recal_data.csv

This step creates a .csv file which is needed for the next step and requires a dbSNP file, which can be downloaded at the UCSC Genome browser homepage DbSNP132 is the most novel one which can be downloaded from the UCSC browser, but dbSNP is updated regularly, so newer versions will be available in the future. Download the dbsnp132.txt.gz file and unzip it using gunzip (that's just an example).

  1. Table recalibration:

    java -Xmx4g -jar GenomeAnalysisTK.jar \ -l INFO \ -R hg19.fa \ -I input.marked.realigned.fixed.bam \ -T TableRecalibration \ --out input.marked.realigned.fixed.recal.bam \ -recalFile input.recal_data.csv

SNP calling

Produce raw SNP calls

SNP calling is done using the GATK UnifiedGenotyper program. It calls SNPs and short indels at the same time and gives a well annotated VCF file as output.

java -Xmx4g -jar GenomeAnalysisTK.jar \
    -glm BOTH \
    -R hg19.fa \
    -T UnifiedGenotyper \
    -I input.marked.realigned.fixed.recal.bam \
    -D dbsnp132.txt \
    -o snps.vcf \
    -metrics snps.metrics \
    -stand_call_conf 50.0 \
    -stand_emit_conf 10.0 \
    -dcov 1000 \
    -A DepthOfCoverage \
    -A AlleleBalance \
    -L target_intervals.bed

filter SNPs

Although this step is called filtering, I usually don't throw out possible wrong SNP calls and sometimes it proved to be useful to get back to those SNPs in a later step in the analysis. I prefer to flag them according to the reason why they should be filtered. The filtering scheme are partially the recommended ones by the GATK team and some are based on my experience. A SNP which passes through all the filters doesn't necessarily mean a true SNP call and SNPs filtered out don't necessarily define a sequencing artifact, but it gives a clue for possible reasons why a SNP could be wrong. (In case you've got several exomes (>30) Variant Quality Score recalibration will yield better results than pure filtering. For details see

java -Xmx4g -jar GenomeAnalysisTK.jar \
    -R hg19.fa \
    -T VariantFiltration \
    -B:variant,VCF snp.vcf.recalibrated \
    -o snp.recalibrated.filtered.vcf \
    --clusterWindowSize 10 \
    --filterExpression "MQ0 &gt;= 4 &amp;&amp; ((MQ0 / (1.0 * DP)) &gt; 0.1)" \
    --filterName "HARD_TO_VALIDATE" \
    --filterExpression "DP &lt; 5 " \
    --filterName "LowCoverage" \
    --filterExpression "QUAL &lt; 30.0 " \
    --filterName "VeryLowQual" \
    --filterExpression "QUAL &gt; 30.0 &amp;&amp; QUAL &lt; 50.0 " \
    --filterName "LowQual" \
    --filterExpression "QD &lt; 1.5 " \
    --filterName "LowQD" \
    --filterExpression "SB &gt; -10.0 " \
    --filterName "StrandBias"

Annotations using annovar

Conversion to annovar file format

For annotating SNP calls I use the software annovar ( It annotates a lot of different data to the SNPs and is especially suited for exome-level data-sets. At first we need to convert the VCF file format to the annovar file format. Annovar got it's own script to do that for us --format vcf4 --includeinfo snp.recalibrated.filtered.vcf &gt; snp.annovar

include the –includeinfo argument as this will move the annotations from GATK (filters, SNP quality scores and everything else) to the annovar file. Another script annotates the annovar file. This script needs some annotation files, all of which can be downloaded at their homepage

Be sure to get all the hg19_xxx files if you've done alignment on the hg19 human assembly and save it in the humandb subfolder of the annovar folder. The script then produces a comma-separated text file with all the annotations, which can be viewed in Excel, OpenOffice Calc or similar programs. --buildver hg19 snp.annovar ./humandb -outfile snps

It would be of great help if anyone can come up with suggestions or some pipeline script.

ADD COMMENTlink modified 7 months ago • written 3.3 years ago by vchris_ngs2.7k

format your post please.

ADD REPLYlink written 3.3 years ago by Pierre Lindenbaum88k

I have tried to format it, please let me know if this looks good or not

ADD REPLYlink written 3.3 years ago by vchris_ngs2.7k

There, it's formatted now.

ADD REPLYlink written 7 months ago by Ram11k

Thanks a lot Ram but this post really is not worth the view, I was very new to this exome world those days and everything was going tangential to me. This post is a very rip off from the standard pipeline. I do not know if its worth keeping it and garnering so many views. I have now standard work flows for variant calling totally automated so I am not sure if this post would benefit anyone or not. It was a very naive question from my part battling RNA-Seq , understanding its world and thrown into world of variant in my lab all alone with no one having expertise then. But yes learnt a lot from everyone here over the years.

ADD REPLYlink written 7 months ago by vchris_ngs2.7k


you mentioned "I have now standard work flows for variant calling totally automated"

may you please share more about that or provide a link, because I am in a situation where you were at that time you shared this post

automated pipeline you mean something like SeqMule??

ADD REPLYlink written 12 weeks ago by F2.5k

I have not used SeqMule. My automated workflow is based on GATK workflow where one can make variant calling on single samples. However, if you are interested in making somatic calls you have to run the same workflow on both normal and tumor samples. Switch off the part from the Haplotype caller and use something like VarSca2 or mutect2 on the normal/tumor bam files to extract both somatic and germline variants. The variant pipeline is in the below link, fork it and exploit it for your own use.

Exome auto workflow with GATK

Note: There are a lot of dependencies associated with this workflow. You need to install all the mentioned tools that I have used in your HPC and parse their paths to make this workflow up and functional. If you use it and find some problems, I would encourage you to raise a ticket in github so that I can assist better. Good luck!

ADD REPLYlink written 12 weeks ago by vchris_ngs2.7k

Did you just copy/paste from What is your question?

ADD REPLYlink written 3.3 years ago by matted6.3k

yes that is the best pipeline but I am not being able to understand it.. it would be nice if anyone can help me out with this or propose me something better, I am having trouble with this pipeline..

ADD REPLYlink written 3.3 years ago by vchris_ngs2.7k
gravatar for Chris Miller
3.3 years ago by
Chris Miller17k
Washington University in St. Louis, MO
Chris Miller17k wrote:

What you're asking here is probably beyond the scope of a Q/A site. To properly review all of these steps and provide feedback and suggestions would take hours. If you really need that level of support, then you're going to want to pay someone a consulting fee to help you get your pipeline set up.

If you have specific questions about individual steps or commands, then Biostar can be a great resource, and please do feel free to ask questions. I'd encourage you to look through old posts first, as many of these topics have been addressed individually in the past.

ADD COMMENTlink written 3.3 years ago by Chris Miller17k

thanks a lot, I am doing it actually

ADD REPLYlink written 3.3 years ago by vchris_ngs2.7k
gravatar for Charles Warden
3.3 years ago by
Charles Warden4.6k
Duarte, CA
Charles Warden4.6k wrote:

I think "" is a helpful template that covers most or all of these questions. However, you'll need to know some Perl to know how to update the template to meet your needs:

ADD COMMENTlink written 3.3 years ago by Charles Warden4.6k
gravatar for vchris_ngs
3.3 years ago by
Milan, Italy
vchris_ngs2.7k wrote:

Can anyone tell me the which qualities should be used for the alignment

bwa aln -t 4 -f input.sai -I hg19 input.fastq

-I give the the bwa option to use Illumina1.3+ qualities. I am trying to run a test with one of the samples of 1000 genome project. Can anyone give me an idea how will I know which quality I should be using here? Also it would be nice if anyone can let me know that for Illumina 1.8+ qualities how will the command change? or 1.3+ indicates that it takes care of the higher versions as well. Please guide

ADD COMMENTlink written 3.3 years ago by vchris_ngs2.7k

This would probably be better off posted as a new question. It's likely that only the two or three people involved in this thread will notice that you've posted it here. That said, This previous question may help: What is the default quality encoding expected by BWA?

ADD REPLYlink written 3.3 years ago by Chris Miller17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1741 users visited in the last hour