Can anyone suggest me a pipeline with scripts for exome sequencing starting from the raw reads(paired end) till calling SNP, INDELS and then viewing the aligned file in IGV , it would be good if anyone has worked with GATK(genome analysis toolkit). I am new to exome sequencing data analysis , I have a proposed pipeline which I have made but I cannot understand it properly , so if anyone can help me out who has already in depth knowledge and experience in this area it would be of great help. My pipeline is below. I want to know how to use it with the scripts and call the variants and also find out the list of genes causing the mutations.
Align samples to reference genome (BWA), generates SAI files.
- Convert SAI to SAM (BWA)
- Convert SAM to BAM binary format (SAM Tools)
- Sort BAM(SAM Tools)
- Index BAM (SAM Tools)
- Identify target regions for realignment (Genome Analysis Toolkit)
- Realign BAM to get better Indel calling (Genome Analysis Toolkit)
- Reindex the realigned BAM (SAM Tools)
- Call Indels (Genome Analysis Toolkit)
- Call SNPs (Genome Analysis Toolkit)
- View aligned reads in BAM/BAI (Integrated Genome Viewer)
Does anyone have a script to perform this analysis for understanding. I also have a basic script which I am attaching below.
Standard exome sequencing pipeline
Preparation of the input files
tar -xzf chromFa.tar.gz
Then concatenate the single-chromosome files to a single genome reference file (make sure they are in the exact same order as stated below, GATK won't work otherwise): cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa \ chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa \ chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa > hg19.fa
aligning the sequences to the human genome I use BWA ( also why we are doing this step if this is not the actual alignment step)
bwa index -a bwtsw -p hg19 hg19.fa
If this is done you could start aligning you fastq files to that by invoking BWA like this:
bwa aln -t 4 -f input.sai -I hg19 input.fastq
For a sample called Exome1 that would in my case look like „@RG\tID:Exome1\tLB:Exome1\tSM:Exome1\tPL:ILLUMINA“ (not being able to understand the below line)
bwa sampe -f out.sam -r "@RQ\tID:<ID>\tLB:<LIBRARY_NAME>\tSM:<SAMPLE_NAME>\tPL:ILLUMINA"\ hg19 input1.sai input2.sai input1.fq input2.fq
SAM to BAM conversion
java -Xmx4g -Djava.io.tmpdir=/tmp \ -jar picard/SortSam.jar \ SO=coordinate \ INPUT=input.sam \ OUTPUT=output.bam \ VALIDATION_STRINGENCY=LENIENT \ CREATE_INDEX=true</p>
Marking PCR Duplicates
java -Xmx4g -Djava.io.tmpdir=/tmp \ -jar picard/MarkDuplicates.jar \ INPUT=input.bam \ OUTPUT=input.marked.bam \ METRICS_FILE=metrics \ CREATE_INDEX=true \ VALIDATION_STRINGENCY=LENIENT
Local realignment around indels
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R hg19.fa \ -o input.bam.list \ -I input.marked.bam</p>
This step puts the table in the file in
input.bam.list. When this is finished we can start the realigning step using the statements below:
java -Xmx4g -Djava.io.tmpdir=/tmp \ -jar GenomeAnalysisTK.jar \ -I input.marked.bam \ -R hg19.fa \ -T IndelRealigner \ -targetIntervals input.bam.list \ -o input.marked.realigned.bam
When using paired end data, the mate information must be fixed, as alignments may change during the realignment process. Picard offers a utility to do that for us:
java -Djava.io.tmpdir=/tmp/flx-auswerter \ -jar picard/FixMateInformation.jar \ INPUT=input.marked.realigned.bam \ OUTPUT=input_bam.marked.realigned.fixed.bam \ SO=coordinate \ VALIDATION_STRINGENCY=LENIENT \ CREATE_INDEX=true
Quality score recalibration
That's still not all. Quality data generated from the sequencer isn't always very accurate and for obtaining good SNP calls (which rely on base quality scores), recalibration of these scores is necessary (See http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_poplin.pdf as well). Again this is done in two steps: the CountCovariates step and the TableRecalibration steps. Both can be run from the GATK package:
java -Xmx4g -jar GenomeAnalysisTK.jar \ -l INFO \ -R hg19.fa \ --DBSNP dbsnp132.txt \ -I input.marked.realigned.fixed.bam \ -T CountCovariates \ -cov ReadGroupCovariate \ -cov QualityScoreCovariate \ -cov CycleCovariate \ -cov DinucCovariate \ -recalFile input.recal_data.csv
This step creates a .csv file which is needed for the next step and requires a dbSNP file, which can be downloaded at the UCSC Genome browser homepage DbSNP132 is the most novel one which can be downloaded from the UCSC browser, but dbSNP is updated regularly, so newer versions will be available in the future. Download the dbsnp132.txt.gz file and unzip it using gunzip (that's just an example).
java -Xmx4g -jar GenomeAnalysisTK.jar \ -l INFO \ -R hg19.fa \ -I input.marked.realigned.fixed.bam \ -T TableRecalibration \ --out input.marked.realigned.fixed.recal.bam \ -recalFile input.recal_data.csv
Produce raw SNP calls
SNP calling is done using the GATK UnifiedGenotyper program. It calls SNPs and short indels at the same time and gives a well annotated VCF file as output.
java -Xmx4g -jar GenomeAnalysisTK.jar \ -glm BOTH \ -R hg19.fa \ -T UnifiedGenotyper \ -I input.marked.realigned.fixed.recal.bam \ -D dbsnp132.txt \ -o snps.vcf \ -metrics snps.metrics \ -stand_call_conf 50.0 \ -stand_emit_conf 10.0 \ -dcov 1000 \ -A DepthOfCoverage \ -A AlleleBalance \ -L target_intervals.bed
Although this step is called filtering, I usually don't throw out possible wrong SNP calls and sometimes it proved to be useful to get back to those SNPs in a later step in the analysis. I prefer to flag them according to the reason why they should be filtered. The filtering scheme are partially the recommended ones by the GATK team and some are based on my experience. A SNP which passes through all the filters doesn't necessarily mean a true SNP call and SNPs filtered out don't necessarily define a sequencing artifact, but it gives a clue for possible reasons why a SNP could be wrong. (In case you've got several exomes (>30) Variant Quality Score recalibration will yield better results than pure filtering. For details see http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration)
java -Xmx4g -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -T VariantFiltration \ -B:variant,VCF snp.vcf.recalibrated \ -o snp.recalibrated.filtered.vcf \ --clusterWindowSize 10 \ --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \ --filterName "HARD_TO_VALIDATE" \ --filterExpression "DP < 5 " \ --filterName "LowCoverage" \ --filterExpression "QUAL < 30.0 " \ --filterName "VeryLowQual" \ --filterExpression "QUAL > 30.0 && QUAL < 50.0 " \ --filterName "LowQual" \ --filterExpression "QD < 1.5 " \ --filterName "LowQD" \ --filterExpression "SB > -10.0 " \ --filterName "StrandBias"
Annotations using annovar
Conversion to annovar file format
For annotating SNP calls I use the software annovar (http://www.openbioinformatics.org/annovar). It annotates a lot of different data to the SNPs and is especially suited for exome-level data-sets. At first we need to convert the VCF file format to the annovar file format. Annovar got it's own script to do that for us
convert2annovar.pl --format vcf4 --includeinfo snp.recalibrated.filtered.vcf > snp.annovar
include the –includeinfo argument as this will move the annotations from GATK (filters, SNP quality scores and everything else) to the annovar file. Another script annotates the annovar file. This script needs some annotation files, all of which can be downloaded at their homepage
Be sure to get all the hg19_xxx files if you've done alignment on the hg19 human assembly and save it in the humandb subfolder of the annovar folder. The script then produces a comma-separated text file with all the annotations, which can be viewed in Excel, OpenOffice Calc or similar programs.
summarize_annovar.pl --buildver hg19 snp.annovar ./humandb -outfile snps
It would be of great help if anyone can come up with suggestions or some pipeline script.