I only used so far three filters for my whole exome pipeline (aligning to hg19) for a HapMap sample. I tried it on the NA19240 Hapmap sample from paper below (Table 3) which shows ~196 variants (SNPs and INDELs). http://www.nature.com/nmeth/journal/v9/n2/extref/nmeth.1810-S1.pdf
However, using my filters as below I get = ~15000 (just NONSYNONYMOUSCODING alterations). If you add INDELS, it's going to be much higher number. What am I doing wrong?
My list of filters are:
1) vcfutils varFilter -D1000 2) snpEff -minQ 20 -minCoverage 30
Could they have different filters like frequency of variants etc.? If so, how do I set these up? Any help? What are the default parameters for # of reads (minimum) and frequency in bwa,samtools?
Below is my pipeline:
- bwa aln hg19.fa S375R1.fastq > S3751.sai
- bwa aln hg19.fa S375R2.fastq > S3752.sai
- bwa sampe hg19.fa S3751.sai S3752.sai S375R1.fastq S375R2.fastq > S375NoIndexL007.sam
- samtools view -bS S375NoIndexL007.sam > S375NoIndexL007.bam
- samtools sort S375NoIndexL007.bam S375NoIndexL007.sorted
- Marked duplicates using picard
- samtools index S375NoIndexL007.marked.bam
- samtools mpileup -uf hg19.fa S375NoIndexL007.marked.bam | bcftools view -bvcg - > S375NoIndexL007.raw.bcf
- bcftools view S375NoIndexL007.raw.bcf | vcfutils.pl varFilter -D1000 > S375NoIndexL007vard200.flt.vcf