Question: From fastq file to vcf file (inconsistent vcf file from two methods)
5.0 years ago by
Korea, Republic Of
illinois.ks160 wrote:

I created a VCF file from FASTQ files using a recent GATK pipeline.


After finishing the variant discovery procedure (including variant recalibration), I have a VCF file that is ready to annotate with other tools such as snpEff.


But the question is this.

Our MiSeq machine from Illumina comes with a built-in program that produces a VCF file from FASTQ files automatically.

(In this case I don't need to run GATK myself; the machine's built-in program does everything. I checked that it also uses a GATK pipeline.)


However, the VCF file I created myself with the GATK pipeline and the one generated automatically by the Illumina machine differ greatly in the number of variants.


I know that different programs report different variant calls. However, the VCF file generated automatically by the Illumina machine has about 9300 variants called, while my VCF file (generated using GATK) has 55000 variants, which is a huge difference.


I know I need to filter out some variants based on criteria such as read depth, quality score, etc. But I think that at the very beginning the number of called variants should be comparable. Am I missing something?


Could someone please help me with this?




illumina next-gen gatk vcf • 3.6k views
modified 5.0 years ago by Daniel Swan13k • written 5.0 years ago by illinois.ks160

Going from FASTQ to VCF is a long way. First you have to align the reads against the reference (and aligners can already introduce differences). Then SNP calling can be done with different parameters, and this might also affect the results.

I suggest you look for a tutorial on SNP calling using GATK and one on the MiSeq built-in tools.
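As a quick first check before comparing pipelines, it can help to see how the records in each VCF are distributed over the FILTER column and how many reach a given QUAL. The following is only a minimal sketch using the standard tab-separated VCF columns; the function name `tally_vcf`, the threshold of 30 and the example records are made up for illustration and are not taken from either pipeline.

```python
from collections import Counter


def tally_vcf(lines, min_qual=30.0):
    """Count records in total, per FILTER value, and with QUAL >= min_qual.

    `lines` is any iterable of VCF text lines (for example an open file).
    The threshold is an arbitrary illustration, not a recommended cutoff.
    """
    by_filter = Counter()
    total = 0
    high_qual = 0
    for line in lines:
        # Skip blank lines, meta lines (##...) and the column header (#CHROM...).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        total += 1
        by_filter[fields[6]] += 1  # column 7 is FILTER
        qual = fields[5]           # column 6 is QUAL, "." means missing
        if qual != "." and float(qual) >= min_qual:
            high_qual += 1
    return total, by_filter, high_qual


# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50.0\tPASS\t.",
    "chr1\t200\t.\tC\tT\t12.5\tLowQual\t.",
    "chr2\t300\t.\tG\tA\t.\t.\t.",
]
total, by_filter, high_qual = tally_vcf(example)
print(total, dict(by_filter), high_qual)
```

To run it on a real file, pass an open file handle, e.g. `with open("my_calls.vcf") as fh: tally_vcf(fh)` (the file name is a placeholder). If most of the extra records in one file are not marked PASS, the two totals are not directly comparable.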

written 5.0 years ago by Fabio Marroni2.5k
5.0 years ago by
Daniel Swan13k
Aberdeen, UK
Daniel Swan13k wrote:

The issue of concordance between genotypers has been discussed before:

If you're comparing a black-box pipeline with your hand-rolled GATK run, there are all kinds of variables that may or may not be in play. As someone has already pointed out, you may already be looking at data that has come from two separate aligners. You may be calling SNPs across a whole genome with GATK, whereas the Illumina calls may be restricted to, e.g., regions of enrichment from an amplicon assay or exome capture. There are too many variables to diagnose.
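One way to narrow this down is to restrict both call sets to the same target regions and then compare them site by site. The sketch below assumes the targets are available as a BED file (0-based, half-open intervals) and that both VCFs use the same chromosome naming; the helper names and the example records are hypothetical, and it does no normalization of indel representation, so treat it as a rough first look rather than a proper concordance tool.

```python
def load_regions(bed_lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = {}
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def in_regions(chrom, pos, regions):
    """True if the 1-based VCF position `pos` lies inside any interval on `chrom`."""
    # A BED interval [start, end) covers 1-based positions start+1 .. end.
    return any(start < pos <= end for start, end in regions.get(chrom, []))


def variant_keys(vcf_lines, regions=None):
    """Return the set of (chrom, pos, ref, alt) keys, one per ALT allele."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        pos = int(pos)
        if regions is not None and not in_regions(chrom, pos, regions):
            continue
        for allele in alt.split(","):
            keys.add((chrom, pos, ref, allele))
    return keys


# Hypothetical inputs, for illustration only.
bed = ["chr1\t90\t250"]
gatk_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t40\tPASS\t.",
    "chr2\t300\t.\tG\tA\t60\tPASS\t.",
]
miseq_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t80\tPASS\t.",
]

regions = load_regions(bed)
gatk = variant_keys(gatk_vcf, regions)
miseq = variant_keys(miseq_vcf, regions)
shared = gatk & miseq
gatk_only = gatk - miseq
miseq_only = miseq - gatk
print(len(gatk), len(miseq), len(shared), len(gatk_only), len(miseq_only))
```

If the counts become comparable once both files are limited to the same regions, the difference was mostly one of scope; if not, the aligner and calling parameters are the next things to look at.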



written 5.0 years ago by Daniel Swan13k