Question: Fasted method to genotype given SNPs from NGS data
gravatar for J.F.Jiang
2.9 years ago by
J.F.Jiang830 wrote:

Hi all,

We are focusing on 3000 known SNPs.

Instead of genotype these SNPs from array or taqman, we use PCR to enrich these SNPs, and get them sequenced on HiSeq.

We found genotype these 3000 SNPs cost lots of time.

We split the 3000 SNPs into two categories,

1000 of them were called by GATK UnifiedGenotyper, while the others were genotyped by HaplotypeCaller, using --genotyping_mode GENOTYPE_GIVEN_ALLELES --alleles $vcf --output_mode EMIT_ALL_SITES?

However, it take ~18h to process all these SNPs. Is there any other methods to genotype these SNPs?

Thanks, Junfeng

snp genotype ngs • 786 views
ADD COMMENTlink written 2.9 years ago by J.F.Jiang830

You can specify target intervals for variant calling using the -L flag.

ADD REPLYlink written 2.9 years ago by WouterDeCoster44k

I did add -L snp.bed as the calling restriction, but the runing time did not decrease a lot even using the snp.vcf as the restriction region.

ADD REPLYlink written 2.9 years ago by J.F.Jiang830

If you are just interested in identifying single base changes, then samtools mpileup / bcftools call is much quicker than the GATK, and has greater sensitivity for single nucleotide changes (from my experience).

Take a look at the code that I post here: C: Joint genotyping using GATK - how important is it?

Also take a look at my clinical-grade NGS pipeline on my GitHub page.

ADD REPLYlink written 2.9 years ago by Kevin Blighe66k

I was curious, so glanced at your ClinicalGradeDNAseq repo and it scared me. says it outputs VCF, but it is certainly not valid VCF (e.g. look at what the spec says REF and ALT can consist of), and makes a bunch of assumptions that are certainly not true of general VCF. It looks to be broken if the input VCF would contain a call like AT -> A,ATT for example (it outputs two "deletion" records). I would also hope that a "clinical grade" pipeline would have a fairly extensive suite of tests.

ADD REPLYlink written 2.9 years ago by Len Trigg1.5k

It does have a very extensive suite of testing done against the gold standard, Sanger, and it picks up all Sanger-confirmed variants in the regions of interest of the lab where it was tested. does produce a variant of the VCF format, but this format was specific for the lab, i.e., I had to produce it that way in order for the VCFs to fit into the remaining pipeline already in place in the lab. Also, this format passes VCFtools and BCFtools initial checks for VCF. So, I disagree with your premise that this is 'scary'.

Thanks for taking time to review and question my work.


Edit: A call such as AT -> A,ATT would become 2 records:

AT -> A


T -> -
- -> T
ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by Kevin Blighe66k

Thanks for your interest in my work. I have now removed that Python script from the GitHub repository and modified the main analysis script. I knew that Python script would haunt me some day, but it wasn't used for the original work.

Note also the unique feature of the pipeline:

The unique feature of the analysis pipeline that increases sensitivity to Sanger sequencing is in the variant calling step, where a final aligned BAM is split into 3 'sub-BAMs' of 75%, 50%, and 25% random reads. Variants are the called on all 4 BAMs and then the consensus VCF is produced.

G'day down under

ADD REPLYlink written 2.9 years ago by Kevin Blighe66k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1293 users visited in the last hour