Entering edit mode
5.5 years ago
J.F.Jiang ▴ 900
We are focusing on 3000 known SNPs.
Instead of genotype these SNPs from array or taqman, we use PCR to enrich these SNPs, and get them sequenced on HiSeq.
We found genotype these 3000 SNPs cost lots of time.
We split the 3000 SNPs into two categories,
1000 of them were called by GATK UnifiedGenotyper, while the others were genotyped by HaplotypeCaller, using --genotyping_mode GENOTYPE_GIVEN_ALLELES --alleles $vcf --output_mode EMIT_ALL_SITES?
However, it take ~18h to process all these SNPs. Is there any other methods to genotype these SNPs?
You can specify target intervals for variant calling using the -L flag.
I did add -L snp.bed as the calling restriction, but the runing time did not decrease a lot even using the snp.vcf as the restriction region.
If you are just interested in identifying single base changes, then samtools mpileup / bcftools call is much quicker than the GATK, and has greater sensitivity for single nucleotide changes (from my experience).
Take a look at the code that I post here: C: Joint genotyping using GATK - how important is it?
Also take a look at my clinical-grade NGS pipeline on my GitHub page.
I was curious, so glanced at your ClinicalGradeDNAseq repo and it scared me. FixIndels.py says it outputs VCF, but it is certainly not valid VCF (e.g. look at what the spec says REF and ALT can consist of), and makes a bunch of assumptions that are certainly not true of general VCF. It looks to be broken if the input VCF would contain a call like AT -> A,ATT for example (it outputs two "deletion" records). I would also hope that a "clinical grade" pipeline would have a fairly extensive suite of tests.
It does have a very extensive suite of testing done against the gold standard, Sanger, and it picks up all Sanger-confirmed variants in the regions of interest of the lab where it was tested.
FixIndels.py does produce a variant of the VCF format, but this format was specific for the lab, i.e., I had to produce it that way in order for the VCFs to fit into the remaining pipeline already in place in the lab. Also, this format passes VCFtools and BCFtools initial checks for VCF. So, I disagree with your premise that this is 'scary'.
Thanks for taking time to review and question my work.
Edit: A call such as
AT -> A,ATTwould become 2 records:
Thanks for your interest in my work. I have now removed that Python script from the GitHub repository and modified the main analysis script. I knew that Python script would haunt me some day, but it wasn't used for the original work.
Note also the unique feature of the pipeline:
G'day down under