Fasted method to genotype given SNPs from NGS data
0
0
Entering edit mode
4.7 years ago
J.F.Jiang ▴ 880

Hi all,

We are focusing on 3000 known SNPs.

Instead of genotype these SNPs from array or taqman, we use PCR to enrich these SNPs, and get them sequenced on HiSeq.

We found genotype these 3000 SNPs cost lots of time.

We split the 3000 SNPs into two categories,

1000 of them were called by GATK UnifiedGenotyper, while the others were genotyped by HaplotypeCaller, using --genotyping_mode GENOTYPE_GIVEN_ALLELES --alleles \$vcf --output_mode EMIT_ALL_SITES?

However, it take ~18h to process all these SNPs. Is there any other methods to genotype these SNPs?

Thanks, Junfeng

genotype SNP NGS • 1.1k views
ADD COMMENT
0
Entering edit mode

You can specify target intervals for variant calling using the -L flag.

ADD REPLY
0
Entering edit mode

I did add -L snp.bed as the calling restriction, but the runing time did not decrease a lot even using the snp.vcf as the restriction region.

ADD REPLY
0
Entering edit mode

If you are just interested in identifying single base changes, then samtools mpileup / bcftools call is much quicker than the GATK, and has greater sensitivity for single nucleotide changes (from my experience).

Take a look at the code that I post here: C: Joint genotyping using GATK - how important is it?

Also take a look at my clinical-grade NGS pipeline on my GitHub page.

ADD REPLY
0
Entering edit mode

I was curious, so glanced at your ClinicalGradeDNAseq repo and it scared me. FixIndels.py says it outputs VCF, but it is certainly not valid VCF (e.g. look at what the spec says REF and ALT can consist of), and makes a bunch of assumptions that are certainly not true of general VCF. It looks to be broken if the input VCF would contain a call like AT -> A,ATT for example (it outputs two "deletion" records). I would also hope that a "clinical grade" pipeline would have a fairly extensive suite of tests.

ADD REPLY
0
Entering edit mode

It does have a very extensive suite of testing done against the gold standard, Sanger, and it picks up all Sanger-confirmed variants in the regions of interest of the lab where it was tested.

FixIndels.py does produce a variant of the VCF format, but this format was specific for the lab, i.e., I had to produce it that way in order for the VCFs to fit into the remaining pipeline already in place in the lab. Also, this format passes VCFtools and BCFtools initial checks for VCF. So, I disagree with your premise that this is 'scary'.

Thanks for taking time to review and question my work.

Kevin

Edit: A call such as AT -> A,ATT would become 2 records:

AT -> A
AT -> ATT


then:

T -> -
- -> T

ADD REPLY
0
Entering edit mode

Thanks for your interest in my work. I have now removed that Python script from the GitHub repository and modified the main analysis script. I knew that Python script would haunt me some day, but it wasn't used for the original work.

Note also the unique feature of the pipeline:

The unique feature of the analysis pipeline that increases sensitivity to Sanger sequencing is in the variant calling step, where a final aligned BAM is split into 3 'sub-BAMs' of 75%, 50%, and 25% random reads. Variants are the called on all 4 BAMs and then the consensus VCF is produced.

G'day down under

ADD REPLY

Login before adding your answer.

Traffic: 1907 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6