Question: GATK SNP calling on multiple samples without a known SNP database
deepkumar1983 (United States) wrote, 9 months ago:

I am working on identifying SNPs in the genome of an insect that was sequenced in three samples. Following the GATK Best Practices pipeline, I have completed the following steps:

  1. Quality-filtered the reads.
  2. Aligned the reads for each sample (S1, S2, S3).
  3. Sorted and marked duplicates in each sample (S1, S2, S3).
  4. Ran RealignerTargetCreator and IndelRealigner on each sample (S1, S2, S3).

Now I am stuck on the next step, base quality score recalibration (BQSR), which requires a database of known SNPs. I don't have one, so I am following the alternative that GATK describes for non-model organisms:

  5. I called raw SNPs in GVCF mode for each sample, then did joint calling to produce a final combined VCF file.
  6. I then applied hard filters to this file and extracted the good-quality SNPs and indels to use as the reference database for the BQSR step.

My question: I now have three realigned files (S1_realigned.bam, S2_realigned.bam, S3_realigned.bam) from step 4 and, if I understand correctly, a reference database of SNPs and indels from step 6. How should I proceed from here? Should I run recalibration on each sample separately, or build one combined recalibration table by providing all three files in the same command?

Thanks, Dr. Deepak

snp bsqr multiple-sample gatk • 443 views
modified 9 months ago by YaGalbi • written 9 months ago by deepkumar1983

Hello deepkumar1983!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=82707

This is typically not recommended as it runs the risk of annoying people in both communities.

written 9 months ago by Pierre Lindenbaum

If you are using GATK4 you don't need to do indel realignment, since HaplotypeCaller performs its own local reassembly of reads around candidate variants. Check out this post from GATK for more details.
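A minimal sketch of what calling straight from the duplicate-marked BAMs might look like with GATK4-style commands. It is written as a small Python helper that only builds and prints the command lines; it does not run them. The file names (ref.fasta, S1_dedup.bam, cohort.g.vcf.gz and so on) and the helper names are assumptions for illustration and do not come from the original post.

```python
import shlex


def haplotypecaller_gvcf_cmd(ref, bam, sample):
    """Build a GATK4 HaplotypeCaller command that emits a per-sample GVCF."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", ref,
        "-I", bam,
        "-O", f"{sample}.g.vcf.gz",
        "-ERC", "GVCF",
    ]


def joint_genotyping_cmds(ref, samples,
                          combined="cohort.g.vcf.gz",
                          out="cohort_raw.vcf.gz"):
    """Build CombineGVCFs and GenotypeGVCFs commands for joint calling."""
    combine = ["gatk", "CombineGVCFs", "-R", ref]
    for sample in samples:
        combine += ["-V", f"{sample}.g.vcf.gz"]
    combine += ["-O", combined]
    genotype = ["gatk", "GenotypeGVCFs", "-R", ref, "-V", combined, "-O", out]
    return [combine, genotype]


# Hypothetical inputs: adjust the reference and BAM names to your own files.
samples = ["S1", "S2", "S3"]
commands = [haplotypecaller_gvcf_cmd("ref.fasta", f"{s}_dedup.bam", s)
            for s in samples]
commands += joint_genotyping_cmds("ref.fasta", samples)

for cmd in commands:
    # Print shell-safe command lines for review instead of executing them.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands first makes it easy to check file names before anything is run; each list could then be passed to subprocess.run once it looks right.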

written 8 months ago by James Reeve
YaGalbi (Biocomputing, MRC Harwell Institute, Oxford, UK) wrote, 9 months ago:

You need to re-create the BAM files by running BQSR (the step that follows your step 4), passing the SNP database (the VCF file you just created in step 6) as the argument to the --known-sites option.

This is from the GATK BQSR guidelines:

You should almost always perform recalibration on your sequencing data. In human data, given the exhaustive databases of variation we have available, almost all of the remaining mismatches -- even in cancer -- will be errors, so it's super easy to ascertain an accurate error model for your data, which is essential for downstream analysis. For non-human data it can be a little bit more work since you may need to bootstrap your own set of variants if there are no such resources already available for your organism, but it's worth it.

Here's how you would bootstrap a set of known variants:

First do an initial round of variant calling on your original, unrecalibrated data. Then take the variants that you have the highest confidence in and use that set as the database of known variants by feeding it as a VCF file to the BaseRecalibrator. Finally, do a real round of variant calling with the recalibrated data. These steps could be repeated several times until convergence. The main case figure where you really might need to skip BQSR is when you have too little data (some small gene panels have that problem), or you're working with a really weird organism that displays insane amounts of variation.
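As a rough illustration of the "take the variants that you have the highest confidence in" step, here is a sketch that builds GATK4-style SelectVariants and VariantFiltration command lines for the SNPs. It only prints the commands. The thresholds follow GATK's generic hard-filtering suggestions for SNPs and are starting points rather than values tuned for this insect genome; the file names and the helper name are assumptions made up for this example.

```python
import shlex

# Generic SNP hard-filter expressions (starting points; tune for your data).
# A variant is flagged when an expression is true, so these describe what to drop.
SNP_FILTERS = {
    "QD2": "QD < 2.0",
    "FS60": "FS > 60.0",
    "MQ40": "MQ < 40.0",
    "MQRankSum-12.5": "MQRankSum < -12.5",
    "ReadPosRankSum-8": "ReadPosRankSum < -8.0",
}


def bootstrap_known_snps_cmds(ref, raw_vcf, prefix="bootstrap"):
    """Build commands that reduce a raw joint-called VCF to high-confidence SNPs."""
    snps = f"{prefix}_snps.vcf.gz"
    flagged = f"{prefix}_snps_flagged.vcf.gz"
    known = f"{prefix}_known_snps.vcf.gz"

    # 1. Keep only SNP records from the raw call set.
    select = ["gatk", "SelectVariants", "-R", ref, "-V", raw_vcf,
              "--select-type-to-include", "SNP", "-O", snps]

    # 2. Annotate the FILTER column for records that fail any expression.
    flag = ["gatk", "VariantFiltration", "-R", ref, "-V", snps, "-O", flagged]
    for name, expression in SNP_FILTERS.items():
        flag += ["--filter-name", name, "--filter-expression", expression]

    # 3. Drop every flagged record, leaving the bootstrap "known sites" set.
    keep = ["gatk", "SelectVariants", "-R", ref, "-V", flagged,
            "--exclude-filtered", "-O", known]
    return [select, flag, keep]


# Hypothetical input names for illustration only.
for cmd in bootstrap_known_snps_cmds("ref.fasta", "cohort_raw.vcf.gz"):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The same pattern would apply to indels with "--select-type-to-include INDEL" and indel-specific thresholds. For bootstrapping it may be worth being stricter than these defaults, since the goal is a small, trustworthy set rather than a complete one.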

modified 9 months ago • written 9 months ago by YaGalbi

Thanks for the help. I would like to know whether the recalibration should be done on each sample separately or whether all three can be combined in a single command. In the first case I get three read-group tables, while in the second I get a single one. In your opinion, which is better?

written 9 months ago by deepkumar1983

While I'm by no means the most experienced in this, I would guess that you should run the recalibration separately for each sample. This is about recalibrating the base calls WITHIN a sample, not ACROSS samples.
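A minimal sketch of the per-sample approach, assuming GATK4-style tool and option names (BaseRecalibrator with --known-sites, then ApplyBQSR with --bqsr-recal-file). With GATK3, which the IndelRealigner step suggests is in use, the equivalents would be "-T BaseRecalibrator -knownSites" followed by "-T PrintReads -BQSR". The input BAM names come from the question; ref.fasta, the known-sites file names, the output names and the helper name are assumptions for illustration. The script only prints the commands.

```python
import shlex


def bqsr_cmds(ref, sample, known_sites):
    """Build BaseRecalibrator and ApplyBQSR commands for a single sample."""
    bam = f"{sample}_realigned.bam"
    table = f"{sample}_recal.table"
    recal_bam = f"{sample}_recal.bam"

    # Model the error covariates for this sample, masking the bootstrapped sites.
    base_recal = ["gatk", "BaseRecalibrator", "-R", ref, "-I", bam]
    for vcf in known_sites:
        base_recal += ["--known-sites", vcf]
    base_recal += ["-O", table]

    # Write a new BAM with base qualities adjusted using this sample's own table.
    apply_bqsr = ["gatk", "ApplyBQSR", "-R", ref, "-I", bam,
                  "--bqsr-recal-file", table, "-O", recal_bam]
    return [base_recal, apply_bqsr]


# Hypothetical known-sites files produced by the hard-filtering step.
known = ["known_snps.vcf.gz", "known_indels.vcf.gz"]

for sample in ["S1", "S2", "S3"]:
    for cmd in bqsr_cmds("ref.fasta", sample, known):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each sample gets its own recalibration table and its own recalibrated BAM, which matches the "three tables" outcome described above.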

modified 9 months ago • written 9 months ago by YaGalbi