GATK SNP calling on multiple samples without a known SNP database
5.9 years ago

I am working on identifying SNPs in the genome of an insect that was sequenced in three samples. Following the GATK best-practice pipeline, I have completed the following tasks.

  1. Quality-filtered the reads.
  2. Aligned the reads for each sample (S1, S2, S3).
  3. Sorted and marked duplicates in each sample (S1, S2, S3).
  4. Ran RealignerTargetCreator and IndelRealigner on each sample (S1, S2, S3).

Now I am stuck on the next step, base quality score recalibration (BQSR), which requires a database of known SNPs. I don't have one, so I am following the alternative steps GATK describes for non-model organisms, that is:

  5. I called raw SNPs in GVCF mode for each sample and then did joint calling to get a final combined VCF file.
  6. Next, I applied hard filters to this file and extracted the good-quality SNPs and indels to use as the reference database for the BQSR step.

Now the question is: I have three realigned files (S1_realigned.bam, S2_realigned.bam, S3_realigned.bam) from step 4 and a reference database of SNPs and indels (if I am right) from step 6. How should I proceed? Should I run recalibration on each sample separately, or build one combined recalibration table by providing all three BAM files in the same command?

Thanks Dr. Deepak

GATK SNP BSQR Multiple-sample • 2.8k views

Hello deepkumar1983!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=82707

This is typically not recommended as it runs the risk of annoying people in both communities.


If you are using GATK4, you don't need to do indel realignment, since HaplotypeCaller performs its own local reassembly of reads around candidate variant sites. Check out this post from GATK for more details.

5.9 years ago
BioinfGuru ★ 1.7k

You need to re-create the BAM files by running the BQSR step you were stuck on after step 4, passing the SNP database (VCF file) you just created in step 6 as the argument to the --known-sites option.

This is from GATK BQSR GUIDELINES:

You should almost always perform recalibration on your sequencing data. In human data, given the exhaustive databases of variation we have available, almost all of the remaining mismatches -- even in cancer -- will be errors, so it's super easy to ascertain an accurate error model for your data, which is essential for downstream analysis. For non-human data it can be a little bit more work since you may need to bootstrap your own set of variants if there are no such resources already available for your organism, but it's worth it.

Here's how you would bootstrap a set of known variants:

First do an initial round of variant calling on your original, unrecalibrated data. Then take the variants that you have the highest confidence in and use that set as the database of known variants by feeding it as a VCF file to the BaseRecalibrator. Finally, do a real round of variant calling with the recalibrated data. These steps could be repeated several times until convergence. The main case figure where you really might need to skip BQSR is when you have too little data (some small gene panels have that problem), or you're working with a really weird organism that displays insane amounts of variation.
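
The first half of that bootstrap (initial calling, then keeping only the high-confidence calls) could be sketched roughly as below. This is only a dry run that builds and prints GATK4-style command lines, it does not execute anything. The file names (reference.fasta, cohort.g.vcf.gz, known_snps.vcf.gz and so on) are made up for illustration, and the filter thresholds are GATK's generic starting points for SNP hard filtering, which you would probably need to tune for your insect. Indels would need their own pass with different thresholds.

```python
import shlex

# Hypothetical inputs: adjust to your own reference and sample names.
REFERENCE = "reference.fasta"
SAMPLES = ["S1", "S2", "S3"]

# Assumed thresholds (GATK's generic SNP hard-filter starting points).
SNP_FILTER = "QD < 2.0 || FS > 60.0 || MQ < 40.0"


def bootstrap_known_sites_commands(reference, samples):
    """Build the commands for one bootstrap round (nothing is executed)."""
    commands = []

    # 1. Per-sample calling in GVCF mode on the unrecalibrated BAMs.
    for sample in samples:
        commands.append([
            "gatk", "HaplotypeCaller",
            "-R", reference,
            "-I", f"{sample}_realigned.bam",
            "-O", f"{sample}.g.vcf.gz",
            "-ERC", "GVCF",
        ])

    # 2. Combine the per-sample GVCFs and genotype them jointly.
    combine = ["gatk", "CombineGVCFs", "-R", reference]
    for sample in samples:
        combine += ["-V", f"{sample}.g.vcf.gz"]
    combine += ["-O", "cohort.g.vcf.gz"]
    commands.append(combine)
    commands.append([
        "gatk", "GenotypeGVCFs",
        "-R", reference, "-V", "cohort.g.vcf.gz", "-O", "raw.vcf.gz",
    ])

    # 3. Keep SNPs only, flag low-confidence calls, then drop flagged records.
    commands.append([
        "gatk", "SelectVariants",
        "-R", reference, "-V", "raw.vcf.gz",
        "--select-type-to-include", "SNP",
        "-O", "raw_snps.vcf.gz",
    ])
    commands.append([
        "gatk", "VariantFiltration",
        "-R", reference, "-V", "raw_snps.vcf.gz",
        "--filter-expression", SNP_FILTER,
        "--filter-name", "snp_hard_filter",
        "-O", "flagged_snps.vcf.gz",
    ])
    commands.append([
        "gatk", "SelectVariants",
        "-R", reference, "-V", "flagged_snps.vcf.gz",
        "--exclude-filtered",
        "-O", "known_snps.vcf.gz",
    ])
    return commands


for command in bootstrap_known_sites_commands(REFERENCE, SAMPLES):
    print(shlex.join(command))
```

The resulting known_snps.vcf.gz is what would then be handed to BaseRecalibrator as the known-sites file. If you are on GATK3 (which the IndelRealigner step suggests), the tool names and flags differ, so treat the printed commands as a template rather than something to paste verbatim.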


Thanks for the help. I would like to know whether the recalibration should be done on each sample separately, or whether all three can be combined in a single command. In the first case I get three read group tables, while in the second it gives one read group table. In your opinion, which is better?

ADD REPLY
0
Entering edit mode

While I'm by no means the most experienced in this, I would guess to do step 4 separately. This is about re-calibrating the base calls WITHIN a sample, not ACROSS samples.
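
A per-sample run along those lines might look like the sketch below. Again it only builds and prints GATK4-style command lines (BaseRecalibrator followed by ApplyBQSR) and runs nothing. The names reference.fasta, known_snps.vcf.gz, and the *_recal.table / *_recal.bam outputs are placeholders I made up, and the known-sites VCF is assumed to be the hard-filtered file from step 6.

```python
import shlex

# Hypothetical inputs: adjust to your own files.
REFERENCE = "reference.fasta"
KNOWN_SITES = "known_snps.vcf.gz"  # assumed: the hard-filtered VCF from step 6
SAMPLES = ["S1", "S2", "S3"]


def bqsr_commands(sample, reference, known_sites):
    """Build the two BQSR commands for a single sample (nothing is executed)."""
    in_bam = f"{sample}_realigned.bam"
    table = f"{sample}_recal.table"

    # Pass 1: model the error covariates, masking the known variant sites.
    base_recalibrator = [
        "gatk", "BaseRecalibrator",
        "-R", reference,
        "-I", in_bam,
        "--known-sites", known_sites,
        "-O", table,
    ]

    # Pass 2: write a new BAM with the adjusted base qualities.
    apply_bqsr = [
        "gatk", "ApplyBQSR",
        "-R", reference,
        "-I", in_bam,
        "--bqsr-recal-file", table,
        "-O", f"{sample}_recal.bam",
    ]
    return [base_recalibrator, apply_bqsr]


# One table and one recalibrated BAM per sample.
for sample in SAMPLES:
    for command in bqsr_commands(sample, REFERENCE, KNOWN_SITES):
        print(shlex.join(command))
```

This gives three recalibration tables and three recalibrated BAMs, one per sample, which can then go into the second round of variant calling.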
