Question: samtools variant calling workflow
2
gravatar for mirza
23 months ago by
mirza80
India
mirza80 wrote:

Hi,

I am working on variant calling on fungal genomes. I have Illumina HiSeq reads. I am new to this and am following this workflow

Step 1: QC of raw reads performed using FastQC tool

Step 2: Preprocessing of raw reads performed using Trimmomatic-v0.36 tool

Step 3: QC of clean reads performed using FastQC tool

Step 4: Alignment to reference genome using bowtie2-v2.2.6 tool

Step 5: SAM to BAM (alignment files) conversion using samtools-v1.3.1

Step 6: Remove duplicates using sambamba-0.6.6 tool

Step 7: coordinate sorting of bam files with samtools

Step 8: Variant calling performed using samtools/bcftools

Step 9: variant filtering with bcftools

In this post Workflow Or Tutorial For Snp Calling? and other variant calling related resources I found 2 additional steps before variant calling i.e. local realignment and base quality recalibration.

  1. Are these steps essential?
  2. I found that these option are not available in samtools but in GATK. How can I perform these steps for my data?
ADD COMMENTlink modified 23 months ago by Santosh Anand4.6k • written 23 months ago by mirza80
2
gravatar for Santosh Anand
23 months ago by
Santosh Anand4.6k
Santosh Anand4.6k wrote:

Local realignment is not needed anymore with GATK latest pipeline using HaplotypeCaller. You may just forget it as it's utility is only in calling SNPs nearby Indels, and there is no definite guideline how much it helps when using with tools other than GATK.

https://software.broadinstitute.org/gatk/blog?id=7847

Base quality re-calibration can still give you some advantages, as it claims to rectify the base call probabilities, which are used by almost all of the variant callers to give more confidence to the called variant. It is easy to do, provided you have GATK pipeline set-up at your end.

http://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr

(TL;DR: See "Creating a recalibrated BAM" in above doc)

PS: Your reads must have RG-tags in bam-files to use most of the GATK protocols/tools. You may search my earlier post regarding that. PPS: If it looks too troublesome, you may ignore it too as this not an obligatory step.

ADD COMMENTlink modified 23 months ago • written 23 months ago by Santosh Anand4.6k

Hi Santosh,

I have already completed upto Step7 of the work flow above using the tools mentioned and no I don't have GATK installed. What would you suggest in this case? I am supposed to use this work flow maybe coz they have been using this here.

ADD REPLYlink written 23 months ago by mirza80

You may proceed with your current pipeline, also because doing everything GATK way will need some time to learn. However, it is better to learn GATK in long-term as that is more or less standard now. You may start from here for GATK https://software.broadinstitute.org/gatk/best-practices/

ADD REPLYlink written 23 months ago by Santosh Anand4.6k

I think I can add these 2 steps in my work flow (using gatk) before the 8th step i.e. actual variant calling. I read someone do it in some post here on Biostar. What would you suggest?

ADD REPLYlink written 23 months ago by mirza80

My opinion is that it will not harm you in any case. So you should try it if you could.

ADD REPLYlink written 23 months ago by Santosh Anand4.6k

Thanks, I appreciate.

ADD REPLYlink written 23 months ago by mirza80

@Santosh Hi, GATK requires read group tag @RG. I have aligned my reads using bowie2 (without the use of RG tag, as I never required it earlier) and completed the 7 steps of my workflow. Now, I read that we can add RG using picard but the big question, what values I will give in the command? I have no clue. My bam files look like this,

$ samtools view  brs11b_dr.bam | head -n 10

ST-E00251:108:HLFTVCCXX:4:1101:14722:2733   83  scaffold17  386238  42  150M    =   386080  -308    AGAGGTCCAAGGGGGTCATGTACATCGTCAAGCAACGCACACCAGGCGCGACATCCAAAAATCAAGACCAGAAATAGAAAAAGTAGATAAAAAAAATGAAAAGAACAGATCGTAACTATGACTGGCCTCCCCGCCTAGAAATAGACCCAA  KKKKKKFA7KFKFKKKKKKKKKKKKKKKKKKKKKKKK<KKFFF<KKKKKKKKKFKKKKKKKFFFKKKFAKKKKKFKKKKKKKKKKKFFKKKKKKKKKKFKKKKKK7FKKFFKKKKKKKKKKKKKKKFKKFKKKKKKKKKKKKKKKFFFAA  AS:i:-17    XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:12C13A73G49    YS:i:-5 YT:Z:CP
ST-E00251:108:HLFTVCCXX:4:1101:14722:2733   163 scaffold17  386080  42  150M    =   386238  308 ATGAAACAAGGTAATCAACTGAATTGTGTTGCTATTGTATTGAACAAAGATAGCTAAACCATTTACACTGAAAGCGCGTGAGTCATCCAAAAGACAAACCAAAGCCAAGACGTCAGTAGCGCATTATCAACCTAAAGCACAACATTTACA  AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKFKKKKKKKKKAFFKKKFFFFKKKKKKKAFFKKKAFFKKKFFKKKFAAFKKKFKK7AFFKFFAFKKKKKKKKAFKKAFAAKAF<FFKKKKFF  AS:i:-5XN:i:0   XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:88G61  YS:i:-17    YT:Z:CP
ST-E00251:108:HLFTVCCXX:4:1101:15595:2733   99  scaffold1   2536280 40  95M =   2536545 327 ATCCAAGACTATCTTAAGCGTATCGTCAAGTATACAAATGTTGAGGTTGGTCGGAATAGGATTCCGCAGGTGTTCTAAGGAGCTGACAACCACTT AAFFFKKKKKKKKKKKKKKKAFKKKKKKKKFKKKKFKKKFFAKKA,FKKKKKKKKAKKKKKKKKKAKKFFFKAKK<FKFFF,FKKKKKKKKKKFA AS:i:-5 XN:i:0  XM:i:1XO:i:0    XG:i:0  NM:i:1  MD:Z:20C74  YS:i:-17    YT:Z:CP
ST-E00251:108:HLFTVCCXX:4:1101:15595:2733   147 scaffold1   2536545 40  62M =   2536280 -327    CCCGGGTTAGTAATCCCATCTCTATCTTGATATGGCGAACTCACCCACGTTCGTACCCCTAG  FAAKKKKKKKKKF<KKKKKFKKFAFFKFF7KKKFFKKKKKKKKKFAKKKKKKKKKKKFFFAA  AS:i:-17    XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:51T2G2A4   YS:i:-5 YT:Z:CP

and so on...

ADD REPLYlink modified 23 months ago • written 23 months ago by mirza80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 836 users visited in the last hour