Question: Best tool for variant calling
2
gravatar for williamsbrian5064
10 months ago by
williamsbrian5064150 wrote:

Hi,

I am attempting to reanalyze some old data. Recently, an updated genome was released for the organism I work with. I had to map the reads to the new genome and what not. So what I have now is a bam file but now I want to do some variant calling so I can generate a vcf.

I know there is a lot of programs out there to do this but I am attempting to find the best one. I have tried using Platypus but the vcf file that it generates never contains any genome data. That is probably my fault but I can't seem to figure it out and there isn't much support. I tried VarDict but when I go to look at the vcf file, I get an error of "failure to parse tbx_vcf". I can really seem to find why I am getting that error. I have tried Freebayes and that has worked for me, but I have an issue with indels. Freebayes seems to call SNPs instead of acknowledging indels. I know there is GATK, but it seem like such a tedious piece of software. It looks like you have to go through 15 steps to just get anything worthwhile. I could be looking at the wrong examples. What I have notice about GATK is they try to make it so simple that it ends up just making things complicated.

Could anyone maybe give some suggestions or maybe some insight? Any help is very appreciated

Thanks!!

snp next-gen alignment genome • 2.5k views
ADD COMMENTlink modified 4 months ago by claire0 • written 10 months ago by williamsbrian5064150

I doubt there is an objectively best variant caller for each situation. You also seem to put a lot of focus on ease of use, so that's apparently also important....

ADD REPLYlink written 10 months ago by WouterDeCoster37k

I'm not so worried about ease of use. I try to trouble shoot and read about the different software but sometimes there just isn't enough support out there to help figure out your issue. I just need something that is going to work really.

ADD REPLYlink written 10 months ago by williamsbrian5064150

Google Brain Team recently released DeepVariant. We implemented a reproducible version that was submitted to NF-Core (https://lifebit.page.link/cmRv).

We also made it available in Lifebit (https://lifebit.page.link/fiPH) if you want to try it with example parameters.

In practice, DeepVariant first builds images based on the BAM file, then it uses a DeepLearning image recognition approach to obtain the variants and eventually it converts the output of the prediction in the standard VCF format.

ADD REPLYlink written 4 months ago by claire0

Just leaving this here for matters of completeness.

ADD REPLYlink written 4 months ago by ATpoint14k

Has DeepVariant been benchmarked against the Gold Standards in clinical genetics, though? This question remained unanswered here, too: Why we ran DeepVariant as a Nextflow pipeline over cloud.

ADD REPLYlink written 4 months ago by Kevin Blighe39k

Google Brain Team recently released DeepVariant

Its was ~1 year ago...

ADD REPLYlink written 4 months ago by Kevin Blighe39k
5
gravatar for Kevin Blighe
10 months ago by
Kevin Blighe39k
Republic of Ireland
Kevin Blighe39k wrote:

While the GATK is popular, it is also slow and is overly complicated. From my experience from analysing both clinical and research NGS data, GATK is not the most sensitive for detecting germline single nucleotide variants (SNVs). Sensitivity here is gauged by comparing NGS results to those of Sanger sequencing.

As you've mentioned, there are many variant callers out there and each has advantages, including the GATK. A fool would be the person who would come here and say that this or that particular program / pipeline is the best, as they are each good in different scenarios.

My 'go to' pipeline is the 'old faithful':

  1. BWA mem for alignment (if read lengths > 70bp)
  2. Remove low MAPQ reads
  3. Mark and remove PCR / optical duplicates
  4. Produce random BAM subsets per sample (usually 3 subsets at 25%, 50%, and 75% 'randomly' selected reads)
  5. Call variants with BCFtools mpileup (main BAM and 3 subsets together) piped into BCFtools call

The final command would be something like:

bcftools mpileup --redo-BAQ --min-BQ 30 --per-sample-mF \
--output-tags DP,AD -f "${Ref_FASTA}" \
--BCF Full.bam 75pcReads.bam 50pcReads.bam 25pcReads.bam | \
bcftools call --multiallelic-caller --variants-only -Ob > Variants.bcf

This has greater sensitivity for SNVs when compared to every other pipeline that I've tested, which includes GATK and Illumina's BaseSpace pipeline.

If you want to call indels, use pindel.

Kevin

ADD COMMENTlink modified 10 days ago • written 10 months ago by Kevin Blighe39k

I appreciate the help! That was the answer I was looking for really. I have worked with samtools and that works great for me. I will give mpileup a shot!

ADD REPLYlink written 10 months ago by williamsbrian5064150

Yes, don't worry about it. Sometimes people look at you a bit weird when you say that you're using SAMtools for variant calling, but the truth is that it has improved over the years and has some great minds behind its development. What I mention in my answer is just my experience of having processed DNA-seq data in research, clinical, and private settings for 6-7 years. The only other caveat that I'll mention is that, if you are looking for somatic variants, then you may consider another option.

Finally, once you call your variants, further filtering can be done with a Perl executable called vcfutils.pl varFilter, which comes bundled with BCFtools (look in the misc folder under BCFtools root).

If you want to add more tags to your final file, too, then that is also possible with BCFtools plugin. See here: A: How to use bcftools to calculate AF INFO field from AC and AN in VCF?

ADD REPLYlink written 10 months ago by Kevin Blighe39k

I concurr. This is my experience too. What version of samtools are you using?

ADD REPLYlink written 3 months ago by enxxx23210

Version has varied over the years, currently, 1.3.1. I believe we are supposed to use bcftools mpileup now. Funnily enough, private companies, caught up in the notion of 'licensed software means better software', seem convinced that GATK is the supreme caller.

Look at it this way, the pipeline that I mentioned above proved better (sensitivity and specificity to a gold standard) in a clinical setting when compared to NextGene by Sophia Genetics (>£20,000 license fee per year) and GATK (unknown license fee per year).

ADD REPLYlink written 3 months ago by Kevin Blighe39k

What is the purpose of passing the BAM and subsets thereof?

Would that be necessary if I have, for instance, 10 BAMs from different samples?

Thanks

ADD REPLYlink written 6 weeks ago by serpalma.v20

In my work in the clinical setting, I noted that GATK frequently missed true-positive variants that had been validated with Sanger sequencing. The way around this was to sample reads from a BAM file and created BAM subsets for each sample. Calling variants on each of these then allowed us to 'recover' all true-positive variants.

After ditching GATK, we found that bcftools mpileup could actually easily recover these 'missed' variants by GATK, however, we decided to continue to perform the sampling of reads on each sample, just in case.

If you have 10 samples, you can just pass them all to the same mpileup command. Variants will then reported in the output VCF/BCF for all samples.

ADD REPLYlink modified 10 days ago • written 6 weeks ago by Kevin Blighe39k

Regarding this point 4: Produce random BAM subsets per sample (usually 3 subsets at 25%, 50%, and 75% 'randomly' selected reads) Could you please explain the procedure or provide the command used?

Thanks Najeeb

ADD REPLYlink written 4 weeks ago by always_learning960
1

Sure thing. Here are the steps. Assume that you are starting with an aligned, sorted (and QC'd BAM) called Aligned_Sorted_PCRDuped_FiltMAPQ.bam, which is indexed by SAMtools:

echo "#######################################################" ;
echo "#Analysis step 6:  Downsampling / random read sampling#" ;
echo "#######################################################" ;
java -jar picard.jar DownsampleSam INPUT=Aligned_Sorted_PCRDuped_FiltMAPQ.bam \
  OUTPUT=Aligned_Sorted_PCRDuped_FiltMAPQ_75pcReads.bam \
  RANDOM_SEED=50 PROBABILITY=0.75 VALIDATION_STRINGENCY=SILENT ;
samtools index Aligned_Sorted_PCRDuped_FiltMAPQ_75pcReads.bam ;

java -jar picard.jar DownsampleSam INPUT=Aligned_Sorted_PCRDuped_FiltMAPQ.bam \
  OUTPUT=Aligned_Sorted_PCRDuped_FiltMAPQ_50pcReads.bam \
  RANDOM_SEED=50 PROBABILITY=0.5 VALIDATION_STRINGENCY=SILENT ;
samtools index Aligned_Sorted_PCRDuped_FiltMAPQ_50pcReads.bam ;

java -jar picard.jar DownsampleSam INPUT=Aligned_Sorted_PCRDuped_FiltMAPQ.bam \
  OUTPUT=Aligned_Sorted_PCRDuped_FiltMAPQ_25pcReads.bam \
  RANDOM_SEED=50 PROBABILITY=0.25 VALIDATION_STRINGENCY=SILENT ;
samtools index Aligned_Sorted_PCRDuped_FiltMAPQ_25pcReads.bam ;

echo "Done." ;
echo -e "\n\n" ;

echo "###################################" ;
echo "#Analysis step 7:  Variant calling#" ;
echo "###################################" ;
bcftools mpileup --redo-BAQ --min-BQ 30 --per-sample-mF --output-tags DP,AD -f "${Ref_FASTA}" \
  --BCF Aligned_Sorted_PCRDuped_FiltMAPQ.bam \
  Aligned_Sorted_PCRDuped_FiltMAPQ_75pcReads.bam \
  Aligned_Sorted_PCRDuped_FiltMAPQ_50pcReads.bam \
  Aligned_Sorted_PCRDuped_FiltMAPQ_25pcReads.bam | \
  bcftools call --multiallelic-caller --variants-only -Ob > Aligned_Sorted_PCRDuped_FiltMAPQ.bcf ;

Please review all parameters that I use here, along with their values.

If you want to add more information to the final BCF, take a look here: A: How to use bcftools to calculate AF INFO field from AC and AN in VCF?

ADD REPLYlink modified 7 days ago • written 4 weeks ago by Kevin Blighe39k

Hi, can you show your script for whole pipeline (from fastq files)?

ADD REPLYlink written 28 days ago by dentepre0

Hey, the entire pipeline is here: https://github.com/kevinblighe (see ClinicalGradeDNAseq).

You may want to spend a few hours tidying up the code to adapt it to your own pipeline.

ADD REPLYlink written 28 days ago by Kevin Blighe39k

Hi Kevin,

Do you suggest the same pipeline across PANEL and exome data as well?

Thanks Najeeb

ADD REPLYlink written 11 days ago by always_learning960

For single nucleotide germline variants, yes. Not for indels.

ADD REPLYlink written 11 days ago by Kevin Blighe39k
1

If the pipeline is too difficult to implement, then do not worry too much about it. It was designed from within a clinical environment. Try one of the more generic pipelines like GATK if you are based in research.

ADD REPLYlink written 10 days ago by Kevin Blighe39k

Hello!

  • Do you explicitly remove the duplicates? As far as I understand, if duplicates were marked, mpileup skips them.

  • If the flag --min-MQ is included, I guess step 2 is not necessary, right?

ADD REPLYlink written 28 days ago by serpalma.v20
1

I prefer to mark the duplicates and then completely remove (expunge) them from the BAM. This helps to avoid later ambiguity about whether another downstream program is counting them or not.

Again, the same for --min-MQ: I prefer to expunge these reads.

ADD REPLYlink written 28 days ago by Kevin Blighe39k

Thanks!

do you filter secondary and supplementary alignments? or are those assumed to be removed by filtering by mapping quality?

ADD REPLYlink written 22 days ago by serpalma.v20
1

Good question... I do not do anything with those. I assume that they will be removed but cannot confirm that.

ADD REPLYlink written 22 days ago by Kevin Blighe39k

Thanks again!

just one more question to understand --per-sample-mF.

I read that m corresponds to the min number of gapped reads for indel candidates (default: 1) and F corresponds to the min fraction of gapped reads (default: 0.002).

Does this mean that --per-sample-mF is applicable only to indel discovery?

ADD REPLYlink written 18 days ago by serpalma.v20
1

Hey serpalma.v. Yes, that is the interpretation, i.e., --per-sample-mF is only for indel candidates. Of course, mpileup is not the greatest for indel detection. I would use pindel for indels; mpileup is great for single nucleotide variants.

ADD REPLYlink written 18 days ago by Kevin Blighe39k
2
gravatar for Bastien Hervé
10 months ago by
Bastien Hervé3.7k
Limoges, CBRS, France
Bastien Hervé3.7k wrote:

For me, GATK Best Practises are the standard

Example here to call somatic mutations

Note that, If you want to have the most reliable results as possible you will have to go through these 15 steps because each step is significant.

ADD COMMENTlink modified 10 months ago • written 10 months ago by Bastien Hervé3.7k

I get it. That issue I have is that it is significantly harder gather all the data and have it in the correct format to use GATK. It starts to get difficult when you work with a "unique" species.

That example is kind of what I am talking about. It is a great picture of a workflow but doesn't tell you anything about the software that workflow uses.

ADD REPLYlink written 10 months ago by williamsbrian5064150
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 874 users visited in the last hour