Question: Snp Calling Of Tcga Data Using Samtools
gravatar for pm2013
5.7 years ago by
pm201350 wrote:

Hi I am trying to identify SNPs from some TCGA data (Exomes & Transcriptomes) using Samtools mpileup. I have multiple files and I am calling SNPs on one file at a time (to preserve the sample identity, which is lost if you set up mpileup for multiple files at once (I believe). However, it is taking quite a bit of time (almost 1 sample per day). Is there any way to speed up the process? Also would sorting the bam files before the SNP call have any effect on the run time. Any help/suggestions is appreciated.


ADD COMMENTlink modified 5.6 years ago by tayebwajb90 • written 5.7 years ago by pm201350

I believe that mpileup requires a sorted BAM file, so you won't gain anything there. It's a bit tough to give you an answer if you don't post the current command you're using.

ADD REPLYlink written 5.7 years ago by Matt Shirley9.0k

Thanks Matt. I just use the default options as mentioned in the samtools webpage samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | varFilter -D100 > var.flt.vcf

ADD REPLYlink written 5.7 years ago by pm201350
gravatar for Sean Davis
5.7 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:
  1. If you are interested in coding variants, you could consider supplying a set of regions to samtools to limit calling to those regions.
  2. Consider running several samples in parallel since samtools is single-threaded.
ADD COMMENTlink written 5.7 years ago by Sean Davis25k
gravatar for Erik Garrison
5.7 years ago by
Erik Garrison2.1k
Somerville, MA
Erik Garrison2.1k wrote:

I've also noticed that samtools' default performance is quite slow. Until the recent development work on samtools catches up and provides suitable defaults, I suggest using freebayes. /shameless plug/

You can call somatic variants in a tumor/normal pair as such:

freebayes -f ref.fa --pooled-discrete --genotype-qualities tumor.bam normal.bam | vcfsamplediff -s VT normal tumor -

In my experience, getting good results this way has required post-filtering (using vcfsom specifically). I'm still waiting on results from the most-recent TCGA variant calling exercise. Please write the freebayes mailing list if you have any questions.

Oh, and you'll want to add read groups to the files so that you can call them all at the same time. This is easily done with bamaddrg.

ADD COMMENTlink written 5.7 years ago by Erik Garrison2.1k
gravatar for tayebwajb
5.6 years ago by
tayebwajb90 wrote:

Hi Erik, I used the command you suggested above to call somatic variants in tumor-normal pairs, and the resulting VCF file named, 'freebayes_pooled-discrete.vcf' had

##INFO=<ID=SSC,Number=1,Type=Float,Description="Somatic variant score (phred-scaled probability that the somatic variant call is correct).">

However, looking at some of the data lines and using 'grep SSC freebayes_pooled-discrete.vcf', none of the called variants had the Somatic variant score (SSC). Could you be knowing why this is so? The exact command line I used was:

freebayes  --no-indels --no-mnps --no-complex -f hg19.fa --pooled-discrete --genotype-qualities normal.bam tumor.bam | ./vcfsamplediff -s VT normal.bam tumor.bam - | ./vcffilter  -f "QUAL > 20" > freebayes_pooled-discrete.vcf

Also, I am using muTect and it has PASS and SOMATIC defined as:

##FILTER=<ID=PASS,Description="Accept as a confident somatic mutation"> 
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic event">

How do these compare with freebayes as I used it? I am trying to determine which parameters will best make freebayes and muTect comparable when it comes to calling somatic point mutations. The muTect command I used is:

java -Xmx2g -jar muTect-1.1.4.jar --analysis_type MuTect --reference_sequence hg19.fa --dbsnp dbsnp_137.hg19.vcf --input_file:normal normal.bam --input_file:tumor tumor.bam --out mutect_results.txt --coverage_file mutect_coverage.wig --enable_extended_output  --vcf mutect_results_extended.vcf

Would really appreciate your advice!

ADD COMMENTlink written 5.6 years ago by tayebwajb90

You may want to look at the VT info field added by vcfsamplediff, which should have information on somatic/germline etc.

ADD REPLYlink written 5.6 years ago by Luca Beltrame210
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 942 users visited in the last hour