Snp Calling Of Tcga Data Using Samtools
3
1
Entering edit mode
8.9 years ago
pm2013 ▴ 50

Hi I am trying to identify SNPs from some TCGA data (Exomes & Transcriptomes) using Samtools mpileup. I have multiple files and I am calling SNPs on one file at a time (to preserve the sample identity, which is lost if you set up mpileup for multiple files at once (I believe). However, it is taking quite a bit of time (almost 1 sample per day). Is there any way to speed up the process? Also would sorting the bam files before the SNP call have any effect on the run time. Any help/suggestions is appreciated.

Thanks

snp samtools mpileup tcga exome transcriptome • 4.7k views
0
Entering edit mode

I believe that mpileup requires a sorted BAM file, so you won't gain anything there. It's a bit tough to give you an answer if you don't post the current command you're using.

0
Entering edit mode

Thanks Matt. I just use the default options as mentioned in the samtools webpage samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

2
Entering edit mode
8.9 years ago
1. If you are interested in coding variants, you could consider supplying a set of regions to samtools to limit calling to those regions.
2. Consider running several samples in parallel since samtools is single-threaded.
2
Entering edit mode
8.9 years ago
Erik Garrison ★ 2.3k

I've also noticed that samtools' default performance is quite slow. Until the recent development work on samtools catches up and provides suitable defaults, I suggest using freebayes. /shameless plug/

You can call somatic variants in a tumor/normal pair as such:

freebayes -f ref.fa --pooled-discrete --genotype-qualities tumor.bam normal.bam | vcfsamplediff -s VT normal tumor -


In my experience, getting good results this way has required post-filtering (using vcfsom specifically). I'm still waiting on results from the most-recent TCGA variant calling exercise. Please write the freebayes mailing list if you have any questions.

Oh, and you'll want to add read groups to the files so that you can call them all at the same time. This is easily done with bamaddrg.

0
Entering edit mode
8.8 years ago
tayebwajb ▴ 110

Hi Erik, I used the command you suggested above to call somatic variants in tumor-normal pairs, and the resulting VCF file named, 'freebayes_pooled-discrete.vcf' had

##INFO=<ID=SSC,Number=1,Type=Float,Description="Somatic variant score (phred-scaled probability that the somatic variant call is correct).">


However, looking at some of the data lines and using 'grep SSC freebayes_pooled-discrete.vcf', none of the called variants had the Somatic variant score (SSC). Could you be knowing why this is so? The exact command line I used was:

freebayes  --no-indels --no-mnps --no-complex -f hg19.fa --pooled-discrete --genotype-qualities normal.bam tumor.bam | ./vcfsamplediff -s VT normal.bam tumor.bam - | ./vcffilter  -f "QUAL > 20" > freebayes_pooled-discrete.vcf


Also, I am using muTect and it has PASS and SOMATIC defined as:

##FILTER=<ID=PASS,Description="Accept as a confident somatic mutation">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic event">


How do these compare with freebayes as I used it? I am trying to determine which parameters will best make freebayes and muTect comparable when it comes to calling somatic point mutations. The muTect command I used is:

java -Xmx2g -jar muTect-1.1.4.jar --analysis_type MuTect --reference_sequence hg19.fa --dbsnp dbsnp_137.hg19.vcf --input_file:normal normal.bam --input_file:tumor tumor.bam --out mutect_results.txt --coverage_file mutect_coverage.wig --enable_extended_output  --vcf mutect_results_extended.vcf


0
Entering edit mode

You may want to look at the VT info field added by vcfsamplediff, which should have information on somatic/germline etc.

Traffic: 2099 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.