Downstream analysis of VCF files obtained from VarScan2
Entering edit mode
5.6 years ago
Raheleh ▴ 260


I am sorry for the very basic and newbie question but I am confusing and getting lost in my thoughts. I have WES data of tumor samples with matched ones extracted from 13 patients (paired-end, illumina). I used BWA mem to align them against hg38, and used VarScan2 to call somatic variations. Now I have 6 files (fpfilter_Passed.vcf) for each patient (Somatic, LOH, Germline for each of snp and indel variations). My question is, what I have to do in this step? Should I merge all 6 files together?

Afterwards should I merge the VCF file of all patients together and then use SnpEff for downstream analysis?

This is the all commands that I used till this step:

#Trim using Trimmomatic
java -jar Trimmomatic-0.38/trimmomatic-0.38.jar PE normal_1_1.fastq normal_1_2.fastq -baseout sample_1 LEADING:30 TRAILING:30 MINLEN:50

#Align against ref using bwa
bwa index hg38.fa
bwa mem hg38.fa normal_1_1P normal_1_2P > normal_1.sam

#Convert sam to bam
samtools view -bS normal_1.sam > normal_1.bam
samtools sort normal_1.bam -o normal_1.sorted.bam
samtools index normal_1.sorted.bam

# Get the raw variants:
samtools mpileup -q 20 -Q 25 -B -d 1000 -f hg38.fa normal_1.sorted.bam tumor_1.sorted.bam | java -jar VarScan.v2.4.3.jar somatic /dev/stdin outputName -mpileup --strand-filter 1 --output-vcf

# Classify into Germ, LOH and Somatic:
java -jar VarScan.v2.4.3.jar processSomatic *.snp.vcf --max-normal-freq 0.01
java -jar VarScan.v2.4.3.jar processSomatic *.indel.vcf --max-normal-freq 0.01

#Preparing BED file
egrep -hv "^#" Germline.hc.vcf | awk 'OFS="\t" {print $1, $2-1, $2+1}' | sort -k1,1 -k2,2n | bedtools merge -i - > Germline.hc.bed

#Run bam-readcount:
bam-readcount -f hg38.fa -q 20 -b 25 -d 1000 -l Germline.hc.bed -w 1 normal_1.bam > Germline.hc.bamRC

#Run fpfilter for all somatics, germlines and LOHs for both snp and indel:

java -jar VarScan.v2.4.3.jar fpfilter outputName.snp.Germline.hc.vcf Germline.hc.bamRC --output-file Germline.hc.fpfilterPassed.vcf --filtered-file Germline.hc.fpfilterFailed.vcf
java -jar VarScan.v2.4.3.jar fpfilter outputName.snp.LOH.hc.vcf LOH.hc.bamRC --output-file LOH.hc.fpfilterPassed.vcf --filtered-file LOH.hc.fpfilterFailed.vcf
java -jar VarScan.v2.4.3.jar fpfilter outputName.snp.Somatic.hc.vcf Somatic.hc.bamRC --output-file Somatic.hc.fpfilterPassed.vcf --filtered-file Somatic.hc.fpfilterFailed.vcf
java -jar VarScan.v2.4.3.jar fpfilter outputName.indel.Germline.hc.vcf Germline.hc.bamRC --output-file indel.Germline.hc.fpfilterPassed.vcf --filtered-file indel.Germline.hc.fpfilterFailed.vcf
java -jar VarScan.v2.4.3.jar fpfilter outputName.indel.LOH.hc.vcf LOH.hc.bamRC --output-file indel.LOH.hc.fpfilterPassed.vcf –filtered-file indel.LOH.hc.fpfilterFailed.vcf
java -jar VarScan.v2.4.3.jar fpfilter outputName.indel.Somatic.hc.vcf Somatic.hc.bamRC --output-file indel.Somatic.hc.fpfilterPassed.vcf --filtered-file indel.Somatic.hc.fpfilterFailed.vcf

Can anyone help me out? Thanks!

WES VCF Somatic variations VarScan2 SnpEff • 4.2k views
Entering edit mode

Hello, samtools mpileup can merge multiple files, then use varscan to analyse, I don't know if this fits you.

Entering edit mode

Thank you MatthewP. But my question is should I merge the files of all 13 patients together? I mean at the end for downstream analysis, there must be only one VCF file? Or for every patient, vcf file should be analysed separately?

Entering edit mode
5.3 years ago
sutturka ▴ 190

It may be better to convert VCF to MAF format (We do this for somatic variants only) using the vcf2maf utility. Once in MAF format, to merge the patients data together, you can simply cat the files together and patient identity is maintained in a separate column. Combined MAF file can be analyzed through tools like maftools to get top mutated genes and other interpretations.

Entering edit mode
5.3 years ago
bari.ballew ▴ 470

Just realized this is an older question, but I already wrote up an answer so hopefully this is still helpful!

First, regarding merging somatic, germline, and LOH files for SNPs and indels: this depends on the question you're trying to answer, but you probably don't want to merge them, at least at first. Quality thresholds for filtering out false positive SNPs are different from the thresholds for indels, so you will likely want to treat these two variant types differently. If you are asking questions like "What variants do I see that may be driving tumor progression?" or "Are there any recurrent variants in my patients' tumors?", then you will want to focus on somatic variants, and would not want to combine them with germline variants. Alternatively, if you're asking something like "Do my patients tend to have a germline variant in common, that may be contributing to cancer susceptibility?", you would focus on the germline variants.

Second, I would recommend against combining VCFs for multiple patients unless they were called jointly (not often the case when doing somatic calling). VCFs only report loci with variant calls. If patient 1 has a variant call at a certain position, but patient 2 has no call there, is it because the second patient was homozygous reference, or because there wasn't enough coverage to call a variant? If you merge across patients, you will end up with many ambiguous sites like this, and they will be annotated in your vcf as "./." without any additional information (e.g. coverage). If you must merge across patients, you can use bcftools merge, but keep this limitation in mind.


Login before adding your answer.

Traffic: 2749 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6