Question

Finding somatic and germline variants in tumor samples with matched normals (paired-end, Illumina)

5.4 years ago

Raheleh ▴ 260

Hello, I am new to NGS data analysis and am currently analyzing WES data from tumor samples with matched normal samples (paired-end, Illumina). I am working on the Linux command line. This is what I have done so far for each sample:

fastqc sample.fastq
java -jar trimmomatic-0.38.jar PE sample_1.fastq sample_2.fastq -baseout sample LEADING:30 TRAILING:30 MINLEN:50
bowtie2-build hg38.fa hg38
bowtie2 -x hg38 -1 sample_1P -2 sample_2P -S sample.sam
samtools view -bS sample.sam > sample.bam
samtools sort sample.bam -o sample.sorted.bam
samtools mpileup -uf hg38.fa sample.sorted.bam > sample.mpileup

I don't know what a reasonable next step would be. I am keen on finding somatic and germline variants. I am using VarScan, but I am confused. Should I use "java -jar VarScan.jar somatic normal.pileup tumor.pileup"? What is the difference between a pileup file and an mpileup file?

Any help would be much appreciated. Thanks.

WES data mpileup file varscan • 1.5k views

updated 5.4 years ago by ATpoint 81k • written 5.4 years ago by Raheleh ▴ 260

score 1 · Answer 1 · 2018-11-11

A couple of things:

First, I would change from bowtie2 to BWA mem, because most variant calling pipelines assume BWA as the aligner.

Second, you can shorten your commands by using pipes like align (options...) | samtools sort -o sorted.bam -. This saves time and disk space because no intermediate SAM or unsorted BAM file is written.

Third, given that you are starting a new project, consider using a more recent variant caller than VarScan2. There is nothing wrong with VarScan2, but it is no longer maintained, which is why I personally switched to strelka2 from Illumina recently. If you still want to use VarScan2, you might have a look at my pipeline for it on GitHub. It is an admittedly ugly script, but you can use it to get an idea of how the VarScan2 subcommands are meant to be used. It starts by calling raw variants with mpileup and varscan2 somatic, extracts high-confidence germline and somatic variants with processSomatic, and then applies the recommended heuristic fpfilter to remove potential junk calls. Still, I encourage you to use a more recent caller like strelka2, which also has more complete documentation, making it easier for you to get started with variant calling.
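As a rough illustration of the piping and the order of VarScan2 subcommands described above, a minimal sketch could look like the following. It is not the GitHub pipeline script referred to in the answer. The sample prefixes (normal, tumor), the output basename (pair), the paths hg38.fa and VarScan.jar, the DRY_RUN switch, and the helper functions are all assumptions made for this example. It also assumes the reference has already been indexed with bwa index, and that the trimmed reads are named as in the question (sample_1P, sample_2P). Note that the pileup step uses -f without -u, since -u makes samtools mpileup emit BCF rather than the text pileup that VarScan2 reads.

```shell
#!/usr/bin/env bash
# Sketch only: all file names and the VarScan.jar path are assumptions.
set -euo pipefail

REF=${REF:-hg38.fa}              # reference FASTA, indexed beforehand with: bwa index
VARSCAN=${VARSCAN:-VarScan.jar}  # path to the VarScan2 jar
DRY_RUN=${DRY_RUN:-1}            # 1 = only print the commands, 0 = execute them

# Print a command line and, unless DRY_RUN=1, execute it.
# Commands are passed as strings, so file names must not contain spaces.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        bash -o pipefail -c "$*"
    fi
}

# Align trimmed read pairs and sort in one pipe (no intermediate SAM on disk).
align_and_sort() {   # usage: align_and_sort <sample>
    local s=$1
    run "bwa mem $REF ${s}_1P ${s}_2P | samtools sort -o ${s}.sorted.bam -"
}

# Text pileup for VarScan2 (one file per sample).
make_pileup() {      # usage: make_pileup <sample>
    local s=$1
    run "samtools mpileup -f $REF ${s}.sorted.bam > ${s}.pileup"
}

# Raw calls with 'somatic' (writes <out>.snp and <out>.indel), then
# 'processSomatic' to split them into Germline, Somatic and LOH files.
call_variants() {    # usage: call_variants <normal> <tumor> <out>
    local normal=$1 tumor=$2 out=$3
    run "java -jar $VARSCAN somatic ${normal}.pileup ${tumor}.pileup ${out}"
    run "java -jar $VARSCAN processSomatic ${out}.snp"
    run "java -jar $VARSCAN processSomatic ${out}.indel"
}

for s in normal tumor; do
    align_and_sort "$s"
    make_pileup "$s"
done
call_variants normal tumor pair
```

With the default DRY_RUN=1 the script only prints the commands it would run; setting DRY_RUN=0 executes them. The fpfilter step mentioned above is left out here because it additionally needs a read count file (for example from bam-readcount) for each set of calls.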