Dear friends,
I used STAR for mapping and VarScan2 to call somatic mutations from 77 pairs of lung cancer and normal tissue samples, with the command lines below.
STAR_2.3.0e.Linux_x86_64/STAR \
--genomeDir genomedir/ \
--runThreadN 4 \
--readFilesIn /home/jli/err/ERR164578_1.fastq /home/jli/err/ERR164578_2.fastq \
--outFileNamePrefix /home/jli/Lung_cancer/ERR164578. \
--outSAMstrandField intronMotif
samtools mpileup \
-f genomedir/Homo_sapiens_assembly19.fasta \
Lung_cancer/ERR164578.bam > Lung_cancer/ERR164578.pileup
Generating the mpileup files was very slow: it took almost 10 hours to produce the pileup file for a single sample. Worse, I then used the following command to detect somatic variants from a pair of normal and cancer samples:
java -jar VarScan.v2.3.7.jar \
somatic \
Lung_cancer/ERR164493.pileup \
Lung_cancer/ERR164578.pileup \
Lung_cancer/ERR164578_VarScan.snp \
--output-vcf 1
VarScan took more than 10 hours to call variants and reported a huge number of germline SNPs, for example:
chr15 77154793 . N C . PASS DP=62;SS=1;SSC=0;GPV=6.4572E-34;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:37:0:36:100%:0,0,16,20 1/1:.:25:0:21:100%:0,0,11,10
chr15 77154794 . N T . PASS DP=62;SS=1;SSC=0;GPV=2.5602E-33;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:37:0:36:100%:0,0,16,20 1/1:.:25:0:20:100%:0,0,10,10
chr15 77154795 . N A . PASS DP=62;SS=1;SSC=0;GPV=2.5602E-33;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:37:0:36:100%:0,0,16,20 1/1:.:25:0:20:100%:0,0,11,9
chr15 77154796 . N T . PASS DP=62;SS=1;SSC=0;GPV=4.0229E-32;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:37:0:36:100%:0,0,16,20 1/1:.:25:0:18:100%:0,0,10,8
chr15 77154797 . N T . PASS DP=63;SS=1;SSC=0;GPV=6.4572E-34;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:37:0:36:100%:0,0,16,20 1/1:.:26:0:21:100%:0,0,10,11
chr15 77154798 . N A . PASS DP=61;SS=1;SSC=0;GPV=1.5943E-31;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:37:0:37:100%:0,0,16,21 1/1:.:24:0:16:100%:0,0,8,8
chr15 77154799 . N G . PASS DP=60;SS=1;SSC=0;GPV=4.0229E-32;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:37:0:37:100%:0,0,16,21 1/1:.:23:0:17:100%:0,0,8,9
As you can see, the reference allele of every variant is N. I ran the same steps on BAM files generated by TopHat mapping: the whole process was much faster, and the number of SNPs called was more reasonable. Has anyone had the same problem before? Any suggestions would be appreciated.
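One check that might help narrow this down: samtools mpileup typically prints N as the reference base when it cannot find the contig in the FASTA given to -f, and the calls above use names like chr15. If the FASTA names its contigs differently (for example 15 rather than chr15), every position would show N. That is only a guess from the output, not something confirmed here. The sketch below defines a hypothetical helper, missing_contigs, that reads a SAM/BAM header on standard input and prints each @SQ contig name that is absent from a FASTA index (.fai) file. It assumes the .fai was made with samtools faidx.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of samtools or VarScan):
# print every contig named in a SAM/BAM header (stdin) that is
# missing from the FASTA index file given as the first argument.
missing_contigs() {
    local fai="$1"
    awk -F'\t' -v fai="$fai" '
        BEGIN {
            # Column 1 of a .fai file is the contig name.
            while ((getline line < fai) > 0) {
                split(line, f, "\t")
                known[f[1]] = 1
            }
            close(fai)
        }
        # @SQ header lines carry the contig name in the SN: tag.
        $1 == "@SQ" {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) {
                    name = substr($i, 4)
                    if (!(name in known)) print name
                }
            }
        }
    '
}

# Example usage (assumes samtools is on PATH and the .fai exists):
#   samtools faidx genomedir/Homo_sapiens_assembly19.fasta
#   samtools view -H Lung_cancer/ERR164578.bam \
#       | missing_contigs genomedir/Homo_sapiens_assembly19.fasta.fai
```

If this prints nothing, the contig names agree and the cause lies elsewhere. If it lists names such as chr15, the BAM and the FASTA probably come from differently named builds of the reference.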