Question: Best Way To Compare Output Snp/Indels From Different Software?
6
gravatar for Hmm
8.5 years ago by
Hmm500
Hmm500 wrote:

I have a list of SNPs and indels from 4 different software.

IndelGenotyper (broad tool)—for calling somatic indels Bambino—for calling SNP/Indels Somaticsniper—SNP calling Varscan – indel/snp/LOH

Now my task is to compare SNP and indels from different software. I need some input as to how to do that. I only have output in VCF format for IndelGenotyper. The rest of the software have the following output format:

IndelGenotyper (VCF):

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    normal    tumor    
1    177    .    A    AC    .    .    N_AC=295,310;N_DP=867;N_MM=3.1084745,4.6678634;N_MQ=18.840677,20.062836;N_NQSBQ=26.988136,22.908127;N_NQSMM=0.0111864405,0.13539149;N_SC=12,283,116,441;T_AC=234,239;T_DP=691;T_MM=3.2051282,4.7743363;T_MQ=20.816238,21.225664;T_NQSBQ=27.663815,23.273888;T_NQSMM=0.011976048,0.13911061;T_SC=9,225,85,367    GT:GQ    0/1:0    0/1:0
1    230    .    AC    A    .    .    N_AC=218,227;N_DP=663;N_MM=4.03211,4.456422;N_MQ=16.825687,20.529816;N_NQSBQ=32.26916,29.760443;N_NQSMM=0.002770083,0.04280557;N_SC=18,200,73,363;T_AC=162,170;T_DP=465;T_MM=4.351852,4.745763;T_MQ=18.141975,21.064407;T_NQSBQ=32.513077,30.064604;T_NQSMM=0.0124533,0.045261122;T_SC=5,157,34,261    GT:GQ    0/1:0    0/1:0
1    247    .    TA    T    .    .    N_AC=156,162;N_DP=392;N_MM=4.467949,4.3130436;N_MQ=20.666666,19.878262;N_NQSBQ=30.026958,30.963959;N_NQSMM=0.019897304,0.07715736;N_SC=31,125,65,165;T_AC=104,113;T_DP=267;T_MM=4.2115383,4.766234;T_MQ=23.403847,20.603895;T_NQSBQ=31.615385,30.121395;T_NQSMM=0.011538462,0.06525038;T_SC=16,88,24,130    GT:GQ    0/1:0    0/1:0
1    254    .    TA    T    .    .    N_AC=123,144;N_DP=294;N_MM=4.593496,4.613333;N_MQ=21.04878,21.833334;N_NQSBQ=30.16913,29.987564;N_NQSMM=0.018883415,0.0987564;N_SC=27,96,63,87;T_AC=96,108;T_DP=195;T_MM=4.8854165,4.2988505;T_MQ=24.5625,22.16092;T_NQSBQ=30.090431,29.74282;T_NQSMM=0.027339643,0.086142324;T_SC=19,77,23,64    GT:GQ    0/1:0    0/1:0
1    3820    .    TC    T    .    .    N_AC=8,8;N_DP=13;N_MM=1.5,1.8;N_MQ=14.5,12.0;N_NQSBQ=33.7125,31.763159;N_NQSMM=0.0,0.0;N_SC=0,8,0,5;T_AC=8,8;T_DP=12;T_MM=1.25,1.25;T_MQ=10.625,13.25;T_NQSBQ=31.725,31.914286;T_NQSMM=0.0,0.028571429;T_SC=0,8,0,4    GT:GQ    0/1:0    0/1:0
1    235278    .    C    CT    .    .    N_AC=2,2;N_DP=18;N_MM=1.5,0.4375;N_MQ=20.5,12.0625;N_NQSBQ=18.45,33.529034;N_NQSMM=0.0,0.006451613;N_SC=1,1,15,1;T_AC=5,5;T_DP=15;T_MM=1.6,1.1;T_MQ=26.0,26.5;T_NQSBQ=34.2,27.813187;T_NQSMM=0.0,0.054945055;T_SC=5,0,6,4    GT:GQ    0/1:0    0/1:0

somaticsniper:

Chromosome    Position    Reference base    IUB genotype of tumor    Somatic Score    Tumor Consensus quality    Tumor SNV quality    Tumor RMS mapping quality    Depth in tumor (# of reads crossing the position)    Depth in normal (# of reads crossing the position)
1    109    A    W    0    228    228    25    2345    3196
1    177    A    M    0    228    228    23    1399    1770
1    180    T    Y    0    228    228    22    1410    1781
1    250    A    M    1    21    21    24    452    598
1    257    A    M    1    9    9    25    391    508
1    291    C    Y    0    59    176    25    224    312

Bambino:

NormalSample    TumorSample    Name    Chr    Pos    Type    Size    Coverage    Percent_alternative_allele    Chr_Allele    Alternative_Allele    P-value    Text    unique_alternative_ids    reference_normal_count    reference_tumor_count    alternative_normal_count    alternative_tumor_count    count_ref_normal_fwd    count_ref_normal_rev    count_ref_tumor_fwd    count_ref_tumor_rev    count_var_normal_fwd    count_var_normal_rev    count_var_tumor_fwd    count_var_tumor_rev    alternative_fwd_count    alternative_rev_count    alternative_bidirectional_confirmation    broad_coverage    broad_reference_normal_count    broad_reference_tumor_count    broad_alternative_normal_count    broad_alternative_tumor_count    unique_alt_read_starts    unique_alt_read_starts_fwd    unique_alt_read_starts_rev
P7norm.bam    P7tum.bam    chr1.109    chr1    109    SNP    1    2808    0.30477223    A    T    1.0    CCTAACCCTAACCCTAACCC[A/T]ACCCTAACCCTAACCCTAAC    796    1113    810    484    359    570    543    404    406    217    267    170    189    387    456    1    3760    1513    1122    661    464    89    80    70
P7norm.bam    P7tum.bam    chr1.147    chr1    147    deletion    1    1302    0.343318    C        1.0        445    410    347    284    163    219    191    182    165    20    264    15    148    35    412    1    1513    520    465    333    195    85    27    76
P7norm.bam    P7tum.bam    chr1.178    chr1    178    insertion    1    1805    0.46371192        C    1.0        837    566    426    477    360    137    429    99    327    19    458    12    348    31    806    1    2491    838    681    544    428    75    25    72
P7norm.bam    P7tum.bam    chr1.231    chr1    231    deletion    1    1578    0.42712295    C        1.0        674    515    337    387    287    58    457    23    314    25    362    6    281    31    643    1    1819    660    460    398    301    75    25    63
P7norm.bam    P7tum.bam    chr1.248    chr1    248    deletion    1    814    0.4152334    A        1.0        338    243    183    194    144    31    212    15    168    26    168    12    132    38    300    1    931    324    248    207    152    77    26    66
P7norm.bam    P7tum.bam    chr1.255    chr1    255    deletion    1    509    0.45972496    A        1.0        234    149    96    123    111    34    115    11    85    14    109    11    100    25    209    1    625    224    147    135    119    69    20    60
P7norm.bam    P7tum.bam    chr1.304    chr1    304    insertion    1    83    0.25301206        T    1.0        21    44    19    14    7    14    30    3    16    8    6    4    3    12    9    1    380    199    158    16    7    19    12    9
P

Varscan (Indel):

chrom    position    ref    var    normal_reads1    normal_reads2    normal_var_freq    normal_gt    tumor_reads1    tumor_reads2    tumor_var_freq    tumor_gt    somatic_status    variant_p_value    somatic_p_value    tumor_reads1_plus    tumor_reads1_minus    tumor_reads2_plus    tumor_reads2_minus
1    146    A    -C    533    270    33.62%    */-C    409    161    28.25%    */-C    Germline    5.947213588506911E-148    0.9853916902840889    165    244    10    151
1    177    A    +C    436    295    40.36%    */+C    344    234    40.48%    */+C    Germline    2.573218751523094E-189    0.5036502499047071    49    295    9    225
1    230    A    -C    613    218    26.23%    */-C    436    162    27.09%    */-C    Germline    8.727652916131424E-128    0.381197718905699    36    400    5    157
1    247    T    -A    329    156    32.16%    */-A    257    104    28.81%    */-A    Germline    2.9363752902695665E-89    0.869105076258226    31    226    16    88
1    254    T    -A    249    123    33.06%    */-A    165    96    36.78%    */-A    Germline    1.2718779840140742E-76    0.18856251770578447    32    133    19    77
1    329    A    -C    113    26    18.71%    */-C    68    19    21.84%    */-C    Germline    2.491432783762119E-15    0.34120745355261556    28    40    9    10

Varscan (SNP):

chrom    position    ref    var    normal_reads1    normal_reads2    normal_var_freq    normal_gt    tumor_reads1    tumor_reads2    tumor_var_freq    tumor_gt    somatic_status    variant_p_value    somatic_p_value    tumor_reads1_plus    tumor_reads1_minus    tumor_reads2_plus    tumor_reads2_minus
1    109    A    T    749    295    28.26%    W    559    199    26.25%    W    Germline    1.6489449904910897E-166    0.8400429841827789    352    207    121    78
1    180    T    C    463    184    28.44%    Y    379    152    28.63%    Y    Germline    5.019474360557042E-114    0.4973950760340981    49    330    25    127
1    291    C    T    50    97    65.99%    Y    28    67    70.53%    Y    Germline    4.350249434195174E-69    0.27609006399506    7    21    32    35
1    297    C    T    53    83    61.03%    Y    31    59    65.56%    Y    Germline    5.264922068371216E-58    0.29228247724476564    7    24    27    32
1    303    C    T    55    65    54.17%    Y    25    50    66.67%    Y    Germline    5.348385335162754E-46    0.056903610901911816    8    17    24    26
1    309    C    T    49    66    57.39%    Y    25    48    65.75%    Y    Germline    4.640899271738838E-46    0.16096023989817923    7    18    21    27

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

i guess comparing SNP should be straight forward as i look at chromosomal position and the SNP letter (from/to) and if that matches then the SNPs match. Am i correct? I have no idea how to do that for indels? Any comments are highly appreciated.

Thank You All.

indel merge snp output • 6.2k views
ADD COMMENTlink written 8.5 years ago by Hmm500
2
gravatar for lh3
8.4 years ago by
lh331k
United States
lh331k wrote:

When you come to somatic calls, essentially there is no answer because little is known about the characteristics of somatic mutations for a particular cancer. All the answers above are mainly applicable to whole-genome variant discovery. There might be two very subtle criteria: a) your calls are not clustered; b) you get not more than a few thousand calls.

In general, you cannot just run several callers on the same alignment and expect them to give you a reasonable result. They frequently have >90% false positive rate (FPR). The problem is not caused by the caller, but by the alignment. My suggestion is to do two rounds of alignments using distinct algorithms, for example bwa and bwasw (or stampy+ssaha2). Frequently you will find the intersection is much smaller than one of them. This

ADD COMMENTlink written 8.4 years ago by lh331k
1
gravatar for mylons
8.4 years ago by
mylons130
Boston, MA
mylons130 wrote:

You should really just check the common SNP and indel databases, ie dbSNP. Look at the concordance for each SNP caller. That's an objective way of comparing them.

ADD COMMENTlink written 8.4 years ago by mylons130

And let us know how it goes...

ADD REPLYlink written 8.4 years ago by Travis2.8k
1
gravatar for Brett Thomas
8.4 years ago by
Brett Thomas260
Broad Institute
Brett Thomas260 wrote:

One tool you can use is PLINK/SEQ. (Disclaimer: I work on the project.) It's under active development, but can be used to compare output if you have VCF or PED files. One method would be to use the v-stats command to calculate, for example, the Ti/Tv ratio. (see the tutorial)

ADD COMMENTlink written 8.4 years ago by Brett Thomas260
1
gravatar for Rm
8.4 years ago by
Rm7.9k
Danville, PA
Rm7.9k wrote:

Comparing multiple variant calls to a set of known variant loci and their alleles, such as dbSNP or the universal HapMap database is will be a good option. To obtain intersections of multiple variants use bedtools "intersectBed".

Indels can also be processed in similar fashion as variants; but length of indels may vary a given location as predicted by multiple tools. in that case you can also obtain % of overlap.

ADD COMMENTlink written 8.4 years ago by Rm7.9k

@RM -- i will miss any novel SNPs this way. I was thinking of first intersect and then compare to dbsnp and get a list of known vs unknown.

ADD REPLYlink written 8.4 years ago by Hmm500

SNV's unmapped by the dbSNP will be unkwon or novel SNV's...

ADD REPLYlink written 8.4 years ago by Rm7.9k
1
gravatar for Bo Peng
8.2 years ago by
Bo Peng10
Bo Peng10 wrote:

A new tool that seem to be perfect for this job is "variant tools" varianttools.sourceforge.net). Although some of the formats are not yet supported, you can import your variants using customized format specification. Once all the variants are imported, you can separate them by source using command "vtools select --samples" and create variant tables for each file format. You can then compare these tables using command "vtools compare". A lot more can be done using other commands.

ADD COMMENTlink modified 8.2 years ago • written 8.2 years ago by Bo Peng10

Any publication planned?

ADD REPLYlink written 8.2 years ago by Travis2.8k

Under review. :-)

ADD REPLYlink written 8.2 years ago by Bo Peng10

Still under review. :-)

We understand that there are many different formats from different calling algorithms so we have developed an "input format specification" format that can be used to describe and import these formats (http://varianttools.sourceforge.net/Main/Documentation). I have just finished formats for Illumina CASAVA 1.8 and I am working on importing our BGI data. Please feel free to try it out. We will be happy to post your .fmt files so that others can use them.

ADD REPLYlink written 8.2 years ago by Bo Peng10

the publication is available here now: http://bioinformatics.oxfordjournals.org/content/28/3/421.abstract?sid=f64403e7-5050-4102-963c-e690efe003f7

ADD REPLYlink modified 9 weeks ago by RamRS25k • written 7.8 years ago by Sophia300
1
gravatar for Bo Peng
8.1 years ago by
Bo Peng10
Bo Peng10 wrote:

We have recently encountered a similar problem and have put together a tutorial on how to import and compare variants called by different platforms: http://varianttools.sourceforge.net/Tutorial/CompareVariants .

ADD COMMENTlink written 8.1 years ago by Bo Peng10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1782 users visited in the last hour