Question: Comparing Genotypes In Two Vcf Files
8
gravatar for enza.colonna
6.8 years ago by
enza.colonna80
enza.colonna80 wrote:

Hi, I would like to quick compare genotypes in two VCF files for a set of individuals. Any quick method?

vcf genotyping • 14k views
ADD COMMENTlink modified 4.5 years ago by Anop Singh Ranawat0 • written 6.8 years ago by enza.colonna80

This file was generated by vcf-compare.

The command line was:

vcf-compare(r840) A_bowtie2.marked.ESAM.flt.vcf.gz A_BWA.marked.ESAM.flt.vcf.gz A_novoalign.marked.ESAM.flt.vcf.gz
VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
VN The columns are: 
VN        1  .. number of sites unique to this particular combination of files
VN        2- .. combination of files and space-separated number, a fraction of sites in the file
VN    3537    A.marked.ESAM.flt.vcf.gz (0.9%)    A_novoalign.marked.ESAM.flt.vcf.gz (0.5%)
VN    4506    A_BWA.marked.ESAM.flt.vcf.gz (0.7%)    A_bowtie2.marked.ESAM.flt.vcf.gz (1.1%)
VN    14668    A_bowtie2.marked.ESAM.flt.vcf.gz (3.6%)
VN    68272    A_BWA.marked.ESAM.flt.vcf.gz (10.2%)
VN    105267    A_novoalign.marked.ESAM.flt.vcf.gz (14.9%)
VN    210565    A_BWA.marked.ESAM.flt.vcf.gz (31.4%)    A_novoalign.marked.ESAM.flt.vcf.gz (29.8%)
VN    387387    A_BWA.marked.ESAM.flt.vcf.gz (57.8%)    A_bowtie2.marked.ESAM.flt.vcf.gz (94.5%)    A_novoalign.marked.ESAM.flt.vcf.gz (54.8%)

SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN    Number of REF matches:    386460
SN    Number of ALT matches:    376359
SN    Number of REF mismatches:    927
SN    Number of ALT mismatches:    10101
SN    Number of samples in GT comparison:    0

Can anyone explain what vcf-compare output means please?

Also when using a reference to compare sequence what use of uninitialized value... message means and the output? confused!

Use of uninitialized value in addition (+) at /home/rob/Downloads/vcftools/bin/vcf-compare line 1134, <$__ANONIO__> line 83918.
Use of uninitialized value in subtraction (-) at /home/rob/Downloads/vcftools/bin/vcf-compare line 1133, <$__ANONIO__> line 81038.
Use of uninitialized value in addition (+) at /home/rob/Downloads/vcftools/bin/vcf-compare line 1134, <$__ANONIO__> line 81038.
Use of uninitialized value in subtraction (-) at /home/rob/Downloads/vcftools/bin/vcf-compare line 1133, <$__ANONIO__> line 83921.
Use of uninitialized value in addition (+) at /home/rob/Downloads/vcftools/bin/vcf-compare line 1134, <$__ANONIO__> line 83921.
VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
VN The columns are:
VN        1  .. number of sites unique to this particular combination of files
VN        2- .. combination of files and space-separated number, a fraction of sites in the file
VN    3537    A_bowtie2.marked.ESAM.flt.vcf.gz (0.9%)    A_novoalign.marked.ESAM.flt.vcf.gz (0.5%)
VN    4506    A_BWA.marked.ESAM.flt.vcf.gz (0.7%)    A_bowtie2.marked.ESAM.flt.vcf.gz (1.1%)
VN    14668    A_bowtie2.marked.ESAM.flt.vcf.gz (3.6%)
VN    68272    A_BWA.marked.ESAM.flt.vcf.gz (10.2%)
VN    105267    A_novoalign.marked.ESAM.flt.vcf.gz (14.9%)
VN    210565    A_BWA.marked.ESAM.flt.vcf.gz (31.4%)    A_novoalign.marked.ESAM.flt.vcf.gz (29.8%)
VN    387387    A_BWA.marked.ESAM.flt.vcf.gz (57.8%)    A_bowtie2.marked.ESAM.flt.vcf.gz (94.5%)    A_novoalign.marked.ESAM.flt.vcf.gz (54.8%)
ADD REPLYlink modified 14 days ago by RamRS24k • written 6.4 years ago by robert0

-m is not work in vcf-compare option

vcf-compare -m *.vcf.gz
Missing argument in sprintf at /usr/local/bin/vcf-compare line 142.
Invalid conversion in sprintf: &quot;%\232&quot; at /usr/local/bin/vcf-compare line 142.
Missing argument in sprintf at /usr/local/bin/vcf-compare line 142.
Invalid conversion in sprintf: &quot;%\362&quot; at /usr/local/bin/vcf-compare line 142.
Missing argument in sprintf at /usr/local/bin/vcf-compare line 142.
Invalid conversion in sprintf: &quot;%\026&quot; at /usr/local/bin/vcf-compare line 142.
Expected 1 column names, found [�BC�(�}{o�Ȗ�ߙOA�`0   Fq��,V��n@��m��R���-�6o˒V����|�=��EI��$
                                                                                                  �}qc��T<<�:�z��&lt;���q:�'���_�#���CO�EpBm�l�&@����08��gn���5z����G���ç˫����.   q)��c0&������#�4{��p�t2������?�E8��ͧ�`�x=��/�d�������Ŭ��/���d� M�/G����i���h2�%��藘�pq��`
<������_������!���V�u�\N�S0o��p���'����.O���Qpd�#k������wvw[����ܚM�����l:_#+�h�:���h����S����5�����0�&gt;��(���|W`�������}N�o%�?���a��B�����3��w���5}���I`E�/�q[���s`~�f8��y`��q���^-�N��Bo��T��
                                            ��Xh    ]�Cl=�H���z�p<�<�2��%nK}����{�^{��DJ;S�MVg<Ɓu;��I۲���r�8�0jY���5hY��MA��a�r��s����:��-�a
                                                                                                                                                          �>�#�jw:].
 at /usr/local/bin/vcf-compare line 32.
    main::error('Expected 1 column names, found [\x{1f}\x{8b}\x{8}\x{4}\x{0}\x{0}\x{0}\x{0}\x{0}\x{ff}\x{6}\x{0}BC\x{2}\x{0}\x{f5}(\x{ed}}{o\x{db}\x{c8}\x{96}\x{e7}\x{df}\x{99}O...') called at /usr/local/bin/vcf-compare line 142
    main::read_mappings_list('clinvar.vcf.gz', 'ARRAY(0xac0ee0)') called at /usr/local/bin/vcf-compare line 163
    main::compare_vcfs('HASH(0xac0b50)') called at /usr/local/bin/vcf-compare line 22
ADD REPLYlink modified 14 days ago by RamRS24k • written 4.5 years ago by Anop Singh Ranawat0
9
gravatar for Erik Garrison
6.8 years ago by
Erik Garrison2.2k
Napoli, IT / UCSC
Erik Garrison2.2k wrote:

If you want to directly compare genotypes, look at vcfgtcompare in vcflib. You can use it like this:

vcfgtcompare.sh other this.vcf other.vcf

The output will be this.vcf with genotype concordance annotations provided in the INFO column:

##INFO=<ID=other.has_variant,Number=0,Type=Flag,Description="True if other has a called alternate among samples under comparison.">
##INFO=<ID=other.genotypes.AA_AA,Number=1,Type=Integer,Description="Number of genotypes with AA_AA relationship with other">
##INFO=<ID=other.genotypes.AA_AR,Number=1,Type=Integer,Description="Number of genotypes with AA_AR relationship with other">
##INFO=<ID=other.genotypes.AA_RR,Number=1,Type=Integer,Description="Number of genotypes with AA_RR relationship with other">
##INFO=<ID=other.genotypes.AA_NN,Number=1,Type=Integer,Description="Number of genotypes with AA_NN relationship with other">
##INFO=<ID=other.genotypes.AR_AA,Number=1,Type=Integer,Description="Number of genotypes with AR_AA relationship with other">
... etc.

I typically use the other.genotypes.AA_AA, other.genotypes.AA_AR, etc. to tabulate concordance statistics. Here AA_AR means that in the first file the genotype was alternate/alternate, and in the second it was alternate/reference. N means "no-call" or partial no-call, which might happen if you were to remove non-intersecting alleles using vcfintersect. To do this for many loci, I drop the data into tab-separated format using vcf2tsv:

% vcfgtcompare.sh other this.vcf other.vcf | vcf2tsv >this_other_comparison.tsv
% R
> this_other <- read.delim("this_other_comparison.tsv")
> this_other[is.na(this_other)] <- 0  # converts NA's to 0 for later processing

Then I use this function to tabulate genotype concordance in R:

gterror <- function(s, n) {
   with(s,
   (sum(eval(as.symbol(paste(n, ".genotypes.AA_AR", sep=""))))
    + sum(eval(as.symbol(paste(n, ".genotypes.AR_AA", sep=""))))
    + sum(eval(as.symbol(paste(n, ".genotypes.AA_RR", sep=""))))
    + sum(eval(as.symbol(paste(n, ".genotypes.RR_AA", sep=""))))
    + sum(eval(as.symbol(paste(n, ".genotypes.AR_RR", sep=""))))
    + sum(eval(as.symbol(paste(n, ".genotypes.RR_AR", sep="")))))
   / ((sum(eval(as.symbol(paste(n, ".genotypes.RR_RR", sep=""))))
       + sum(eval(as.symbol(paste(n, ".genotypes.AA_AA", sep=""))))
       + sum(eval(as.symbol(paste(n, ".genotypes.AR_AR", sep="")))))
      + (sum(eval(as.symbol(paste(n, ".genotypes.AA_AR", sep=""))))
         + sum(eval(as.symbol(paste(n, ".genotypes.AR_AA", sep=""))))
         + sum(eval(as.symbol(paste(n, ".genotypes.AA_RR", sep=""))))
         + sum(eval(as.symbol(paste(n, ".genotypes.RR_AA", sep=""))))
         + sum(eval(as.symbol(paste(n, ".genotypes.AR_RR", sep=""))))
         + sum(eval(as.symbol(paste(n, ".genotypes.RR_AR", sep="")))))))
}

In this way (in R):

> gterror(this_other, "other")  # yields rate of discordance between the two sets
ADD COMMENTlink modified 14 days ago by RamRS24k • written 6.8 years ago by Erik Garrison2.2k

Thanks Erik, does it overwrite this.vcf?

ADD REPLYlink written 6.8 years ago by enza.colonna80

No, it does not. With a few exceptions, all of the tools in vcflib read VCF (often on stdin) and write VCF on stdout. So here, the VCF stream which it prints on stdout is the result. The inputs are unchanged.

ADD REPLYlink written 6.8 years ago by Erik Garrison2.2k

thanks Erik for clarifying this.

ADD REPLYlink written 6.8 years ago by enza.colonna80
4
gravatar for Pierre Lindenbaum
6.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

There is a tool "vcf-compare" in vcftools.

ADD COMMENTlink modified 14 days ago by RamRS24k • written 6.8 years ago by Pierre Lindenbaum123k

Thanks Pierre, this is useful.

ADD REPLYlink written 6.8 years ago by enza.colonna80
0
gravatar for swbarnes2
6.8 years ago by
swbarnes26.7k
United States
swbarnes26.7k wrote:

You can also use BEDTools to compare two vcfs.

ADD COMMENTlink written 6.8 years ago by swbarnes26.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1968 users visited in the last hour