Comparing two variant files
0
0
Entering edit mode
5 weeks ago

Hello all,

I have two compare two excel files containing variants in sample 1 and sample 2. For this, which columns we have to take into consideration to find match and mismatch between two files.

Kindly suggest.

Files contain columns as :

Locus   Genotype    Filter  Ref Observed Allele Type    No Call Reason  Genes   Location    Length  Copy Number CytoBand    Info    Variant ID  Variant Name    % Frequency Strand  Exon    Transcript  Coding  Amino Acid Change   Variant Effect  PhyloP  SIFT    Grantham    PolyPhen    FATHMM  PFAM    dbSNP   DGV MAF EMAF    AMAF    GMAF    UCSC Common SNPs    ExAC LAF    ExAC EAAF   ExAC OAF    ExAC EFAF   ExAC SAAF   ExAC ENFAF  ExAC AAF    ExAC GAF    COSMIC  OMIM    Gene Ontology   DRA DrugBank    ClinVar Allele Coverage Allele Ratio    CNV Confidence  p-value Phred QUAL Score    Coverage    CNV Precision   Ref+/Ref-/Var+/Var- Homopolymer Length  Tiles   Subset of

variant comparison • 619 views
0
Entering edit mode

It sounds like you might want bcftools isec; https://samtools.github.io/bcftools/bcftools.html#isec

I have recently been using it to compare variants between multiple samples. In particular, the sites.txt file that gets output has the list of all variant regions specified in the input regions list, and details which samples had that mutation. It also outputs the variants unique to each sample.

Note that it seems like some older versions of bcftools did not produce the sites.txt when less than 3 samples were being compared, though that seems to be updated in newer versions.

Also note that bcftools isec requires .vcf input files, so convert your files into that format

0
Entering edit mode

Will the bcftool accept input files in xlsx format..?

1
Entering edit mode

It would be best to use @steve's solution with VCF files if you have those. Excel files can't be used with bcftools (or many other standard VCF tools).

0
Entering edit mode

If you have xlsx files, i would suggest to use MS access for such complex joins. Unless you post content of the xlsx file here (2-3 lines, not the images) and explain what you are trying to achieve, it is difficult to address the issue.

0
Entering edit mode

@steve unfortunately following bit in original posting makes it so that your solution can't be applied in this case (unless OP has VCF files available). I will move your answer to a comment for now.

I have two compare two excel files containing variants in sample 1 and sample 2 (with custom columns)

0
Entering edit mode

well the solution is to convert the files to vcf format of course :)

data scrubbing and conversion is pretty par for the course with bioinformatics

0
Entering edit mode

Than you so much Steve.

I tried with .VCF files but I am not getting sites.txt file

bcftools --version

bcftools 1.12 Using htslib 1.12 Copyright (C) 2021 Genome Research Ltd. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

bcftools isec -p resulrs A.vcf.gz B.vcf.gz

0
Entering edit mode

without posting input files, it is difficult to address the issue. Take two small vcf files (subset from your vcf files) first, with overlapping records and see if isec works. Then extend it to entire vcf files.

0
Entering edit mode

As mentioned by many people here, BCFtools isec is a good way, but you need a VCF file for that.

There is another way which can utilise your excel file. You just need to create an additional column and make a custom variant specific ID in the form- Locus OR Location:Ref:Oberved Allele. You can use the CONCAT() of excel to make this ID.

I can see that there is a column names Variant ID as well which may just contain a rsid (reference SNP ID number). The issue with it is that not all variants will have a rsid (especially if its a private variant). In those cases, a custom ID is helpful.

You make the custom ID for both of your excel files and just compare the ID column between the two files (venny can be helpful here). You can make an extremely specific ID as well by adding Gene name/Exon, etc.

This can also be easily achieved using R.

0
Entering edit mode

To add to prasundutta87's comment, if you can create an ID out of CHROM, POS, REF and ALT, you won't need any other field. The combination of these 4 will always be unique for normalized (multi-allelics decomposed) VCF data.

0
Entering edit mode

I agree. Adding information like gene names/trascript ID/exon, etc. is helpful when you have two files with variants annotated by two different variant annotation software. However, in this case, the basic custom ID described above will be enough..

0
Entering edit mode

Hello Guys,

You are right but I am able to successfully run bcftools, there is no issue but only problem is I am not getting sites.txt which was expected to be generated with version 1.12 as mentioned in the above replies.

0
Entering edit mode

Check out the bcftools manual (https://samtools.github.io/bcftools/bcftools.html#isec) in order to see if you are running the command correctly or it is giving output in some other format..it is also possible that there may not be any common sites. You can try checking if you find common sites using a custom ID as well..