First off, bear with me: I am a 16-year-old research intern and I have no idea what I am doing.
I was given two groups of .vcf files, each group consisting of individuals with a different muscular dystrophy phenotype. My goal is to determine whether there are large sets of gene loci that are shared within each group but differ between the two groups.
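To make that goal concrete, it can be phrased as plain set operations. The following is only a minimal sketch, not a tested pipeline: it assumes each individual's variants have already been reduced to a set of (chromosome, position, alt allele) keys, and the sample names, coordinates, and the function name `group_specific_sites` are all made up for illustration.

```python
from functools import reduce


def group_specific_sites(group_a, group_b):
    """Return sites carried by every sample in group_a and by no sample in group_b.

    Each group is a dict mapping a sample name to a set of site keys.
    """
    # Sites that all samples of group_a have in common.
    shared_in_a = reduce(set.intersection, group_a.values())
    # Sites seen in at least one sample of group_b.
    seen_in_b = set().union(*group_b.values())
    return shared_in_a - seen_in_b


# Toy data: (chromosome, position, alt allele); all values are hypothetical.
early = {
    "early_1": {("1", 100, "A"), ("1", 200, "T"), ("2", 50, "G")},
    "early_2": {("1", 100, "A"), ("2", 50, "G")},
}
late = {
    "late_1": {("2", 50, "G"), ("3", 10, "C")},
    "late_2": {("3", 10, "C")},
}

early_only = group_specific_sites(early, late)
late_only = group_specific_sites(late, early)
print(sorted(early_only))
print(sorted(late_only))
```

Requiring a site in every sample of one group and in none of the other is the strictest reading of the goal; with real data that rule would probably need to be relaxed.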
I started out by merging the .vcf files from each group into one file per group (this took longer than you would expect). Then I ran vcf-compare on the two merged files. These are my results:
- This file was generated by vcf-compare.
- The command line was: vcf-compare(r953) earlyCombined.vcf.gz lateCombined.vcf.gz
- VN 'Venn-Diagram Numbers'. Use
grep ^VN | cut -f 2-
to extract this part.
- VN The columns are:
- VN 1 .. number of sites unique to this particular combination of files
- VN 2- .. combination of files and space-separated number, a fraction of sites in the file
VN 56741 earlyCombined.vcf.gz (30.4%)
VN 60599 lateCombined.vcf.gz (31.8%)
VN 129749 earlyCombined.vcf.gz (69.6%) lateCombined.vcf.gz (68.2%)
- SN Summary Numbers. Use
grep ^SN | cut -f 2-
to extract this part.
SN Number of REF matches: 128958
SN Number of ALT matches: 125013
SN Number of REF mismatches: 791
SN Number of ALT mismatches: 3945
SN Number of samples in GT comparison: 0
- GC Genotype Comparison. Use
grep ^GC | cut -f 2-
to extract this part.
- GC The columns are:
- GC 1 .. Sample
- GC 2-6 .. Gtype mismatches: total hom_RR hom_AA het_RA het_AA
- GC 7-9 .. Gtype lost: total het_RA het_AA
- GC 10-14 .. Gtype gained: total hom_RR hom_AA het_RA het_AA
- GC 15-17 .. Phase lost: total het_RA het_AA
- GC 18 .. Phase gained
- GC 19-23 .. Matching sites: total hom_RR hom_AA het_RA het_AA
- GC 24 .. Phased matches: het_RA
- GC 25 .. Misphased matches: het_RA
GC - 14456 0 7108 7348 0 24449 17070 0 22792 3 6357 16351 81 0 0 0 0 32421 0 16514 15907 0 0 0
- AF Number of matching and mismatching genotypes vs non-ref allele frequency. Use
grep ^AF | cut -f 2-
to extract this part.
- AF The columns are:
- AF 1 .. Non-ref allele count
- AF 2 .. Hom(RR) matches
- AF 3 .. Het(RA) matches
- AF 4 .. Hom(AA) matches
- AF 5 .. Het(AA) matches
- AF 6 .. Hom(RR) mismatches
- AF 7 .. Het(RA) mismatches
- AF 8 .. Hom(AA) mismatches
- AF 9 .. Het(AA) mismatches
AF 0.50 0 15907 0 0 0 7348 0 0
AF 1.00 0 0 16514 0 0 0 7108 0
- DP Counts by depth. Use
grep ^DP | cut -f 2-
to extract this part.
- DP The columns are:
- DP 1 .. depth
- DP 2 .. RR matches
- DP 3 .. RA matches
- DP 4 .. AA matches
- DP 5 .. RR -> RA mismatches
- DP 6 .. RR -> AA mismatches
- DP 7 .. RA -> RR mismatches
- DP 8 .. RA -> AA mismat
I am struggling to interpret this output. Is there a way to use vcftools to find the loci that the individuals in one file have in common but that are not shared with the other file?
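For the VN section at least, the three rows can be read as a two-set Venn diagram: 56741 sites occur only in earlyCombined.vcf.gz, 60599 only in lateCombined.vcf.gz, and 129749 in both. As a sanity check on that reading, this small sketch recomputes the percentages from those counts. The function name `venn_fractions` is mine and is not part of vcftools.

```python
def venn_fractions(only_a, only_b, shared):
    """Return the percentage of each file's sites that fall in each Venn region."""
    total_a = only_a + shared  # all sites in the first file
    total_b = only_b + shared  # all sites in the second file
    return {
        "only_a": 100 * only_a / total_a,
        "only_b": 100 * only_b / total_b,
        "shared_of_a": 100 * shared / total_a,
        "shared_of_b": 100 * shared / total_b,
    }


# Counts copied from the VN rows of the vcf-compare output.
fractions = venn_fractions(only_a=56741, only_b=60599, shared=129749)
for region, percent in fractions.items():
    print(f"{region}: {percent:.1f}%")
```

The results round to the 30.4%, 31.8%, 69.6% and 68.2% reported by vcf-compare, so roughly 30% of the sites in each merged file are absent from the other.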
Any help would be greatly appreciated!
EDIT
After further review, I think vcf-isec might be what I'm looking for. I will update this post after trying it, for future reference.
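As a rough illustration of what taking the complement of the two merged files amounts to, here is a sketch in plain Python rather than vcf-isec itself. It assumes standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT) and keys each site on chromosome, position, REF and ALT. The helper names `read_sites` and `private_sites` are hypothetical, and the file names in the commented call are the ones from the vcf-compare command line.

```python
import gzip


def read_sites(path):
    """Collect (CHROM, POS, REF, ALT) keys from a plain or gzipped VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    sites = set()
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip blank lines, meta-information lines and the column header.
            if not line.strip() or line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            sites.add((chrom, int(pos), ref, alt))
    return sites


def private_sites(path_a, path_b):
    """Return the sites found in path_a but not in path_b."""
    return read_sites(path_a) - read_sites(path_b)


# Example calls (assumed usage, not run here):
# early_only = private_sites("earlyCombined.vcf.gz", "lateCombined.vcf.gz")
# late_only = private_sites("lateCombined.vcf.gz", "earlyCombined.vcf.gz")
```

Note that this compares sites, not genotypes: a site private to one merged file only has to appear in at least one individual of that group, so it would still need to be checked against the per-sample genotype columns to see whether the whole group shares it.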