I ran verifyBamID on our data to check for sample swaps and contamination.
Our data: whole exome sequence data on normal-tumor paired samples for 95 patients.
I first called genotypes for all patients from the WES data of the 95 normal samples, and then ran verifyBamID on each tumor sample against those genotypes. The command I used is:
verifyBamID --bam ${GATK_BQSR_dir}/${Tumor}.recal.bam --vcf ${GermlineMutations} --out ${verifyBamID_dir}/${batch}_${Tumor} --best --ignoreRG
This is the .bestSM output I got (sorry, it is difficult to read):
SEQ_ID RG CHIP_ID #SNPS #READS AVG_DP FREEMIX FREELK1 FREELK0 FREE_RH FREE_RA CHIPMIX CHIPLK1 CHIPLK0 CHIP_RH CHIP_RA DPREF RDPHET RDPALT
T9 ALL B134340 6967873 14171034 2.03 0.04996 3595962.37 3635330.39 NA NA 0.97661 3618996.71 4708154.17 NA NA 7.531 2.1034 0.7001
T28 ALL B231 6967873 15977931 2.29 0.04260 3905908.38 3945090.62 NA NA 0.43404 3687074.01 4376275.50 NA NA 3.674 2.2191 0.7342
T3 ALL B230 6967873 16995487 2.44 0.05497 4096602.97 4154965.61 NA NA 0.53847 3940187.43 4585371.56 NA NA 5.192 2.2639 0.9876
T41 ALL B578 6967873 16892777 2.42 0.05380 4576180.58 4625879.26 NA NA 0.37675 4374536.53 5061819.99 NA NA 4.600 2.2189 0.9553
T34 ALL B148 6967873 14778134 2.12 0.03513 3621392.88 3649936.94 NA NA 0.39406 3439631.00 3994788.97 NA NA 4.364 2.3758 0.9488
T37 ALL B146 6967873 18465313 2.65 0.03608 4465654.97 4503486.46 NA NA 0.51553 4328126.49 5016103.08 NA NA 5.035 2.3685 0.9142
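Since the raw table is hard to read, here is a minimal Python sketch that parses the rows above and prints only the columns relevant to my questions. The function name `parse_bestsm` is my own, not part of verifyBamID, and the sketch assumes the columns are whitespace-separated exactly as pasted. It tolerates a repeated header line in case several .bestSM files are concatenated.

```python
# Rows copied from the .bestSM output above; columns are whitespace-separated.
BESTSM_TEXT = """\
SEQ_ID RG CHIP_ID #SNPS #READS AVG_DP FREEMIX FREELK1 FREELK0 FREE_RH FREE_RA CHIPMIX CHIPLK1 CHIPLK0 CHIP_RH CHIP_RA DPREF RDPHET RDPALT
T9 ALL B134340 6967873 14171034 2.03 0.04996 3595962.37 3635330.39 NA NA 0.97661 3618996.71 4708154.17 NA NA 7.531 2.1034 0.7001
T28 ALL B231 6967873 15977931 2.29 0.04260 3905908.38 3945090.62 NA NA 0.43404 3687074.01 4376275.50 NA NA 3.674 2.2191 0.7342
T3 ALL B230 6967873 16995487 2.44 0.05497 4096602.97 4154965.61 NA NA 0.53847 3940187.43 4585371.56 NA NA 5.192 2.2639 0.9876
T41 ALL B578 6967873 16892777 2.42 0.05380 4576180.58 4625879.26 NA NA 0.37675 4374536.53 5061819.99 NA NA 4.600 2.2189 0.9553
T34 ALL B148 6967873 14778134 2.12 0.03513 3621392.88 3649936.94 NA NA 0.39406 3439631.00 3994788.97 NA NA 4.364 2.3758 0.9488
T37 ALL B146 6967873 18465313 2.65 0.03608 4465654.97 4503486.46 NA NA 0.51553 4328126.49 5016103.08 NA NA 5.035 2.3685 0.9142
"""


def parse_bestsm(text):
    """Return one dict per sample row (hypothetical helper, not a verifyBamID API).

    Blank lines are skipped, and any line starting with SEQ_ID is treated as a
    header, so repeated headers from concatenated files are harmless.
    """
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "SEQ_ID":
            header = fields
            continue
        if header is None:
            raise ValueError("data row found before any header line")
        rows.append(dict(zip(header, fields)))
    return rows


rows = parse_bestsm(BESTSM_TEXT)

# Print a compact view: tumor ID, best-matching normal, depth, and both mix estimates.
print(f"{'SEQ_ID':<8}{'CHIP_ID':<10}{'AVG_DP':>8}{'FREEMIX':>10}{'CHIPMIX':>10}")
for row in rows:
    print(
        f"{row['SEQ_ID']:<8}{row['CHIP_ID']:<10}"
        f"{row['AVG_DP']:>8}{row['FREEMIX']:>10}{row['CHIPMIX']:>10}"
    )
```

Values are kept as strings so that the NA entries do not need special handling; convert with `float(row["CHIPMIX"])` where a number is needed.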
The first sample in the list, T9, is matched to the wrong normal sample. I want to clarify that the normal sample matched to this tumor sample cannot be the result of a sample swap: the two come from different research institutes, so there was no chance for them to be swapped.
How well does the matched normal sample found by the software actually match the tumor sample? Is there a score that measures this?
Is there an explanation for why B134340 is identified as the best match for T9? Could it be that T9 has a high CHIPMIX, close to 1, and that this means T9 is highly contaminated?
The CHIPMIX values for the other samples are also very high, roughly 0.4 to 0.5. Why?
What is the difference between CHIPMIX and FREEMIX? We are using only whole exome sequence data, and the genotypes we use to find the best match with verifyBamID were also called from WES. In our case, what do these scores mean?
I read the verifyBamID documentation more carefully.
According to the documentation, although T9 is best matched with B134340, B134340 is not really the swapped sample, because CHIPMIX is close to 1. Only if CHIPMIX were close to 0 could it be the swapped sample.
But what does it mean that CHIPMIX for all the other samples is so large, about 0.5?
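To make my reading of the documentation concrete, here is a minimal Python sketch of the rules of thumb as I understand them. This is my interpretation, not code from verifyBamID: the function name `flag_sample` is hypothetical, the 0.02 cutoff is what I believe the documentation suggests for FREEMIX and CHIPMIX, and the 0.9 cutoff for "close to 1" is an arbitrary value I chose for illustration.

```python
def flag_sample(freemix, chipmix, mix_cutoff=0.02, near_one=0.9):
    """Return a list of flags for one sample (hypothetical helper).

    mix_cutoff: assumed threshold above which a mix estimate deserves a closer look.
    near_one:   arbitrary illustration of "CHIPMIX close to 1".
    """
    flags = []
    if freemix > mix_cutoff:
        flags.append("freemix_high")
    if chipmix > mix_cutoff:
        flags.append("chipmix_high")
    # As I read the documentation, a swap is suspected when the reads do not
    # match the given genotypes (CHIPMIX near 1) while the sequence-only
    # estimate shows no contamination (FREEMIX near 0).
    if chipmix >= near_one and freemix <= mix_cutoff:
        flags.append("possible_swap")
    return flags


# FREEMIX and CHIPMIX values taken from the table above.
t9_flags = flag_sample(0.04996, 0.97661)
t28_flags = flag_sample(0.04260, 0.43404)
print("T9: ", t9_flags)
print("T28:", t28_flags)
```

Under these assumed cutoffs both T9 and T28 get the same two flags, which is exactly why I am unsure how to interpret the intermediate CHIPMIX values.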