Question: Question about the reliability of --check-sex in PLINK2
gravatar for boymin2020
2.4 years ago by
boymin202020 wrote:

Hi all,

I have tried to use plink2 to check the gender situation of my several datasets. It always shows more problem samples than expected. For some datasets, I also have RNAseq data. Then I checked again with some genes just on chromosome X or Y. It turned out a few problem samples. Therefore, I doubt the reliability of plink sex check. Or maybe, I have some wrong steps/parameters in my codes. The weirdest thing is plink2 can not scan variants on CHROM Y while a number of variants (7375) do exist. The below is an example log file. In addition, there are 45911 variants on CHROM X, but in the log file, it scans 40431. Any advice or comments will be appreciated.

PLINK v1.90b4 64-bit (20 Mar 2017) (C) 2005-2017 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to WGS-GATK-bcftools-Plink-update3.qc.log. Options in effect: --bfile WGS-GATK-bcftools-Plink-update3 --check-sex 0.35 0.65 --noweb --out WGS-GATK-bcftools-Plink-update3.qc

Note: --noweb has no effect since no web check is implemented yet. 16384 MB RAM detected; reserving 8192 MB for main workspace. 2120644 variants loaded from .bim file. 166 people (93 males, 73 females) loaded from .fam. 158 phenotype values loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 166 founders and 0 nonfounders present. Calculating allele frequencies... done. Warning: 904336 het. haploid genotypes present (see WGS-GATK-bcftools-Plink-update3.qc.hh ); many commands treat these as missing. Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing. Total genotyping rate is 0.928209. 2120644 variants and 166 people pass filters and QC. Among remaining phenotypes, 142 are cases and 16 are controls. (8 phenotypes are missing.) --check-sex: 40431 Xchr and 0 Ychr variant(s) scanned, 86 problems detected. Report written to WGS-GATK-bcftools-Plink-update3.qc.sexcheck .

sex-check plink2 • 3.4k views
ADD COMMENTlink modified 2.4 years ago by chrchang5237.6k • written 2.4 years ago by boymin202020

I don't think you're using plink2 - log says plink v1.90b4.

Also, --check-sex is just one indicator that you should use in conjunction with other checks (such as --genome) and a PEDIGREE file to deduce actual problem samples. Plink does flag a number of samples, but aren't we better off with a few extra samples to scrutinize?

I'd recommend picking a set of high quality variants on the X and Y chromosomes to use sex-check on. Run sex check using X as well as Y chromosomes and combine the results for better insight. Samples that have an actual sex swap with raise flags with both sex checks, plus the anomalies in the F score and chrX homozygosity values will be strong enough to warrant deeper digging.

Tagging chrchang523 for more insights

ADD REPLYlink written 2.4 years ago by _r_am32k
gravatar for chrchang523
2.4 years ago by
United States
chrchang5237.6k wrote:

(This response is also posted in plink2-users.)

  • --check-sex ignores chrY for backward compatibility with plink 1.07. See for documentation on the 'ycount' modifier for taking # of nonmissing chrY genotypes into account.
  • If you didn't use --split-x (or plink 2.0's --split-par) before running --check-sex, you should; legitimate male heterozygous genotypes in the pseudoautosomal regions confuse the --check-sex algorithm.
  • Appropriate F-statistic thresholds are pretty data-dependent. Don't worry too much about "X problems detected" during your first --check-sex run; instead, load the output file in R/Python/etc. and make a F-statistic against sex-label plot. If almost all the male F-statistics are in a clump on the right and the female F-statistics are spread out, you're fine. If they're randomly mixed with each other, there's almost certainly some sample-labeling mixup.
ADD COMMENTlink written 2.4 years ago by chrchang5237.6k

I recall using the --extract option to use only a specific set of SNPs, and excluding the PAR while creating that set of SNPs. Does that work?

ADD REPLYlink written 2.4 years ago by _r_am32k

Yes, that works; just need to make sure PAR variants are not included.

ADD REPLYlink written 2.4 years ago by chrchang5237.6k

Thanks for your reply. Another two questions.

  1. How the select the specific set of SNPs?
  2. What is the meaning of PAR?
ADD REPLYlink modified 2.4 years ago by _r_am32k • written 2.4 years ago by boymin202020
  1. Use an allele frequency threshold (>0.01, say) and ensure they have rs numbers - this is just one way.
  2. PAR = pseudo-autosomal regions.
ADD REPLYlink written 2.4 years ago by _r_am32k
  1. Use --freq to get the MAF report, then extract variants (>0.01) with rs outside of PAR?
ADD REPLYlink written 2.4 years ago by boymin202020

That's pretty much it, yes.

ADD REPLYlink written 2.4 years ago by _r_am32k

Thank you very much.

ADD REPLYlink written 2.4 years ago by boymin202020
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2080 users visited in the last hour