GWAS Manhattan plot shows no SNPs near centromeres — is this result reliable?
2
0
Entering edit mode
2 days ago

I am performing a GWAS analysis in R using the rMVP package with the MLM (mixed linear model) method.
However, in my Manhattan plot, no SNPs appear near the centromeric regions of each chromosome, which looks quite different from the GWAS results in many published studies.

I would like to ask:

  • Is this kind of pattern biologically or technically reasonable?
  • What steps should I take to further verify the reliability of my results?

SNP annotation summary

After SNP calling and hard filtering, the annotation statistics are as follows:

Parameter Range Median Mean
QD 0-50 19.11 20.09
FS 0-20 (dense) 2.15 57.63
MQ around 60 - 57.63
MQRankSum around 0 -0.24 -
ReadPosRankSum around 0 0.09 -
MAF 0-0.15 0.09 0.15
Missing rate mostly <0.25 0.17 0.21

Hard filtering conditions:

--filter-expression "MQ < 60.0" --filter-name "MQ60"
--filter-expression "FS > 5.0" --filter-name "FS5"
--filter-expression "QD < 2.0" --filter-name "QD2"
--filter-expression "MQRankSum < -0.1 || MQRankSum > 0.1" --filter-name "BadMQRankSum"
--filter-expression "ReadPosRankSum < -0.1 || ReadPosRankSum > 0.1" --filter-name "BadReadPos"
rmvp SNP gwas • 542 views
ADD COMMENT
0
Entering edit mode

Additional information

I will provide two figures: SNP distribution along chromosomes enter image description here Composite figure including: (A) Kinship matrix heatmap (B) QQ plot (C) Manhattan plot enter image description here

ADD REPLY
0
Entering edit mode

Is your data coming from sequencing or arrays?

ADD REPLY
0
Entering edit mode

It comes from sequencing.

ADD REPLY
0
Entering edit mode

Well, thats why it looks different from what you see in the usual GWAS papers then - GWAS by sequencing is a very recent development. Almost all GWAS until this point has been done by array, and therefore doesn't suffer from a mappability problem.

ADD REPLY
4
Entering edit mode
1 day ago

Hi,

First off, that pattern in your Manhattan plot --gaps around the centromeres-- is pretty common and usually not a red flag for your analysis. Technically, it's reasonable because centromeric regions are notorious for being tough nuts to crack in sequencing: they're highly repetitive, have weird GC content, and reads map poorly there (or not at all). Your hard filters (especially MQ < 60 and the rank sum tests) are spot-on for GATK best practices, but they naturally cull SNPs in those areas since mapping quality tanks and strand bias creeps in from the repeats. Biologically, it makes sense too—recombination is suppressed near centromeres, so polymorphism is often lower anyway, especially if your species has that heterochromatin setup (assuming this isn't human; if it is, same deal).

Your summary stats look solid overall: median MAF at 0.09 is decent for discovery, missingness under 0.25 is fine, and those QD/FS values aren't screaming problems. The mean FS at 57.63 jumps out a bit (even with your >5 filter), but if it's driven by a few outliers post-filter, it might just be noise from dense regions elsewhere.

To double-check reliability, here's a stepwise plan—focus on diagnostics first, then poke at the data:

  1. QQ plot and genomic control lambda: This is your quick gut-check for model fit. In rMVP, after x<-MVP() , grab the p-values and plot:

    qqplot(-log10(1:p), -log10(gwas$p.values), main="QQ Plot")
    abline(0,1)
    lambda <- median(qchisq(1-gwas$p.values,1), na.rm=T) / qchisq(0.5,1)
    

    Lambda around 1-1.05 is golden; >1.1 might mean unaccounted structure (try adding more PCs if you have them).

  2. SNP density by chromosomal position: Slice your VCF or genofile by position and histogram it per chromosome to visualize those gaps. Load via vcfR (it's stable, no updates needed):

    library(vcfR)
    vcf <- read.vcfR("your_filtered.vcf.gz")
    pos <- getFIX(vcf)$POS
    chrom <- getCHROM(vcf)
    # For chr1 example
    hist(pos[chrom=="1"], breaks=1000, main="SNP Density Chr1")
    # Overlay known centromere coords if you have 'em (e.g., from UCSC)
    

    If gaps align with centromeres (grab coords from your genome browser), you're good—it's not your filters eating signal.

  3. Mappability track: Pull a mappability score (e.g., from ENCODE or your aligner logs) and correlate it with your SNP positions. Low mappability (<0.8) should match your gaps. If not, loosen MQ to 50 and re-run filtering to see if centromeric SNPs pop up (but watch for false positives).

  4. Cross-validate the model: Re-run MLM without kinship (if you're using it) or try FarmCPU in rMVP for comparison—MVP.Config(method="FarmCPU"). If peaks shift but centromere gaps stay, it's technical. Also, subsample your SNPs (e.g., LD-prune harder) and re-plot to rule out over-pruning.

  5. External benchmark: Compare your SNP set to a public panel for your species (e.g., dbSNP or a ref GWAS). Overlap and see if your missing centromeric hits are also absent there.

If lambda's off or densities look wonky beyond centromeres, dig into population structure (PCA on your genotypes) or batch effects. Otherwise, this is likely just the data behaving as expected—plenty of published plant GWAS (guessing from rMVP?) show those barren zones.

Kevin

ADD COMMENT
0
Entering edit mode

Thank you for your professional reply. I will try the suggestions you provided.

ADD REPLY
0
Entering edit mode

I also have a follow-up question. My calculated genomic control lambda is 0.83. As shown in the QQ plot (see Additional information), although there is some deviation (“upward tail”) at the smallest p-values, for the larger p-values the points fall slightly below the expected line.

Since I did not include any covariates (CV) when running MVP.MLM, could this lambda < 1 indicate that my model is too conservative? Specifically, should I consider loosening the hard filtering thresholds in GATK to retain more SNPs?

ADD REPLY
1
Entering edit mode
2 days ago

I don't know how this is usually handled in GWAS studies, but the peri-centromeric regions of chromosomes (as well as the short arms of acrocentric chromosomes) are highly repetative. The centromeric regions themsevles are simple arrays of microsatelites, and as such, it is generally not possible to align reads or call SNPs in these regions.

What I don't really know of the top of my head is how large you'd expect these uncalled regions to be.

ADD COMMENT
0
Entering edit mode

Thanks for your reply. For real, I am not very clear on what the typical size of regions without called SNPs in GWAS should be. I am more concerned that having almost no SNPs in the centromeric regions might indicate a technical issue with SNP calling, filtering, or genome assembly.

ADD REPLY

Login before adding your answer.

Traffic: 3869 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6