Question

GWAS Manhattan plot shows no SNPs near centromeres — is this result reliable?

0

Entering edit mode

2 days ago

straywilddog • 0

I am performing a GWAS analysis in R using the rMVP package with the MLM (mixed linear model) method.
However, in my Manhattan plot, no SNPs appear near the centromeric regions of each chromosome, which looks quite different from the GWAS results in many published studies.

I would like to ask:

Is this kind of pattern biologically or technically reasonable?
What steps should I take to further verify the reliability of my results?

SNP annotation summary

After SNP calling and hard filtering, the annotation statistics are as follows:

Parameter	Range	Median	Mean
QD	0-50	19.11	20.09
FS	0-20 (dense)	2.15	57.63
MQ	around 60	-	57.63
MQRankSum	around 0	-0.24	-
ReadPosRankSum	around 0	0.09	-
MAF	0-0.15	0.09	0.15
Missing rate	mostly <0.25	0.17	0.21

Hard filtering conditions:

--filter-expression "MQ < 60.0" --filter-name "MQ60"
--filter-expression "FS > 5.0" --filter-name "FS5"
--filter-expression "QD < 2.0" --filter-name "QD2"
--filter-expression "MQRankSum < -0.1 || MQRankSum > 0.1" --filter-name "BadMQRankSum"
--filter-expression "ReadPosRankSum < -0.1 || ReadPosRankSum > 0.1" --filter-name "BadReadPos"

rmvp SNP gwas • 534 views

ADD COMMENT • link 1 day ago by straywilddog • 0

0

Entering edit mode

Additional information

I will provide two figures: SNP distribution along chromosomes enter image description here Composite figure including: (A) Kinship matrix heatmap (B) QQ plot (C) Manhattan plot

ADD REPLY • link 2 days ago by straywilddog • 0

0

Entering edit mode

Is your data coming from sequencing or arrays?

ADD REPLY • link 2 days ago by i.sudbery 22k

0

Entering edit mode

It comes from sequencing.

ADD REPLY • link 2 days ago by straywilddog • 0

0

Entering edit mode

Well, thats why it looks different from what you see in the usual GWAS papers then - GWAS by sequencing is a very recent development. Almost all GWAS until this point has been done by array, and therefore doesn't suffer from a mappability problem.

ADD REPLY • link 2 days ago by i.sudbery 22k

score 4 · Answer 1 · 2025-11-10

Hi,

First off, that pattern in your Manhattan plot --gaps around the centromeres-- is pretty common and usually not a red flag for your analysis. Technically, it's reasonable because centromeric regions are notorious for being tough nuts to crack in sequencing: they're highly repetitive, have weird GC content, and reads map poorly there (or not at all). Your hard filters (especially MQ < 60 and the rank sum tests) are spot-on for GATK best practices, but they naturally cull SNPs in those areas since mapping quality tanks and strand bias creeps in from the repeats. Biologically, it makes sense too—recombination is suppressed near centromeres, so polymorphism is often lower anyway, especially if your species has that heterochromatin setup (assuming this isn't human; if it is, same deal).

Your summary stats look solid overall: median MAF at 0.09 is decent for discovery, missingness under 0.25 is fine, and those QD/FS values aren't screaming problems. The mean FS at 57.63 jumps out a bit (even with your >5 filter), but if it's driven by a few outliers post-filter, it might just be noise from dense regions elsewhere.

To double-check reliability, here's a stepwise plan—focus on diagnostics first, then poke at the data:

QQ plot and genomic control lambda: This is your quick gut-check for model fit. In rMVP, after x<-MVP() , grab the p-values and plot:
```
qqplot(-log10(1:p), -log10(gwas$p.values), main="QQ Plot")
abline(0,1)
lambda <- median(qchisq(1-gwas$p.values,1), na.rm=T) / qchisq(0.5,1)
```
Lambda around 1-1.05 is golden; >1.1 might mean unaccounted structure (try adding more PCs if you have them).
SNP density by chromosomal position: Slice your VCF or genofile by position and histogram it per chromosome to visualize those gaps. Load via vcfR (it's stable, no updates needed):
```
library(vcfR)
vcf <- read.vcfR("your_filtered.vcf.gz")
pos <- getFIX(vcf)$POS
chrom <- getCHROM(vcf)
# For chr1 example
hist(pos[chrom=="1"], breaks=1000, main="SNP Density Chr1")
# Overlay known centromere coords if you have 'em (e.g., from UCSC)
```
If gaps align with centromeres (grab coords from your genome browser), you're good—it's not your filters eating signal.
Mappability track: Pull a mappability score (e.g., from ENCODE or your aligner logs) and correlate it with your SNP positions. Low mappability (<0.8) should match your gaps. If not, loosen MQ to 50 and re-run filtering to see if centromeric SNPs pop up (but watch for false positives).
Cross-validate the model: Re-run MLM without kinship (if you're using it) or try FarmCPU in rMVP for comparison—MVP.Config(method="FarmCPU"). If peaks shift but centromere gaps stay, it's technical. Also, subsample your SNPs (e.g., LD-prune harder) and re-plot to rule out over-pruning.
External benchmark: Compare your SNP set to a public panel for your species (e.g., dbSNP or a ref GWAS). Overlap and see if your missing centromeric hits are also absent there.

If lambda's off or densities look wonky beyond centromeres, dig into population structure (PCA on your genotypes) or batch effects. Otherwise, this is likely just the data behaving as expected—plenty of published plant GWAS (guessing from rMVP?) show those barren zones.

Kevin

score 1 · Answer 2 · 2025-11-10

1

Entering edit mode

2 days ago

i.sudbery 22k

I don't know how this is usually handled in GWAS studies, but the peri-centromeric regions of chromosomes (as well as the short arms of acrocentric chromosomes) are highly repetative. The centromeric regions themsevles are simple arrays of microsatelites, and as such, it is generally not possible to align reads or call SNPs in these regions.

What I don't really know of the top of my head is how large you'd expect these uncalled regions to be.

ADD COMMENT • link 2 days ago by i.sudbery 22k

0

Entering edit mode

Thanks for your reply. For real, I am not very clear on what the typical size of regions without called SNPs in GWAS should be. I am more concerned that having almost no SNPs in the centromeric regions might indicate a technical issue with SNP calling, filtering, or genome assembly.

ADD REPLY • link 2 days ago by straywilddog • 0