Question: Detection Of Snps In Proximity Of Manhattan Plot Peaks
gravatar for Agatha
7.0 years ago by
Agatha340 wrote:

I am currently interested in finding the SNPs that are in the proximity of all the peaks in a Manhattan plot (not only the genome wide significant ones) Basically for each peak, I would like to detect 2, 3 annotated SNPs on both sides(+/- 3 SNP positions) . I have also been looking Find The Extent Of A Peak In A Manhattan Plot which is partially what I am trying to achieve as well.

I have a list of SNPs (200 000)(defined by chromosome and position) and their associated P-value. I have used the data to generate the Manhattan plots. The data is resulted after a Chip experiment with DNA from cancer patients and healthy controls. I have detected the significant variants wich might be associated with cancer, but I am also trying to see if in the vicinity of those SNPs there are other variants with a known association to cancer. I would like to do this for the detected genome wide significant SNPs and for the other observed peaks.

snp • 3.5k views
ADD COMMENTlink modified 7.0 years ago by Arun2.3k • written 7.0 years ago by Agatha340

Please add some context. What biological problem are you working on? Manhattan plots can represent many different types of data.

ADD REPLYlink written 7.0 years ago by Zev.Kronenberg11k

Also, how do you generate the Manhattan plots? Are you talking about Manhattan plots that you have generated by yourself, or plots generated by somebody else?

ADD REPLYlink written 7.0 years ago by Giovanni M Dall'Olio26k

I have edited my question...I hope this explains more clearly what I'd like to do.

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by Agatha340

When you mean peaks, these are the SNPs for which you get a significant p-value? And when you mean on both sides, you mean, +/- 3 SNP positions? And what do you mean by detect 2-3 annotated SNPs? If you get a peak (and you know its location), then you already can calculate +/- 3 positions... right?

ADD REPLYlink written 7.0 years ago by Arun2.3k

@Arun - yes I can detect the ones located near the genome wide significant ones because they are generally a few. I dont know what can I do with the other observed peaks.. Well some of the SNPs might be novel, which is why I would like to validate them somehow by using other already known ones located +/- 3 positions near the "interesting" SNP. Also I am not sure if this is the way to do it...

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by Agatha340

So, basically you have a list of positions on the genome (corresponding to the SNPs that are significant in your study), and you want to know if there are other SNPs in the proximity of these positions. Is this correct?

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by Giovanni M Dall'Olio26k

yes, for the genome wide significant ones; I don't know how to detect the peaks with a lower significance(still lower than 0.001) to distinguish them from the non relevant ones...setting thresholds is not a solution I think because it would return too many...

ADD REPLYlink modified 7.0 years ago • written 7.0 years ago by Agatha340

LD, haplotype analysis?

ADD REPLYlink written 7.0 years ago by zx87549.0k

I will do that after I will be able to identify my "peaks"

ADD REPLYlink written 7.0 years ago by Agatha340
gravatar for Arun
7.0 years ago by
Arun2.3k wrote:

Not quite sure if this answers your question entirely, but this is a start. You can do this in R with the help of the bioconductor package IRanges or even better GenomicRanges (it uses IRanges).

I'll generate some random data of SNPs that are significant. I'll assume you've 5 chromosomes and a total of 150 significant SNPs

# 1) dummy significant snp data:
chr <- rep(paste0("chr", 1:5), (1:5)*10)
pos <- unlist(lapply(1:5, function(x) sort(sample(1:1e4, 10*x))))
pval <- sample(seq(1e-5, 1e-3, length.out=1000), 150, replace=T)

# significant snps
sig.snps <- data.frame(chr, pos, pval)

# 2) dummy other snps data:
# no: of insignificant SNPs according to your threshold will be generated randomly.
pos <- lapply(1:5, function(x) sample(setdiff(1:1e4, sig.snps$pos[sig.snps$chr == paste0("chr", x)]), sample(1000:6000, 1)))
chr <- rep(paste0("chr", 1:5), sapply(pos, length))
pos <- unlist(pos)
pval <- sample(seq(.0011, 0.5, length.out=1e4), length(pos), replace=T)
insig.snps <- data.frame(chr, pos, pval)

Actual answer starts here: Now using GenomicRanges to get non-significant SNPs that lie close to each of the significant SNPs within a given threshold:

require(GenomicRanges) <- GRanges(Rle(sig.snps$chr), IRanges(sig.snps$pos, width=1)) <- GRanges(Rle(insig.snps$chr), IRanges(insig.snps$pos, width=1))

threshold <- 100 # find all insignificant snps for each significant SNP that lies +/- 100 bases apart

olaps <-,, maxgap=threshold, type="any")

# all insignificant snps that overlap with any significant snp within given threshold    
out <- insig.snps[subjectHits(olaps), ]

# contains a list of data.frames equal to the length of significant snps
# each element of the list contains all insignificant snps that overlap with that significant snp
out.l <- split(out, sig.snps$pos[queryHits(olaps)])

out.f <- ldply(out.l)

> head(out.f)
  .id  chr pos      pval
1  49 chr5  15 0.4576392
2  49 chr5  16 0.4798923
3  49 chr5  35 0.2895429
4  49 chr5 133 0.3823476
5  49 chr5  98 0.3736160
6  49 chr5 117 0.2939836

.id gives the position of significant SNP and the rest of the column are the corresponding insignificant SNPs that fall within your threshold size.

From here, you can filter based on pval as well, if you wish. I'll leave it for you to work it out.

ADD COMMENTlink written 7.0 years ago by Arun2.3k

@Arun This is a great start for what I have in mind. Thank you.

ADD REPLYlink written 7.0 years ago by Agatha340
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1100 users visited in the last hour