Forum: A Database Of Signatures Of Selection In The 1000 Genomes Dataset
5.0 years ago, Giovanni M Dall'Olio (London, UK) wrote:

The 1000 Genomes Selection Browser is a database of Signatures of Selection in the Human Genome, based on the 1000 Genomes Phase I data. It is freely accessible at http://hsb.upf.edu/

The browser, based on a custom installation of the UCSC Genome Browser, makes it easy to navigate the genome and visualize candidate regions for selection in any of the African, European, or Asian populations. The data can also be downloaded for further analysis.

Our browser includes a total of 17 tests for selection. For each test of selection, we provide a raw score, plus a ranked score which compares each position to the rest of the genome.

  1. Tajima’s D (Tajima, 1989): Comparison of estimates of the number of segregating sites and the mean pairwise difference between sequences.

  2. CLR (Nielsen et al., 2005): Multilocus composite likelihood ratio test.

  3. Fay and Wu’s H (Fay & Wu, 2000): Comparison of the number of derived segregating sites at low and high frequencies and the number of variants at intermediate frequencies.

  4. Fu and Li’s F* (Fu and Li, 1993): Comparison of the number of singleton mutations and the mean pairwise difference between sequences.

  5. Fu and Li’s D* (Fu and Li, 1993): Comparison of the number of singleton mutations and the total number of nucleotide variants.

  6. R2 (Ramos-Onsins and Rozas, 2002): Comparison of the difference between the number of singletons per sequence and the average number of nucleotide differences.

  7. XP-EHH (Sabeti et al., 2007): Cross-population extended haplotype homozygosity.

  8. Delta iHH (Voight et al., 2006, Grossman et al., 2010): difference between two integrated haplotype homozygosity scores.

  9. iHS (Voight et al., 2006): log ratio between two integrated haplotype homozygosity scores.

  10. EHH average (Sabeti et al., 2002): Extended haplotype homozygosity; weighted average for all core haplotypes of the position at which the haplotype homozygosity decays to <=0.5.

  11. Wall’s B (Wall, 2000): Counts the number of pairs of adjacent segregating sites that are congruent (that is, the subset of the data consisting of the two sites contains only two different haplotypes).

  12. Wall’s Q (Wall, 2000): Adds the number of partitions (two disjoint subsets whose union is the set of individuals in the sample) induced by congruent pairs to Wall’s B.

  13. Fu’s Fs (Fu, 1997): Based on Ewens’ sampling distribution, taking into account the number of different haplotypes in the sample.

  14. Dh (Nei, 1987): Summary statistic based on the number of different haplotypes in the sample.

  15. Fst (Weir and Cockerham, 1984): global and pairwise.

  16. Delta DAF: difference in derived allele frequencies between two populations.

  17. XP-CLR (Chen et al., 2010): Multilocus allele frequency differentiation between two populations.
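
As an illustration of how the frequency-spectrum statistics in this list are built, here is a minimal Python sketch of Tajima's D (item 1), following the standard formulation from Tajima (1989). It is not the browser's own pipeline; the function names and the 0/1 haplotype-string input format are assumptions made for this example.

```python
import math


def diversity_summaries(haplotypes):
    """Return (n, S, pi) for equal-length haplotype strings of '0'/'1'.

    S  = number of segregating sites
    pi = mean number of pairwise differences between sequences
    The '0' (ancestral) / '1' (derived) coding is an assumption of this sketch.
    """
    n = len(haplotypes)
    seg_sites = 0
    pi = 0.0
    for column in zip(*haplotypes):        # iterate site by site
        k = column.count("1")              # derived-allele count at this site
        if 0 < k < n:                      # site is polymorphic in the sample
            seg_sites += 1
            pi += 2.0 * k * (n - k) / (n * (n - 1))
    return n, seg_sites, pi


def tajimas_d(n, seg_sites, pi):
    """Tajima's D from sample size n, segregating sites S and pi.

    Returns NaN when the statistic is undefined (n < 4 or no segregating sites).
    """
    if n < 4 or seg_sites == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)
```

With 10 sequences, 16 segregating sites and a mean pairwise difference of about 3.89, `tajimas_d(10, 16, 3.888889)` gives roughly -1.45. A negative value means an excess of low-frequency variants relative to the neutral expectation.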

The database has been published in the NAR Database issue 2014:

Pybus M, Dall'olio GM, Luisi P, Uzkudun M, Carreño-Torres A, Pavlidis P,
Laayouni H, Bertranpetit J, Engelken J. 1000 Genomes Selection Browser 1.0: a
genome browser dedicated to signatures of natural selection in modern humans.
Nucleic Acids Res. 2014 Jan 1;42(1):D903-9. doi: 10.1093/nar/gkt1188. Epub 2013
Nov 25. PubMed PMID: 24275494. Available at http://nar.oxfordjournals.org/content/42/D1/D903.short

For completeness, we also link to dbPSHP, a database of curated publications on positive selection in different human populations, which also presents the results of 15 tests for positive selection.


Cool resource and nice post describing it!

written 5.0 years ago by Obi Griffith

5.0 years ago, Dan Gaston (Canada) wrote:

Awesome! I can already think of some interesting things I'd like to test out with this data!


I am glad that you liked it :-) Feel free to ask any questions you may have, either to me or on the website.

written 5.0 years ago by Giovanni M Dall'Olio

5.0 years ago, Rubal7 wrote:

Great resource!

4.2 years ago, yfwangbm (Hong Kong) wrote:

Hi Giovanni, great job. I checked the database and tried to download the iHS scores, but a few things are not clear to me. 1. Why are all the scores here positive? 2. Are these unstandardized or normalized iHS scores, and how is the normalization done? In addition, I tried to find the corresponding genetic distances for the 1000 Genomes variants. I saw that this was mentioned in the paper, so how do I add the genetic map?

iHS scores are usually reported as absolute values, hence they are all positive. Whether the raw statistic comes out negative or positive also depends on the equation used; in the original Voight et al. (2006) paper, for instance, negative values indicated selection on derived alleles. Also, unstandardised iHS values are largely useless: they should always be corrected for allele frequency, and it is the standardised iHS values that are usually reported.

written 4.1 years ago by JMR140

From the supplemental information: "Raw scores from ΔiHH, iHS and XP-EHH were standardized in bins of derived allele frequency (step size of 0.05) using the respective genome-wide distribution for each statistic. To capture signal from ancestral SNPs that have hitchhiked to high frequency along with a selected derived variant, absolute standardized iHS scores were chosen as the end result (24)."
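
The standardization described in that passage can be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the authors' code: the function name, the `(daf, raw_score)` input format, the left-closed bin edges, and the use of the population standard deviation are all choices made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_in_daf_bins(records, step=0.05, absolute=False):
    """Standardize raw scores within bins of derived allele frequency (DAF).

    records  : list of (daf, raw_score) pairs, with daf in [0, 1]
    step     : bin width (0.05 in the quoted passage)
    absolute : if True, return |z|, as described for iHS
    Returns the standardized scores in the same order as the input.
    """
    n_bins = round(1 / step)

    def bin_index(daf):
        # Small epsilon guards against float artifacts such as 0.15 / 0.05 < 3;
        # daf == 1.0 is folded into the last bin.
        return min(int(daf / step + 1e-9), n_bins - 1)

    # Collect the genome-wide distribution of raw scores for each DAF bin.
    scores_by_bin = defaultdict(list)
    for daf, score in records:
        scores_by_bin[bin_index(daf)].append(score)

    stats = {b: (mean(v), pstdev(v)) for b, v in scores_by_bin.items()}

    standardized = []
    for daf, score in records:
        mu, sd = stats[bin_index(daf)]
        z = (score - mu) / sd if sd > 0 else 0.0   # degenerate bins get 0
        standardized.append(abs(z) if absolute else z)
    return standardized
```

For example, `standardize_in_daf_bins([(0.01, 1.0), (0.02, 3.0)])` places both SNPs in the first bin (mean 2, standard deviation 1) and returns `[-1.0, 1.0]`; with `absolute=True` both become `1.0`, which is why the downloadable iHS values are all positive.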

written 4.1 years ago by JMR140

3.4 years ago, Zev.Kronenberg (United States) wrote:

I'm having trouble dumping CEU vs CHB XP-EHH. Are the tables limited to a single chromosome?

Also, I really dig the boosting.


Hi Zev,
are you downloading the file from the Table Browser? I think there is a limit on the number of rows that can be downloaded from there. For example, when I download the whole file, I see an error message towards the end saying "Reached output limit of 100000 data values".

The best way to get the data is to download them from this folder: http://hsb.upf.edu/hsb_data/positive_selection_NAR2013. The files contain both the scores and the -log10 rank-based p-values.

written 3.4 years ago by Giovanni M Dall'Olio

perfect. thank you very much.

written 3.4 years ago by Zev.Kronenberg

One last question: are these data on hg19? I ask because I know the original XP-EHH results were on hg18.

written 3.4 years ago by Zev.Kronenberg

Yes, everything is hg19. Feel free to ask as many questions as you need ;-)

written 3.4 years ago by Giovanni M Dall'Olio

Well, since you offered: the "p-values" for CEU-YRI are not bounded by zero and one. Are they Z-scores? I'm trying to get the joint probability of XP-EHH and dN/dS.

Thanks.

written 3.4 years ago by Zev.Kronenberg

Hi Zev,

the p-values are simply the -log10 of the fraction of SNPs with a higher score. For example, in R, using dplyr:

> library(dplyr)
> xpehh = read.table('XPEHH_CEU_vs_CHB.whole_genome.pvalues', header=T, stringsAsFactors=F, colClasses=c('character','integer', 'numeric', 'numeric', 'numeric'))
> xpehh %>% 
    arrange(desc(score)) %>%   # Sort SNPs by XPEHH score (descending)
    mutate(
       rank=row_number(),           # rank of the SNP (1 = highest score)
       rank.perc=rank/n(),          # rank divided by the total number of SNPs
       rank.log=-log10(rank.perc)   # reported "p-value": -log10 of that fraction
    )
         snpID chromosome position    score   pvalue  rank    rank.perc rank.log
         (chr)      (int)    (dbl)    (dbl)    (dbl) (int)        (dbl)    (dbl)
1  rs116972803         15 48377866 8.018417 7.133374     1 7.355728e-08 7.133374
2   rs77517214         15 48377764 8.018099 6.832344     2 1.471146e-07 6.832344
3   rs75870250         15 48376241 7.942488 6.656253     3 2.206718e-07 6.656253
4  rs150960840         15 48379200 7.925932 6.531314     4 2.942291e-07 6.531314

So, I guess that yes, our p-values could actually be called Z-scores, sorry about the confusion :-)
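
For readers not using R, the same rank-based transformation can be sketched in plain Python. The function names are made up for this example, and ties are broken by input order, which is an assumption about how the published files were generated.

```python
import math


def neg_log10_rank_fraction(scores):
    """Return -log10(rank / n) for each score, where rank 1 is the highest score.

    Ties keep their input order, mirroring arrange(desc(score)) followed by
    row_number() in the R snippet above.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    result = [0.0] * n
    for rank, i in enumerate(order, start=1):
        result[i] = -math.log10(rank / n)
    return result


def rank_fraction(value):
    """Invert the transformation: recover rank / n from a reported value."""
    return 10.0 ** (-value)
```

Applied to the table above, `rank_fraction(7.133374)` gives about 7.36e-08, matching the `rank.perc` of the top SNP. Because the reported value is a -log10 of an empirical rank fraction, it ranges from 0 upward rather than between 0 and 1.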

modified 3.4 years ago • written 3.4 years ago by Giovanni M Dall'Olio

Giovanni M Dall'Olio, I've computed genome-wide iHS for the 1000 Genomes Project phase III data. Any interest in adding the data to the browser? It only took thousands of CPU hours ;-).

written 3.1 years ago by Zev.Kronenberg

Hi Zev, I think it would be amazing! Let me check with my colleagues at UPF to see how it can be done.

written 3.1 years ago by Giovanni M Dall'Olio

Hey there, any word on the phase III iHS values?

written 20 months ago by jglassbrook0

3.2 years ago, Krisr (United States) wrote:

Hi,

Great resource, thanks!

I have a quick question. Is there a way to determine an arbitrary FDR, say 2%, 2.5%, or 5% (instead of the default 1% in the downloads), for the boosting results? Or are there other values or methods to relax the significance thresholds derived from these datasets? For example, if the CEU Complete boosting threshold at 1% FDR is 0.40199, what would it be at 2.5% FDR? Would this require rerunning the analysis, or can a threshold be determined post hoc? Thanks again for this resource!

3.4 years ago, Pierre (Spain) wrote:

Glad that you dig the boosting.
About your question, I just tried it and could visualize all chromosomes. What problem do you face exactly? Is it with track visualization or tables?

Cheers


The viewing is fine.  I'm trying to export the genome-wide XPEHH for CEU_VS_CHB.  Every time I download the table it only has chr1. 

written 3.4 years ago by Zev.Kronenberg

2.7 years ago, mwhart wrote:

Hi, Giovanni. Thanks, this is a great tool. I'm using it to look at some unlinked SNPs in genes with epistatic interactions. One of the SNPs is rs2906999; in hg19 it should be at chr7:76069811. But in several tests focused on the interval around that SNP (iHS, Fst), that coordinate does not appear among the sites with a test statistic. That site is polymorphic in the 1000 Genomes phase 1 data and has a high minor allele frequency, so I expected it to show up in the test results (the other SNPs I am analyzing do appear in the test results).

I am having trouble figuring out where that one missing SNP might be. Can you help me figure that out? I realize that some rare polymorphisms were filtered out in developing the database, but these are common polymorphisms. Thanks for any help or suggestions (and I hope you are still monitoring this Biostars thread).

Cheers!
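
One way to check for a SNP like this in the downloaded score files is to scan for both its rsID and its coordinate, plus a small window around it. The Python sketch below assumes a whitespace-delimited file with the header columns `snpID chromosome position score pvalue`, as in the R output earlier in this thread; the function names and the `window` parameter are hypothetical, and other files in the download folder may use a different layout.

```python
def strip_chr(name):
    """Normalize chromosome names so that 'chr7' and '7' compare equal."""
    name = str(name)
    return name[3:] if name.lower().startswith("chr") else name


def find_snps(lines, chromosome, position, window=0, snp_id=None):
    """Return rows within `window` bp of a coordinate, or matching `snp_id`.

    lines : iterable of text lines; the first line is the header
            (assumed columns: snpID chromosome position score pvalue)
    """
    rows = iter(lines)
    header = next(rows).split()
    hits = []
    for line in rows:
        fields = line.split()
        if len(fields) != len(header):
            continue                      # skip blank or malformed lines
        record = dict(zip(header, fields))
        same_chromosome = strip_chr(record["chromosome"]) == strip_chr(chromosome)
        distance = abs(int(float(record["position"])) - position)
        nearby = same_chromosome and distance <= window
        if nearby or (snp_id is not None and record["snpID"] == snp_id):
            hits.append(record)
    return hits
```

Passing an open file handle as `lines`, with `chromosome="chr7"`, `position=76069811`, `snp_id="rs2906999"` and a window of a few hundred base pairs, lists the neighbouring scored sites. An empty result with `window=0` but hits at a larger window would suggest that the site itself carries no score in that file, rather than the region being absent.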
