Selection detection software
2
3
Entering edit mode
7.7 years ago
eyb ▴ 230

I have a set 680k SNP for 141 individuals. Can anyone suggest a software to detect selection?

At the moment I am looking at gpat++ and rehh. Both of them require to specify a region, chromosome and position e.g. 1-1000 for gpat++ and chromosome for rehh (only ~1500 SNPs per chromosome in example). Are there any programs which allow to calculate e.g. iHS for a whole chromosome with large number of SNPs?

selection SNP • 5.2k views
ADD COMMENT
3
Entering edit mode
7.6 years ago
JMR ▴ 160

There's a new, very user friendly implementation of most EHH test that you might be interested in.

Here's the link: https://github.com/szpiech/selscan

From the manual:

selscan -- a program to calculate EHH-based scans for positive selection in 
  genomes

Copyright (C) 2014  Zachary A Szpiech

selscan currently implements EHH, iHS, and XP-EHH, and requires phased data. 
It should be run separately for each chromosome and population (or population 
pair for XP-EHH).  selscan is 'dumb' with respect ancestral/derived coding and
simply expects haplotype data to be coded 0/1.  Unstandardized iHS scores are 
thus reported as ln(iHH1/iHH0) based on the coding you have provided.
ADD COMMENT
0
Entering edit mode

Yea, I've already did this.

ADD REPLY
0
Entering edit mode

Hi, I also used selscan to do some EHH based test on my dataset. iHS scores look OK but using norm the program in selscan to standardize the values, they were different compared to the values using my own script. I also used the rehh program and iHS scores look different. Any thoughts?

ADD REPLY
1
Entering edit mode
7.7 years ago

Greetings,

GPAT's iHS and XPEHH can handle whole chromosomes without issue. I have used it on phase 3 1kg data.

INFO: optional: r,region     -- argument: a tabix compliant genomic range: seqid or seqid:start-end

so you can pass region as --region chr1 or --region chr1:1-100 or don't pass anything and it will go chromosome by chromosome.

ADD COMMENT
0
Entering edit mode

Hi! Thanks alot for the help. I am having troubles running iHS from GPAT++. My VCF looks like this:

##fileformat=VCFv4.1
##filedate=20140925
##source="b4.jar (r1274)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated correlation between most probable ALT dose and true ALT dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated correlation between estimated ALT dose [P(RA) + 2*P(AA)] and true ALT dose">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    0_6203511131_R06C02    0_6203503155_R01C01    0_6203503155_R01C02
1    82154    rs4477212    A    .    .    PASS    AR2=0;DR2=0    GT:DS:GP    0|0:0:1    0|0:0:1    0|0:0:1
1    752566    rs3094315    A    G    .    PASS    AR2=1;DR2=1;AF=0.19    GT:DS:GP    0|0:0:1,0,0    0|1:1:0,1,0    0|1:1:0,1,0
1    752721    rs3131972    G    A    .    PASS    AR2=1;DR2=1;AF=0.211    GT:DS:GP    0|1:1:0,1,0    0|1:1:0,1,0    0|1:1:0,1,0
1    798959    rs11240777    G    A    .    PASS    AR2=1;DR2=1;AF=0.254    GT:DS:GP    0|1:1:0,1,0    0|0:0:1,0,0    1|1:2:0,0,1
1    800007    rs6681049    G    A    .    PASS    AR2=1;DR2=1;AF=0.032    GT:DS:GP    0|1:1:0,1,0    0|0:0:1,0,0    1|0:1:0,1,0
1    838555    rs4970383    C    A    .    PASS    AR2=1;DR2=1;AF=0.387    GT:DS:GP    0|0:0:1,0,0    0|1:1:0,1,0    0|0:0:1,0,0
1    846808    rs4475691    G    A    .    PASS    AR2=1;DR2=1;AF=0.215    GT:DS:GP    0|0:0:1,0,0    0|0:0:1,0,0    0|1:1:0,1,0
1    854250    rs7537756    A    G    .    PASS    AR2=1;DR2=1;AF=0.246    GT:DS:GP    0|0:0:1,0,0    0|0:0:1,0,0    0|1:1:0,1,0
1    861808    rs13302982    G    A    .    PASS    AR2=1;DR2=1;AF=0.127    GT:DS:GP    0|1:1:0,1,0    0|1:1:0,1,0    1|0:1:0,1,0
1    873558    rs1110052    A    C    .    PASS    AR2=1;DR2=1;AF=0.44    GT:DS:GP    0|1:1:0,1,0    0|1:1:0,1,0 

I've tried to run it:

./iHS --target 0,1,2,3,4,5 --file ~/Desktop/Komi_Ijma/ki_phased.vcf --region chr1

But I got error message:

INFO: target ids: 0,1,2,3,4,5
INFO: file: /home/laba/Desktop/Komi_Ijma/ki_phased.vcf
INFO: set seqid region to : chr1 FATAL: failed to specify genotype likelihood format : PL or GL

I then specified likelihood format:

./iHS --target 0,1,2,3,4,5 --file ~/Desktop/Komi_Ijma/ki_phased.vcf --region chr1 --type GP

INFO: there are 6 individuals in the target
INFO: target ids: 0,1,2,3,4,5
INFO: file: /home/laba/Desktop/Komi_Ijma/ki_phased.vcf
INFO: set seqid region to : chr1 cannot setRegion on a non-tabix indexed file

The question is, what am I missing? How can I index my file using tabix?

ADD REPLY
0
Entering edit mode

Have a look at the tabix manual. Tabix is an indexing scheme that allows you to query regions easily. You need to:

  1. compress your vcf: bgzip your.vcf; and
  2. tabix -p vcf your.vcf
ADD REPLY
0
Entering edit mode

I have run tabix on my vcf, but I am getting a new error now:

./iHS --target 0,1,2,3,4,5 --file ~/Desktop/Komi_Ijma/ki_phased.vcf.gz.tbi --region chr4 --type GP
INFO: there are 6 individuals in the target
INFO: target ids: 0,1,2,3,4,5
INFO: file: /home/laba/Desktop/Komi_Ijma/ki_phased.vcf.gz.tbi
INFO: set seqid region to : chr4
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 0)
Aborted

What is wrong with an input file?

ADD REPLY
0
Entering edit mode

That appears to be a bug in my code. If you would be kind enough to send the smallest piece of your VCF file that generates the error I will fix it today.

ADD REPLY
0
Entering edit mode

I extracted first 50000 lines from my vcf file, but I was not able to reproduce the bug. The terminal output:

INFO: there are 6 individuals in the target
INFO: target ids: 0,1,2,3,4,5
INFO: file: /home/laba/Desktop/tabix-0.2.6/ki_zev.vcf.gz.tbi
INFO: set seqid region to : chr4
error: no VCF header

.gz.tbi file: https://www.sendspace.com/file/4b50vo

.gz file: https://www.sendspace.com/file/m3dpsr

I am not quite sure how to subset to reproduce. Can you advice?

ADD REPLY
0
Entering edit mode

I made some changes that should fix the problem. You will need to update GPAT.

ADD REPLY
0
Entering edit mode

eyb did that work for you?

ADD REPLY
0
Entering edit mode

Thanks for the fix. I did not try it yet. I will report on monday, when I will get back to work =)

ADD REPLY
0
Entering edit mode

Hi! I've tried to run updated version today, but I get no VCF header error.

./iHS --target 0,1,2,3,4,5,6 --file ~/Desktop/vcflib/tabixpp/ki_phased.vcf.gz.tbi --region chr1 --type GP

INFO: there are 7 individuals in the target
INFO: target ids: 0,1,2,3,4,5,6
INFO: file: /home/laba/Desktop/vcflib/tabixpp/ki_phased.vcf.gz.tbi
INFO: set seqid region to : chr1
error: no VCF header

Header of my VCF file looks like this:

##fileformat=VCFv4.1
##filedate=20140925
##source="b4.jar (r1274)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated correlation between most probable ALT dose and true ALT dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated correlation between estimated ALT dose [P(RA) + 2*P(AA)] and true ALT dose">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    0_6203511131_R06C02

What does it missing?

ADD REPLY
0
Entering edit mode

I've also tried to run it on .vcf file and I got the following error message:

./iHS --target 0,1,3 --file ~/Desktop/vcflib/tabixpp/ki_phased.vcf --type GP --region chr1
INFO: there are 3 individuals in the target
INFO: target ids: 0,1,3
INFO: file: /home/laba/Desktop/vcflib/tabixpp/ki_phased.vcf
INFO: set seqid region to : chr1
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 0)
Aborted

Should I iHS on vcf or tbi file?

ADD REPLY
0
Entering edit mode

Looks like the same error as before

ADD REPLY
0
Entering edit mode

Did you update and re-make the code?

ADD REPLY
1
Entering edit mode

So to summarize,

Program works fine with sample files provided, but when I try to run it on my data I get segmentation fault error. No matter if I run it on vcf or gz file. I also tried to run the program on reduced file, 50000 lines and then 20 columns, but it didnt help as I am still gettin segmentation fault error.

ADD REPLY
0
Entering edit mode

I understand the frustration. I'm more than happy to fix this error if you can give me a file which will generate the error. My email is listed on my profile. You would be doing me a favor helping me find the bug.

The code works on a wide variety of VCF files, but obviously I have overlooked something.

--Zev

ADD REPLY
0
Entering edit mode

Hi there!

Your email is hidden, but I uploaded files which generate the same error as my original dataset (segmentation fault). One contains first 50k lines, and another is reduced to 20 columns.

50k file: https://www.sendspace.com/file/fpi85q

20 columns file: https://www.sendspace.com/file/h2n99z

PS Let me know if you need an original.

ADD REPLY
0
Entering edit mode

Yes. I redownloaded the whole thing and ran make. I also checked it with make update.

ADD REPLY
0
Entering edit mode

So this morning I gave it another shot:

  1. I have updated the code
  2. I compressed my phased file with bgzip
  3. I've ran tabix on the compressed file.
  4. I've ran iHS on the compressed file and I am getting:

    ./iHS --target 0,1,2,3,4,5,6,7,8,9,10 --file ~/Desktop/ki_sel/ki_phased.vcf.gz --region chr1 --type GP
    INFO: there are 11 individuals in the target
    INFO: target ids: 0,1,2,3,4,5,6,7,8,9,10
    INFO: file: /home/laba/Desktop/ki_sel/ki_phased.vcf.gz
    INFO: set seqid region to : chr1
    FATAL: unable to set region
    

    tbi file is in the same directory as .gz file

  5. If I unpack compressed vcf file and run iHS on vcf I get:

    cannot setRegion on a non-tabix indexed file OR segmentation fault
    

    error if I change my working directory from vcflib/bin to the location of my vcf files.

Can you please help?

ADD REPLY
0
Entering edit mode

Also I am not sure if I did it yesterday, but if I am providing non-existing file it gives

terminate called after throwing an instance of 'std::out_of_range'
what():  basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 0)
Aborted

error

ADD REPLY
0
Entering edit mode

I am also getting a segmentation fault error if I run iHS on gz file with tbi file lying in the same directory.

ADD REPLY
0
Entering edit mode

Meanwhile iHS works fine with samples provided in vcflib/samples directory.

ADD REPLY

Login before adding your answer.

Traffic: 1311 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6