Locating IBD candidates with just VCF files
16 months ago

Hello!

I have WGS VCF files for 11 individuals. They are related within three or so generations, and I'm trying to find IBD segments between them (stretches of DNA (haplotypes) that are identical because the individuals are related, 10^5-10^7 bp in length). My current method is to define an IBD candidate as an unbroken series of adjacent SNPs in the VCF files at which all (or some chosen subset) of my 11 individuals share at least one allele.

I am getting strange results though. If I start by including only, say, 7 of the individuals, I get some number of IBD candidates. If I then add individuals, the number of IBD candidates should only go down, but it goes up sometimes... nonsensical to be sure. Any idea of what's going wrong here?

My VCFs aren't phased, sadly, which is why I must use the "at least one allele in common" condition.

Very thankful for any thoughts or advice!

Joel
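
For readers following along, here is a minimal sketch of the "at least one allele in common" scan described in the question. It is not the asker's actual code (that was never posted); the in-memory layout (`Site`, one dict of allele sets per SNP) and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical layout: one entry per SNP site, in genomic order.
# Each entry maps a sample name to the alleles of its unphased genotype.
Site = Dict[str, Set[str]]


def shares_allele(site: Site, samples: Sequence[str]) -> bool:
    """True if at least one allele is carried by every chosen sample."""
    common = set.intersection(*(site[s] for s in samples))
    return bool(common)


def candidate_runs(sites: List[Site], samples: Sequence[str]) -> List[Tuple[int, int]]:
    """Return maximal runs (first_index, last_index) of adjacent sharing sites."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, site in enumerate(sites):
        if shares_allele(site, samples):
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            runs.append((start, i - 1))  # the run ended at the previous site
            start = None
    if start is not None:
        runs.append((start, len(sites) - 1))
    return runs


# Toy data, invented for illustration only.
sites = [
    {"A": {"0", "1"}, "B": {"0"}, "C": {"0"}},
    {"A": {"1"}, "B": {"0", "1"}, "C": {"0"}},  # C shares no allele with A here
    {"A": {"0"}, "B": {"0"}, "C": {"0", "1"}},
]
print(candidate_runs(sites, ["A", "B"]))
print(candidate_runs(sites, ["A", "B", "C"]))
```

With this definition, adding an individual can only remove sharing sites, but a removed site in the middle of a run splits it in two. In the toy data, A and B give one run, while A, B and C give two shorter ones. So the count of candidates can go up even when nothing is wrong; whether that explains the numbers in the question can't be told from the thread.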

Relatedness Genotypes IBD Haplotypes VCF • 1.3k views

Hard to say without knowing exactly what code you've used. Have you tried using some kind of off-the-shelf IBD algorithm like hap-IBD? I know you said your data is unphased, but it might be worth phasing it first (if it's an organism which can be phased).

I made an inventory of various IBD software, yeah, but many of the tools require phasing first, which I tried to avoid, or they seemed shady (some ten-year-old Master's thesis project and the like). The data is human, so it can probably/maybe be phased with imputation, but mmm... I'd need stretches of 10^5-10^6 bp phased. Is that even possible without a parents+child trio setup? I have mostly cousins or "worse".

Yes, I would strongly recommend phasing; you are likely to get much more accurate IBD detection results. You don't need trios, just a reference panel. You could use something like the TOPMed, Sanger, or Michigan imputation server to phase the data. What kind of data do you have originally - is it genotype data?

I have genotyped variants in standard VCF format, yeah. Will those tools also report confidence of the phasing? I tried phasing myself once, with BEAGLE I think, and the phasing results were essentially random every time I ran it, so 0 confidence...

Uhm I just realised my data is sensitive, I can't send it to online tools for analysis :(

Did you use a reference panel? If you didn't, then yes, trying to phase 11 individuals without one is going to give you pretty rubbish results; you need at least a few hundred samples to get anything decent. TOPMed is pretty secure (https://topmedimpute.readthedocs.io/en/stable/data-sensitivity/), but I appreciate that there are rules against sending some samples, e.g. clinical ones.

Clinical samples are exactly what we have, I'm afraid. As I recall I did use a reference panel, from 1kg. Where would I find experts on phasing, for further questions? I've taken enough of your time :)

I'm not quite an expert, but I've done a fair amount of phasing in my time, so ask away. I would suggest using the 1000 Genomes 30x panel and Beagle 5. Let me know if you have issues.
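
Since the data can't leave the site, it may help to note that Beagle runs locally as a Java jar. Below is a rough sketch of assembling a phasing command with a reference panel. The `gt=`, `ref=`, `map=` and `out=` arguments are Beagle 5 parameters; the jar name, all file names, and the helper `beagle_phase_cmd` are made up for illustration, and the command is only built here, not run.

```python
from typing import List


def beagle_phase_cmd(jar: str, target_vcf: str, ref_vcf: str,
                     genetic_map: str, out_prefix: str) -> List[str]:
    """Build (but do not run) a Beagle command that phases target_vcf
    against a reference panel, using key=value style arguments."""
    return [
        "java", "-jar", jar,
        f"gt={target_vcf}",     # unphased target genotypes
        f"ref={ref_vcf}",       # phased reference panel
        f"map={genetic_map}",   # genetic map for the same genome build
        f"out={out_prefix}",    # output file prefix
    ]


# All paths below are placeholders, not real files.
cmd = beagle_phase_cmd(
    jar="beagle.jar",
    target_vcf="family.chr20.vcf.gz",
    ref_vcf="ref_panel.chr20.vcf.gz",
    genetic_map="genetic_map.chr20.map",
    out_prefix="family.chr20.phased",
)
print(" ".join(cmd))
# To actually run it locally: subprocess.run(cmd, check=True)
```

Beagle is normally run one chromosome at a time, so in practice something like this would sit inside a loop over chromosomes.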

8 weeks ago

For those who Google this thread: in the end I used BEAGLE for phasing, with the 1000 Genomes reference panel, and then hap-ibd to find IBD segments. I needed home-made scripts to transform the hap-ibd output to answer my original question (e.g., are there haplotype segments shared between individuals X, Y, and Z, but also NOT shared with individuals I or J). There was some variability in the output, but large enough true IBD segments were almost always detected.
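
The home-made scripts themselves weren't posted, so the following is only a guess at how such a post-processing step could look. It assumes each hap-ibd output line is tab-separated with the columns sample1, hap1, sample2, hap2, chromosome, start, end, cM length (check this against the hap-ibd README for your version), and it ignores the haplotype indices, which is a simplification. All function names are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Sequence, Tuple

Interval = Tuple[int, int]                 # closed interval in bp
Segment = Tuple[str, str, str, int, int]   # sample1, sample2, chrom, start, end


def parse_line(line: str) -> Segment:
    """Parse one line, assuming the column layout described above."""
    f = line.rstrip("\n").split("\t")
    return (f[0], f[2], f[4], int(f[5]), int(f[6]))


def merge(intervals: Iterable[Interval]) -> List[Interval]:
    """Sort intervals and merge the overlapping ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def pair_intervals(segs: Sequence[Segment], a: str, b: str, chrom: str) -> List[Interval]:
    """All segments reported between samples a and b on one chromosome."""
    return merge((s, e) for x, y, c, s, e in segs if c == chrom and {x, y} == {a, b})


def intersect(xs: List[Interval], ys: List[Interval]) -> List[Interval]:
    out = []
    for x_start, x_end in xs:
        for y_start, y_end in ys:
            lo, hi = max(x_start, y_start), min(x_end, y_end)
            if lo <= hi:
                out.append((lo, hi))
    return merge(out)


def subtract(xs: List[Interval], ys: List[Interval]) -> List[Interval]:
    """Remove every part of xs that is covered by ys."""
    out = []
    for start, end in xs:
        cur = start
        for y_start, y_end in merge(ys):
            if y_end < cur or y_start > end:
                continue
            if y_start > cur:
                out.append((cur, y_start - 1))
            cur = max(cur, y_end + 1)
        if cur <= end:
            out.append((cur, end))
    return out


def shared_only(segs: Sequence[Segment], include: Sequence[str],
                exclude: Sequence[str], chrom: str) -> List[Interval]:
    """Regions IBD between every pair in `include`, minus regions where
    any included sample is IBD with any excluded sample."""
    shared = None
    for a, b in combinations(include, 2):
        ivs = pair_intervals(segs, a, b, chrom)
        shared = ivs if shared is None else intersect(shared, ivs)
    shared = shared or []
    for inc in include:
        for exc in exclude:
            shared = subtract(shared, pair_intervals(segs, inc, exc, chrom))
    return shared


# Invented example segments, not real data.
segments = [
    ("X", "Y", "chr1", 100, 900),
    ("X", "Z", "chr1", 200, 800),
    ("Y", "Z", "chr1", 150, 850),
    ("X", "I", "chr1", 500, 600),
]
print(shared_only(segments, ["X", "Y", "Z"], ["I"], "chr1"))
```

Because the haplotype indices are dropped, this treats "X shares with Y" and "X shares with Z" at the same position as the same shared haplotype, which isn't guaranteed; a stricter version would track which of the two haplotypes each segment sits on.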
