Question: How to phase a WGS dataset preserving the indels within it?
4.8 years ago by
Shab86 wrote:

Hi all,

I have a WGS dataset in which indels were also called. I would now like to turn it into a reference dataset for the local population I am working on. Before I can use it as a reference panel for imputing exome data for samples from this population, I need to phase this WGS dataset.

The problem is that SHAPEIT and MaCH can't handle indels in the input files, so indels usually have to be removed before phasing. My question is: how do I phase the WGS dataset while preserving the indels within it?

Any help is greatly appreciated!

sequencing phasing snp indel genome • 1.6k views
4.8 years ago by
planet earth
piet wrote:

Insertions and deletions are a concept that only matters when you compare two sequences. Technically, you cannot have indels within a single sequence string. You may have an annotation on a sequence which tells you that a particular region is an insertion (e.g. not present in other sequences from the same species).

You may be referring to gaps in your sequence. Gaps are usually written as '-'. You should squeeze them out if an application does not allow them. For example, it is forbidden to submit sequences with gaps to GenBank.


Thanks for your reply. But if I remove them, how would I be able to impute them in my genotyped samples? I would lose the indels that were specifically called earlier. And since I wouldn't be able to impute them in my samples, I couldn't use them in any downstream analysis either.

written 4.8 years ago by Shab86

Nucleic acid molecules do not comprise any gap residues. Thus a sequence representing a real molecule must not have gaps. Make a copy of your gapped sequence before you delete the gaps.

written 4.8 years ago by piet

Piet, I assume that your explanation doesn't concern diploid organisms? Because for phasing, you are effectively comparing two sequences, one per allele. It's indeed a bit annoying that you have to remove indels for phasing, but I understand it's an additional complexity the makers of the tool would like to avoid. What I would consider (and this is a bad workaround, which will likely come back and hurt you unexpectedly) is to replace each indel variant with an artificial SNP variant. You record the positions where you did this nasty trick and put the real indels back afterwards. I'm not sure what your phasing algorithm is based on. Doesn't hurt to try? It might just screw up everything, but in that case we're smarter the next time a similar problem arises.
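A minimal sketch of that substitution trick, assuming a plain-text, tab-separated VCF with biallelic records and at most one record per position at indel sites. The function names `mask_indels` and `restore_indels` and the placeholder alleles A/C are made up for illustration, and whether the phased haplotypes around the masked sites are trustworthy is exactly the open question above.

```python
from collections import Counter

# Hypothetical placeholder alleles used while an indel is disguised as a SNP.
PLACEHOLDER_REF = "A"
PLACEHOLDER_ALT = "C"


def mask_indels(vcf_lines):
    """Replace indel REF/ALT with placeholder SNP alleles.

    Returns (masked_lines, lookup), where lookup maps (CHROM, POS) to the
    original (REF, ALT) so the indels can be put back after phasing.
    Genotype columns are left untouched.
    """
    # Count records per site so ambiguous restores can be refused up front.
    site_counts = Counter(
        tuple(line.split("\t")[:2])
        for line in vcf_lines
        if not line.startswith("#")
    )
    masked, lookup = [], {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):  # header lines pass through unchanged
            masked.append(line)
            continue
        fields = line.split("\t")
        chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
        if "," in alt:
            raise ValueError(f"multiallelic record at {chrom}:{pos}; split it first")
        if len(ref) != len(alt):  # simple indel test for biallelic records
            if site_counts[(chrom, pos)] > 1:
                raise ValueError(f"several records at {chrom}:{pos}; cannot restore safely")
            lookup[(chrom, pos)] = (ref, alt)
            fields[3], fields[4] = PLACEHOLDER_REF, PLACEHOLDER_ALT
        masked.append("\t".join(fields))
    return masked, lookup


def restore_indels(phased_lines, lookup):
    """Put the original indel alleles back into the phased records."""
    restored = []
    for line in phased_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            restored.append(line)
            continue
        fields = line.split("\t")
        key = (fields[0], fields[1])
        if key in lookup:
            fields[3], fields[4] = lookup[key]
        restored.append("\t".join(fields))
    return restored
```

The masked lines would go to the phasing tool and its output back through `restore_indels`. This only works if the tool keeps CHROM and POS unchanged and does not swap the REF and ALT alleles, which is an assumption worth checking on a small test region first.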

written 4.8 years ago by WouterDeCoster

That's an idea, WouterDeCoster! I know it may interfere with shapeit2's haplotype estimation. I will try this and see whether I get something meaningful when I phase them. But it's strange that shapeit2 still refuses to take indels when even the latest 1kg reference dataset includes them.

written 4.8 years ago by Shab86