Perl/Python script: phased vcf to phased tped
2
0
4.2 years ago
Shicheng Guo ★ 9.2k

Hi All,

Can anyone share a Perl/Python script to convert a phased VCF to a phased tped?

Thanks.

Update: plink re-orders the alleles, so phase information is lost if plink is used anywhere in the data processing. Thanks for the explanation: the order in which the alleles appear in heterozygous genotype calls is usually determined by which allele is major/minor in the immediate dataset; this ordering will not vary between samples.

vcf ped phased • 2.3k views
3

2

This is the correct 'answer'. If you care about representing genotype phase in text, use VCF.

0

Yes. I think there should be some wheels outside.

0

Wheels? That word does not make sense in this context. Could you explain using different words, maybe?

0

The PED file is a white-space (space or tab) delimited file: the first six columns are mandatory:
Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Phenotype


PED files do not hold genotype information, phased or not. Are you sure you're asking the right question?

0

PED files do hold genotypes, see https://www.cog-genomics.org/plink2/formats#ped

0

There are two prevalent PED formats - the one used/generated by plink has genotype information after the first six columns. The subset of this file with the first six columns alone is used in other tools, such as GATK's PhaseByTransmission, etc, and is the more prevalent one for clinical genetics usage. PLINK calls this format the .fam file.
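To make the difference between the two layouts concrete, here is a minimal Python sketch that splits one whitespace-delimited line into the six mandatory columns plus whatever follows. The names `FamRecord` and `parse_ped_line` are made up for illustration; a six-column .fam line simply yields an empty genotype list.

```python
from typing import NamedTuple


class FamRecord(NamedTuple):
    """The six mandatory leading columns of a PED/.fam line."""
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str        # 1=male; 2=female; other=unknown
    phenotype: str


def parse_ped_line(line):
    """Return (FamRecord, trailing_columns) for one whitespace-delimited line."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected at least six columns")
    return FamRecord(*fields[:6]), fields[6:]


record, genotypes = parse_ped_line("FAM1 IND1 0 0 1 2 A G")
print(record.individual_id, genotypes)
```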

0

Oops! Got confused with the .fam files. Thanks for the info!

5
4.2 years ago

plink 1.9's core only handles .bed files. So --vcf causes a temporary .bed file to be generated, which does not contain any phase information. When a ped/tped is then exported from the .bed, the order in which the alleles appear in heterozygous genotype calls is usually determined by which allele is major/minor in the immediate dataset; this ordering will not vary between samples, and has nothing to do with the original phase status.
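A toy Python illustration of why a round trip through an unphased representation cannot recover phase. This is not plink's actual code: the function names are hypothetical, the intermediate form is simplified to an ALT-allele count, and the fixed (ALT, REF) order for heterozygotes is an assumption standing in for the major/minor rule described above.

```python
def to_unphased_code(gt):
    """Collapse a diploid, biallelic GT string such as '0|1' to an ALT-allele count."""
    first, second = gt.replace("/", "|").split("|")
    return int(first) + int(second)


def export_alleles(code, ref, alt):
    """Write an allele pair back out from the count.

    Heterozygotes always get the same order (assumed here to be ALT, REF),
    because the count no longer records which haplotype carried which allele.
    """
    return {0: (ref, ref), 1: (alt, ref), 2: (alt, alt)}[code]


for gt in ("0|1", "1|0"):
    print(gt, "->", export_alleles(to_unphased_code(gt), "A", "G"))
```

Both phased inputs collapse to the same count, so they are exported as the same allele pair.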

1

Moved this comment to an answer to make it clearer that there is incorrect advice in other answers.

0

Great, thanks Chris. Now I see. That means plink changes the order so that the allele coding is the same for every individual, based on the minor/major allele. Are you sure that vcftools --tped behaves the same way as you described? Thanks.

1

It's basically irrelevant what vcftools --tped does, because phase is undefined in the tped format. You're effectively inventing your own file format and can't count on any software support from anyone else; much better to just write software that understands VCF, if you have to deal with text.
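As a rough sketch of what "software that understands VCF" could look like for this use case, the following Python generator reads the phased GT field directly from uncompressed VCF text. The function name `parse_phased_genotypes` is hypothetical, and the sketch assumes tab-delimited records with a GT entry in the FORMAT column; calls that are not written with `|` are returned as `None` rather than guessed.

```python
def parse_phased_genotypes(vcf_lines):
    """Yield (chrom, pos, variant_id, {sample: (hap1_allele, hap2_allele)}).

    Only genotypes written with '|' are treated as phased; anything else
    (unphased '/', haploid calls) maps to None.
    """
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]          # sample names follow the FORMAT column
            continue
        chrom, pos, variant_id, ref, alt = fields[:5]
        alleles = [ref] + alt.split(",")  # index 0 is REF, 1.. are ALT alleles
        gt_index = fields[8].split(":").index("GT")
        haplotypes = {}
        for sample, entry in zip(samples, fields[9:]):
            gt = entry.split(":")[gt_index]
            if "|" not in gt:
                haplotypes[sample] = None
                continue
            haplotypes[sample] = tuple(
                alleles[int(a)] if a != "." else None for a in gt.split("|")
            )
        yield chrom, int(pos), variant_id, haplotypes
```

Used on an open file handle, each yielded dictionary keeps the left-to-right haplotype order exactly as written in the VCF, so downstream code never has to rely on an intermediate format for phase.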

0

Hi Chris, I think I will keep my post; I think it is correct. I hope you can give further suggestions. I tested it using 1000 Genomes data and used diff chr22.vcf.vcf.tped chr22.vcf.pl.tped to check the whole of chr22, and the two files are identical.

0

Could it be that just your example works by coincidence, but that the implementation (which chrchang523 obviously knows better than anyone else) does not guarantee phase information is preserved?
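One way to rule out coincidence is to compare the tped against the VCF itself rather than against another tped. The Python sketch below counts phased heterozygous calls whose allele order in the tped differs from the VCF. The function name `count_het_order_mismatches` is hypothetical, and the sketch assumes both inputs list the same variants and samples in the same order, with the tped laid out as chromosome, variant ID, cM position, bp position, then two alleles per sample.

```python
def count_het_order_mismatches(vcf_lines, tped_lines):
    """Return (n_het_calls, n_mismatches) for phased heterozygous genotypes."""
    records = (
        line.rstrip("\n").split("\t")
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    )
    n_het = n_mismatch = 0
    for vcf_fields, tped_line in zip(records, tped_lines):
        alleles = [vcf_fields[3]] + vcf_fields[4].split(",")
        gt_index = vcf_fields[8].split(":").index("GT")
        tped_alleles = tped_line.split()[4:]   # skip chrom, ID, cM, bp
        for i, entry in enumerate(vcf_fields[9:]):
            gt = entry.split(":")[gt_index]
            if "|" not in gt or "." in gt:
                continue                        # only phased, non-missing calls
            first, second = (alleles[int(x)] for x in gt.split("|"))
            if first == second:
                continue                        # homozygotes carry no order information
            n_het += 1
            if (tped_alleles[2 * i], tped_alleles[2 * i + 1]) != (first, second):
                n_mismatch += 1
    return n_het, n_mismatch
```

If phase were preserved, the mismatch count would be zero across all heterozygous calls; a fixed heterozygote order would instead be expected to mismatch on a substantial share of them.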

0

It should not be a coincidence: across the whole of chr22, the allele order (phase status) in the tped is identical to that in the VCF. Let's wait for chrchang523's further comments. We will get to the bottom of this soon.

0

Part 2 is the one that matters, and I have already explained why that can't possibly work and your test must be flawed. plink is open source, and it is straightforward to verify that (i) .bed does not store phase info and (ii) the implementation of --recode only uses (temporary) .bed as input.

If you do not edit your answer within 24 hours, I will delete it.

0

Okay. I respect your suggestion and have removed the plink part, keeping just the 'vcftools --tped' part.

0

okay, maybe we can delete the whole post.

0

If you aren't going to delete the post, you need to explicitly mention that the plink test failed, after debugging your test if need be. It's the vcftools result that is meaningless, and can be deleted with no loss to anyone, since .tped is a plink file format; that's why the vcftools flag is called --plink-tped.

0
4.2 years ago
Shicheng Guo ★ 9.2k

Done. Just sharing with you all: I ran a test on 1000 Genomes chr22.

1. Convert the phased VCF to tped with vcftools. Does the tped keep the phase status? Yes, it keeps the phase status.

vcftools --vcf test.vcf --plink-tped --out out

2. Use plink to create the tped. This failed: plink re-orders the alleles.

plink --vcf test.vcf --recode transpose --out out

2

This is incorrect, and you should mark it as such.

0

Hi Chris, can you show us some details of how plink is coded to convert vcf to tped? At least in my small test dataset, I found that the phase status is kept. It would be great if you could tell us some details about the plink implementation. Thanks.

Let's take tped as the example, since it is easy to show there.

1

Did it work?

I thought plink could take VCF as input via --vcf or --bcf?

1

This doesn't work; Shicheng's test was faulty.

0

Yes, I tested it and it works: plink can take --vcf and --bcf as input. But I want to keep the phase status and do some further analysis in R, which I hope can take a 'phased ped' as input. As Chris said, any file created by plink will have lost the phase status.

0

I have cleaned up this thread. It is good that everyone can share their opinion here, but I hope we can start fresh from now.