How to add pedigree info into VCF
8.1 years ago
CrazyB ▴ 280

My VCF does not contain pedigree info, so PLINK does not run properly on it. VCFtools can convert VCF to .ped and .map, but my understanding is that it does not require a pedigree for the conversion and treats all individuals as unrelated (?). I googled around but did not find any tool that can add the pedigree info ("PLINK-grade" pedigree info). Any suggestions? Many thanks.

vcf ped pedigree • 4.0k views
8.1 years ago
Ram 43k

A VCF file stores variants and each individual's genotype at those variants. It does not store the relationships between individuals in the body, because repeating them on every record would be redundant, and I have not seen headers store this information.

Unless the header carries this information, you cannot read the pedigree off a VCF file. You should go back to the initial stages of data in the project, match the samples to the individuals, get their affected status, sex, etc., and generate the pedigree yourself.
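As a rough illustration of "generate the pedigree yourself", here is a minimal Python sketch that turns hand-curated sample records into the six PED columns (family, individual, father, mother, sex, phenotype). The `Sample` record, the function names and the example IDs are all made up for illustration; the coding (1/2 for male/female, 1/2 for unaffected/affected, 0 for unknown) follows the usual PED convention. It also checks that every sample name in the VCF has a record, since mismatched IDs are an easy mistake here.

```python
from typing import NamedTuple, Optional

class Sample(NamedTuple):
    """One hand-curated record (hypothetical structure, not a PLINK type)."""
    family: str
    sample_id: str
    father: Optional[str] = None
    mother: Optional[str] = None
    sex: Optional[str] = None        # "male", "female" or None if unknown
    affected: Optional[bool] = None  # None if phenotype is unknown

SEX_CODES = {"male": "1", "female": "2"}

def ped_row(s):
    """Format one record as a tab-separated six-column PED line."""
    sex = SEX_CODES.get((s.sex or "").lower(), "0")
    if s.affected is None:
        phen = "0"
    else:
        phen = "2" if s.affected else "1"
    return "\t".join(
        [s.family, s.sample_id, s.father or "0", s.mother or "0", sex, phen]
    )

def ped_rows(samples, vcf_sample_ids):
    """Return PED lines, refusing to continue if a VCF sample has no record."""
    known = {s.sample_id for s in samples}
    missing = [i for i in vcf_sample_ids if i not in known]
    if missing:
        raise ValueError("no pedigree record for: " + ", ".join(missing))
    return [ped_row(s) for s in samples]

if __name__ == "__main__":
    trio = [
        Sample("FAM1", "kid", "dad", "mom", "female", True),
        Sample("FAM1", "dad", sex="male"),
        Sample("FAM1", "mom", sex="female", affected=False),
    ]
    print("\n".join(ped_rows(trio, ["kid", "dad", "mom"])))
```

PLINK 1.9 documents --update-ids, --update-parents and --update-sex for attaching this kind of information after import; check the column layout each one expects before reusing a table like this.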


That isn't strictly correct: the VCF specification does describe the PEDIGREE header tag for storing inter-sample relationships. The RTG joint variant callers that use inter-sample relationships (e.g. tumor-normal calling, or germline calling with families or larger pedigrees) do output these headers. It can be very useful to check from a header the exact pedigree that was used during calling (in larger pedigrees it is fairly common to discover and correct pedigree errors). RTG includes the subcommands pedfilter and pedstats, which let you do simple conversion between the VCF representation of a pedigree and PED files, although I am not sure whether these meet the OP's need for use with PLINK.


Thank you for that piece of information. I have not seen a VCF store pedigree data, so this is new to me. Can you maybe show me the formatting used by the RTG callers for the pedigree info in the header? Also, is there any option for GATK or samtools to include pedigree info in VCF? I for one would not mind having that option.


For a run of rtg somatic, the information regarding the tumor and normal samples is represented like this (here, from a run on a DREAM challenge dataset):

##SAMPLE=<ID=synthetic.challenge.set2.normal,Genomes=synthetic.challenge.set2.normal,Mixture=1.0,Description="Original genome">
##SAMPLE=<ID=background.synth.challenge2.snvs.svs.tumorbackground,Genomes=synthetic.challenge.set2.normal;background.synth.challenge2.snvs.svs.tumorbackground,Mixture=0.20;0.80,Description="Original genome;Derived genome">
##PEDIGREE=<Derived=background.synth.challenge2.snvs.svs.tumorbackground,Original=synthetic.challenge.set2.normal>

When sample sex information is available (e.g. as used by our sex-aware variant calling), the sex is stored in the SAMPLE header. So, for something like an octet from the CEPH pedigree when called with rtg population it looks like:

##SAMPLE=<ID=NA12877-1,Sex=MALE>
##SAMPLE=<ID=NA12878,Sex=FEMALE>
##SAMPLE=<ID=NA12880-1,Sex=FEMALE>
##SAMPLE=<ID=NA12883,Sex=MALE>
##SAMPLE=<ID=NA12889,Sex=MALE>
##SAMPLE=<ID=NA12890,Sex=FEMALE>
##SAMPLE=<ID=NA12891-1,Sex=MALE>
##SAMPLE=<ID=NA12892-1,Sex=FEMALE>
##PEDIGREE=<Child=NA12880-1,Mother=NA12878,Father=NA12877-1>
##PEDIGREE=<Child=NA12883,Mother=NA12878,Father=NA12877-1>
##PEDIGREE=<Child=NA12877-1,Mother=NA12890,Father=NA12889>
##PEDIGREE=<Child=NA12878,Mother=NA12892-1,Father=NA12891-1>

And to convert to PED:

$ rtg pedfilter octet-ped.vcf.gz
# PED format pedigree
# fam-id        ind-id  pat-id  mat-id  sex     phen
0       NA12877-1       NA12889 NA12890 1       0
0       NA12878 NA12891-1       NA12892-1       2       0
0       NA12880-1       NA12877-1       NA12878 2       0
0       NA12883 NA12877-1       NA12878 1       0
0       NA12889 0       0       1       0
0       NA12890 0       0       2       0
0       NA12891-1       0       0       1       0
0       NA12892-1       0       0       2       0
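For anyone without RTG installed, the header-to-PED mapping shown above can be approximated in a few lines of Python. This is only a sketch of the idea, not how pedfilter is implemented: the function names are mine, the header parsing is naive (it splits on commas, so a quoted value containing a comma would break it), and it only understands the Child/Mother/Father form of PEDIGREE, so Derived/Original lines from a somatic run are ignored.

```python
import re

SEX_CODES = {"MALE": "1", "FEMALE": "2"}

def parse_struct(line):
    """Split '##TAG=<A=1,B=2>' into ('TAG', {'A': '1', 'B': '2'}).

    Returns None for any line that is not a structured header line.
    Naive: does not handle commas inside quoted values.
    """
    m = re.match(r"##(\w+)=<(.*)>\s*$", line)
    if m is None:
        return None
    fields = {}
    for item in m.group(2).split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
    return m.group(1), fields

def header_to_ped(header_lines, family="0"):
    """Build six-column PED rows from ##SAMPLE and ##PEDIGREE header lines."""
    sex, parents = {}, {}
    for line in header_lines:
        parsed = parse_struct(line)
        if parsed is None:
            continue
        tag, f = parsed
        if tag == "SAMPLE" and "ID" in f:
            sex[f["ID"]] = SEX_CODES.get(f.get("Sex", "").upper(), "0")
        elif tag == "PEDIGREE" and "Child" in f:
            parents[f["Child"]] = (f.get("Father", "0"), f.get("Mother", "0"))
    # Include parents that are named in PEDIGREE but have no SAMPLE line.
    ids = set(sex) | set(parents)
    for pat, mat in parents.values():
        ids.update(p for p in (pat, mat) if p != "0")
    rows = []
    for ind in sorted(ids):
        pat, mat = parents.get(ind, ("0", "0"))
        # Phenotype is not in the header, so it is written as 0 (unknown).
        rows.append("\t".join([family, ind, pat, mat, sex.get(ind, "0"), "0"]))
    return rows
```

Fed the octet header lines above, this should give the same eight rows as the listing, tab-separated and with family ID 0.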

and here is round-tripping to a minimal VCF header:

$ rtg pedfilter octet-ped.vcf.gz | rtg pedfilter --vcf -
##fileformat=VCFv4.1
##fileDate=20160328
##source=RTG Core 3.6.2 / Core 1d2e108 (2016-03-11)
##CL=pedfilter --vcf -
##RUN-ID=847e4106-e6c3-414a-b848-fb03880b5546
##SAMPLE=<ID=NA12877-1,Sex=MALE>
##SAMPLE=<ID=NA12878,Sex=FEMALE>
##SAMPLE=<ID=NA12880-1,Sex=FEMALE>
##SAMPLE=<ID=NA12883,Sex=MALE>
##SAMPLE=<ID=NA12889,Sex=MALE>
##SAMPLE=<ID=NA12890,Sex=FEMALE>
##SAMPLE=<ID=NA12891-1,Sex=MALE>
##SAMPLE=<ID=NA12892-1,Sex=FEMALE>
##PEDIGREE=<Child=NA12880-1,Mother=NA12878,Father=NA12877-1>
##PEDIGREE=<Child=NA12883,Mother=NA12878,Father=NA12877-1>
##PEDIGREE=<Child=NA12877-1,Mother=NA12890,Father=NA12889>
##PEDIGREE=<Child=NA12878,Mother=NA12892-1,Father=NA12891-1>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12877-1       NA12878 NA12880-1       NA12883 NA12889 NA12890 NA12891-1       NA12892-1
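The reverse direction (PED rows to header lines) is just as mechanical. A hedged Python sketch, again my own illustration rather than RTG's code: it emits only the ##SAMPLE and ##PEDIGREE lines, not a full header, and the order of the PEDIGREE lines follows the PED file rather than whatever order pedfilter uses.

```python
SEX_NAMES = {"1": "MALE", "2": "FEMALE"}

def ped_to_header(ped_lines):
    """Turn PED rows into ##SAMPLE / ##PEDIGREE header lines.

    Comment lines (starting with '#') and blank lines are skipped.
    Only the first five PED columns are used.
    """
    samples, pedigrees = [], []
    for line in ped_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        _fam, ind, pat, mat, sex = line.split()[:5]
        name = SEX_NAMES.get(sex)
        if name:
            samples.append(f"##SAMPLE=<ID={ind},Sex={name}>")
        else:
            samples.append(f"##SAMPLE=<ID={ind}>")
        # Write a PEDIGREE line only when at least one parent is known.
        fields = [f"Child={ind}"]
        if mat != "0":
            fields.append(f"Mother={mat}")
        if pat != "0":
            fields.append(f"Father={pat}")
        if len(fields) > 1:
            pedigrees.append("##PEDIGREE=<" + ",".join(fields) + ">")
    return samples + pedigrees
```

These lines would still have to be spliced into a real VCF header (after ##fileformat, before #CHROM) by whatever tool you use to rewrite headers.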

Sorry, I don't know whether GATK or samtools support this type of thing. (I would just say use our callers instead :-))

Just now some googling indicated that for the upcoming VCF 4.3 spec the format may change slightly, but it doesn't look like a biggie: https://github.com/samtools/hts-specs/issues/96


Thank you. I think GATK and samtools will support this if it becomes part of the specs. Until then, I guess I'll have to store pedigrees in PED files :)
