You have 11,000 individuals in a VCF file? - that's impressive. VCFTools does have a relatedness function that it calculates pairwise for all samples. It generally seems to work fine, as I tried it on data that contained monozygotic twins once; however, in the context of false-negatives and false-positives that plague NGS, not even biologicaly monozygotic twins look the same after their DNA has passed through a 'next gen' sequencer and analysis pipeline.
I would favour getting the data into PLINK Format, which provides analyses across all sorts of things including relatedness. To do that, first gzip and tab-index your VCF and then normalise it:
/Programs/bcftools-1.3.1/bcftools norm -Ou -m-any My.gz.vcf | /Programs/bcftools-1.3.1/bcftools norm -Ou --check-ref w -f human_g1k_v37.fasta | /Programs/bcftools-1.3.1/bcftools annotate -Ob -I +'%ID' > My.Norm.bcf ;
This will keep the VCF ID field as it is (assuming it's rs ID?). If you want to set it to something unique, use:
/Programs/bcftools-1.3.1/bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%POS:%REF:%ALT'
To read this into PLINK:
/Programs/plink1.90/plink --noweb --bcf My.Norm.bcf --keep-allele-order --indiv-sort file IDsort.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out MyPlinkData ;
It's critical when reading VCF/BCF into PLINK that you fix the order of the samples, because in virtually all cases PLINK will not order the samples as they appear in the VCF/BCF. You do this by specifying
--indiv-sort file IDsort.list. IDsort.list contains a listing of FID and IID. It's easier to keep FID as 0 here:
If you have families in your data and want to do family-specific analysis, you'll have to reheader your VCF/BCF to have sample IDs as FID_IID, change
--const-fid above to
--id-delim _, and also modify the IDsort file:
Based on your ID sort file, you should also create a custom FAM file in which the samples are ordered the same. When you read data from VCF/BCF into PLINK, it won't know he gender or case control status.
In all analyses, you then just include your custom FAM as follows:
#Perform family TDT comparisons /Programs/plink1.90/plink --noweb --bfile MyPlinkData --tdt --model mperm=500 --ci 0.95 --fam Custom.fam ;
Other people will hopefully have other suggestions.
Kevin thank you very much, however; what I want to do is to: 1.Remove related individuals 2.Remove individuals with a specific phenotype for which I have a txt file. If I understand correctly your code changes vcf to plink files only right? How can I filter out individuals as in 1,2.? Also, I don't have families I would be grateful four your advice.