Question: filter out related individuals from a vcf file using Plink or Hail
gravatar for  DataFanatic
9 months ago by
DataFanatic100 wrote:

I have a data set for 11,000 individuals and would like to filter out related individuals and also based on a phenotype. I have a vcf file format and prefer to work with Plink, Hail or vcftools?

ADD COMMENTlink modified 9 months ago • written 9 months ago by DataFanatic100
gravatar for Kevin Blighe
9 months ago by
Kevin Blighe25k
USA / Europe / Brazil
Kevin Blighe25k wrote:

You have 11,000 individuals in a VCF file? - that's impressive. VCFTools does have a relatedness function that it calculates pairwise for all samples. It generally seems to work fine, as I tried it on data that contained monozygotic twins once; however, in the context of false-negatives and false-positives that plague NGS, not even biologicaly monozygotic twins look the same after their DNA has passed through a 'next gen' sequencer and analysis pipeline.

I would favour getting the data into PLINK Format, which provides analyses across all sorts of things including relatedness. To do that, first gzip and tab-index your VCF and then normalise it:

/Programs/bcftools-1.3.1/bcftools norm -Ou -m-any My.gz.vcf | /Programs/bcftools-1.3.1/bcftools norm -Ou --check-ref w -f human_g1k_v37.fasta | /Programs/bcftools-1.3.1/bcftools annotate -Ob -I +'%ID' > My.Norm.bcf ;

This will keep the VCF ID field as it is (assuming it's rs ID?). If you want to set it to something unique, use:

/Programs/bcftools-1.3.1/bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%POS:%REF:%ALT'

To read this into PLINK:

/Programs/plink1.90/plink --noweb --bcf My.Norm.bcf --keep-allele-order --indiv-sort file IDsort.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out MyPlinkData ;

It's critical when reading VCF/BCF into PLINK that you fix the order of the samples, because in virtually all cases PLINK will not order the samples as they appear in the VCF/BCF. You do this by specifying --indiv-sort file IDsort.list. IDsort.list contains a listing of FID and IID. It's easier to keep FID as 0 here:

0 16422

0 16427

0 16451

0 17412

0 17422

0 17413

If you have families in your data and want to do family-specific analysis, you'll have to reheader your VCF/BCF to have sample IDs as FID_IID, change --const-fid above to --id-delim _, and also modify the IDsort file:

Family1 16422

Family1 16427

Family2 16451

Family2 17412

Family3 17422

Family3 17413

Based on your ID sort file, you should also create a custom FAM file in which the samples are ordered the same. When you read data from VCF/BCF into PLINK, it won't know he gender or case control status.

In all analyses, you then just include your custom FAM as follows:

#Perform family TDT comparisons
/Programs/plink1.90/plink --noweb --bfile MyPlinkData --tdt --model mperm=500 --ci 0.95 --fam Custom.fam ;

Other people will hopefully have other suggestions.

ADD COMMENTlink modified 28 days ago • written 9 months ago by Kevin Blighe25k
gravatar for  DataFanatic
9 months ago by
DataFanatic100 wrote:

Kevin thank you very much, however; what I want to do is to: 1.Remove related individuals 2.Remove individuals with a specific phenotype for which I have a txt file. If I understand correctly your code changes vcf to plink files only right? How can I filter out individuals as in 1,2.? Also, I don't have families I would be grateful four your advice.

ADD COMMENTlink modified 9 months ago • written 9 months ago by DataFanatic100

Hello, it would have been better to have replied as a comment to my answer / Mejor hubiera sido escrito su comentario a mi respuesta encima.

You mentioned Plink in your question, so I presumed that you already knew how to use Plink. If you get your dataset in Plink format (with my code), you can then identify related individuals by following the procedures here:

For removing (or keeping) certain individuals, take a look here:

If you want to avoid Plink, just use BCFtools to remove individuals (if you already know who you want to keep/remove):

/Programs/bcftools-1.3.1/bcftools view -Ov --samples-file SamplesToKeep.list MyData.vcf.gz > MyData.Filtered.vcf

SamplesToKeep.list just contains a single column of IDs that you want to keep

ADD REPLYlink written 9 months ago by Kevin Blighe25k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1134 users visited in the last hour