Question: Filter out related individuals from a vcf file using Plink or Hail
gravatar for  DataFanatic
2.1 years ago by
DataFanatic140 wrote:

I have a data set for 11,000 individuals and would like to filter out related individuals and also based on a phenotype.

I have a vcf file format and prefer to work with Plink, Hail or vcftools?

plink hail vcf • 1.5k views
ADD COMMENTlink modified 11 months ago by zx87548.7k • written 2.1 years ago by DataFanatic140

Kevin thank you very much, however; what I want to do is to: 1.Remove related individuals 2.Remove individuals with a specific phenotype for which I have a txt file. If I understand correctly your code changes vcf to plink files only right? How can I filter out individuals as in 1,2.? Also, I don't have families I would be grateful four your advice.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by DataFanatic140

Hello, it would have been better to have replied as a comment to my answer / Mejor hubiera sido escrito su comentario a mi respuesta encima.

You mentioned Plink in your question, so I presumed that you already knew how to use Plink. If you get your dataset in Plink format (with my code), you can then identify related individuals by following the procedures here:

For removing (or keeping) certain individuals, take a look here:

If you want to avoid Plink, just use BCFtools to remove individuals (if you already know who you want to keep/remove):

/Programs/bcftools-1.3.1/bcftools view -Ov --samples-file SamplesToKeep.list MyData.vcf.gz > MyData.Filtered.vcf

SamplesToKeep.list just contains a single column of IDs that you want to keep

ADD REPLYlink written 2.1 years ago by Kevin Blighe52k
gravatar for Kevin Blighe
2.1 years ago by
Kevin Blighe52k
Kevin Blighe52k wrote:

You have 11,000 individuals in a VCF file? - that's impressive. VCFTools does have a relatedness function that it calculates pairwise for all samples. It generally seems to work fine, as I tried it on data that contained monozygotic twins once; however, in the context of false-negatives and false-positives that plague NGS, not even biologicaly monozygotic twins look the same after their DNA has passed through a 'next gen' sequencer and analysis pipeline.

I would favour getting the data into PLINK Format, which provides analyses across all sorts of things including relatedness. To do that, first gzip and tab-index your VCF and then normalise it:

#1st pipe, splits multi-allelic calls into separate variant calls
#2nd pipe, left-aligns indels and issues warnings when the REF base in your VCF does not match the base in the supplied FASTA reference genome
/Programs/bcftools-1.3.1/bcftools norm -m-any My.gz.vcf | /Programs/bcftools-1.3.1/bcftools norm -Ob --check-ref w -f human_g1k_v37.fasta > My.Norm.bcf ;

This will keep the VCF ID field as it is (assuming it's rs ID?). If you want to set it to something unique, use:

/Programs/bcftools-1.3.1/bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%POS:%REF:%ALT'

To read this into PLINK:

/Programs/plink1.90/plink --noweb --bcf My.Norm.bcf --keep-allele-order --indiv-sort file IDsort.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out MyPlinkData ;

It's critical when reading VCF/BCF into PLINK that you fix the order of the samples, because in virtually all cases PLINK will not order the samples as they appear in the VCF/BCF. You do this by specifying --indiv-sort file IDsort.list. IDsort.list contains a listing of FID and IID. It's easier to keep FID as 0 here:

0 16422

0 16427

0 16451

0 17412

0 17422

0 17413

If you have families in your data and want to do family-specific analysis, you'll have to reheader your VCF/BCF to have sample IDs as FID_IID, change --const-fid above to --id-delim _, and also modify the IDsort file:

Family1 16422

Family1 16427

Family2 16451

Family2 17412

Family3 17422

Family3 17413

Based on your ID sort file, you should also create a custom FAM file in which the samples are ordered the same. When you read data from VCF/BCF into PLINK, it won't know he gender or case control status.

In all analyses, you then just include your custom FAM as follows:

#Perform family TDT comparisons
/Programs/plink1.90/plink --noweb --bfile MyPlinkData --tdt --model mperm=500 --ci 0.95 --fam Custom.fam ;

Other people will hopefully have other suggestions.

ADD COMMENTlink modified 15 months ago • written 2.1 years ago by Kevin Blighe52k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 901 users visited in the last hour